
Multiply imputing missing values in data sets with mixed measurement scales using a sequence of generalised linear models. (English) Zbl 1468.62113

Summary: Multiple imputation is a commonly used approach to dealing with missing values. In this approach, an imputer repeatedly imputes the missing values by taking draws from the posterior predictive distribution of the missing values conditional on the observed values, and releases the completed data sets to analysts. With each completed data set the analyst performs the analysis of interest, treating the data as if they were fully observed. The results of these analyses are then combined using standard combining rules, allowing the analyst to make appropriate inferences that take into account the uncertainty due to the missing data. In order to preserve the statistical properties present in the data, the imputer must use a plausible distribution to generate the imputed values. In data sets containing variables with different measurement scales, e.g. some categorical and some continuous variables, this is a challenging problem. A method is proposed to multiply impute missing values in such data sets by modelling the joint distribution of the variables in the data through a sequence of generalised linear models, and data augmentation methods are used to draw imputations from a proper posterior distribution using Markov chain Monte Carlo (MCMC). The performance of the proposed method is illustrated using simulation studies and a data set taken from a breastfeeding study.
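
The "standard combining rules" mentioned in the summary are usually taken to be Rubin's rules for a scalar estimand. The following is a minimal sketch of those rules, not code from the paper: the function name `combine_rubin` and its interface are assumptions made for illustration, and the inputs are the point estimates and their estimated variances obtained from each of the m completed data sets.

```python
import math
from statistics import mean, variance


def combine_rubin(estimates, variances):
    """Combine m completed-data analyses of a scalar estimand (Rubin's rules).

    Hypothetical helper for illustration; not taken from the paper.
    estimates: point estimates, one per completed data set
    variances: estimated variances of those point estimates
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and matching lengths")

    q_bar = mean(estimates)   # pooled point estimate
    u_bar = mean(variances)   # average within-imputation variance
    b = variance(estimates)   # between-imputation variance (divisor m - 1)

    # Total variance: within-imputation part plus inflated between-imputation part.
    t = u_bar + (1 + 1 / m) * b

    # Large-sample degrees of freedom for the reference t distribution.
    if b == 0:
        df = math.inf         # no between-imputation variability
    else:
        df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, t, df


if __name__ == "__main__":
    # Made-up numbers, used only to show the call.
    q, t, df = combine_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
    print(q, t, df)
```

An interval estimate would then be formed from the pooled estimate, the square root of the total variance, and a t quantile with the returned degrees of freedom.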

MSC:

62-08 Computational methods for problems pertaining to statistics
62D05 Sampling theory, sample surveys
62F15 Bayesian inference
62P10 Applications of statistics to biology and medical sciences; meta analysis

Software:

BayesDA; MICE

References:

[1] Albert, J. H.; Chib, S., Bayesian analysis of binary and polychotomous response data, J. Amer. Statist. Assoc., 88, 422, 669-679, (1993) · Zbl 0774.62031
[2] Azur, M. J.; Stuart, E. A.; Frangakis, C.; Leaf, P. J., Multiple imputation by chained equations: what is it and how does it work?, Int. J. Methods Psychiatric Res., 20, 1, 40-49, (2011)
[3] Bartlett, J. W.; Seaman, S. R.; White, I. R.; Carpenter, J. R., Multiple imputation of covariates by fully conditional specification: accommodating the substantive model, arXiv Preprint arXiv:1210.6799, (2012)
[4] Bernhardt, P. W.; Wang, H. J.; Zhang, D., Flexible modeling of survival data with covariates subject to detection limits via multiple imputation, Comput. Statist. Data Anal., 69, 81-91, (2014) · Zbl 1471.62028
[5] Box, G. E.; Tiao, G. C., Bayesian inference in statistical analysis, vol 40, (2011), John Wiley & Sons
[6] Chantry, C. J.; Howard, C. R.; Auinger, P., Full breastfeeding duration and associated decrease in respiratory tract infection in US children, Pediatrics, 117, 2, 425-432, (2006)
[7] Chen, M.-H.; Ibrahim, J. G., Maximum likelihood methods for cure rate models with missing covariates, Biometrics, 57, 1, 43-52, (2001) · Zbl 1209.62022
[8] Consentino, F.; Claeskens, G., Order selection tests with multiply imputed data, Comput. Statist. Data Anal., 54, 10, 2284-2295, (2010) · Zbl 1284.62044
[9] D’Agostino, R. B.; Rubin, D. B., Estimating and using propensity scores with partially missing data, J. Amer. Statist. Assoc., 95, 451, 749-759, (2000)
[10] De Leon, A. R.; Carrière, K., General mixed data model: extension of general location and grouped continuous models, Canad. J. Statist., 35, 4, 533-548, (2007) · Zbl 1143.62323
[11] Gelman, A.; Carlin, J. B.; Stern, H. S.; Rubin, D. B., Bayesian data analysis, (2004), Chapman & Hall/CRC · Zbl 1039.62018
[12] Gelman, A.; Speed, T., Characterizing a joint probability distribution by conditionals, J. R. Stat. Soc. Ser. B Stat. Methodol., 55, 1, 185-188, (1993) · Zbl 0780.62013
[13] Gelman, A.; Van Mechelen, I.; Verbeke, G.; Heitjan, D.; Meulders, M., Multiple imputation for model checking: completed-data plots with missing and latent data, Biometrics, 61, 1, 74-85, (2005) · Zbl 1077.62091
[14] Gilks, W. R.; Wild, P., Adaptive rejection sampling for Gibbs sampling, Appl. Stat., 337-348, (1992) · Zbl 0825.62407
[15] Goldstein, H.; Carpenter, J.; Kenward, M. G.; Levin, K. A., Multilevel models with multivariate mixed response types, Stat. Model., 9, 3, 173-197, (2009) · Zbl 07257700
[16] Hapfelmeier, A.; Ulm, K., Variable selection by random forests using data with missing values, Comput. Statist. Data Anal., 80, 129-139, (2014) · Zbl 1506.62075
[17] Ibrahim, J. G.; Chen, M.-H.; Lipsitz, S. R., Bayesian methods for generalized linear models with covariates missing at random, Canad. J. Statist., 30, 1, 55-78, (2002) · Zbl 0999.62021
[18] Ibrahim, J. G.; Chen, M.-H.; Lipsitz, S. R.; Herring, A. H., Missing-data methods for generalized linear models: A comparative review, J. Amer. Statist. Assoc., 100, 469, 332-346, (2005) · Zbl 1117.62360
[19] Ibrahim, J. G.; Lipsitz, S. R.; Chen, M.-H., Missing covariates in generalized linear models when the missing data mechanism is non-ignorable, J. R. Stat. Soc. Ser. B Stat. Methodol., 61, 173-190, (1999) · Zbl 0917.62060
[20] Li, F.; Yu, Y.; Rubin, D. B., Imputing missing data by fully conditional models: some cautionary examples and guidelines, Duke University Department of Statistical Science Discussion Paper, pp. 11-24, (2012)
[21] Little, R. J., Regression with missing X’s: A review, J. Amer. Statist. Assoc., 87, 420, 1227-1237, (1992)
[22] Little, R. J.; Rubin, D. B., Statistical analysis with missing data, (2002), Wiley-Interscience · Zbl 1011.62004
[23] Liu, J.; Gelman, A.; Hill, J.; Su, Y.-S., On the stationary distribution of iterative imputations, arXiv Preprint arXiv:1012.2902, (2010) · Zbl 1285.62058
[24] Mitra, R.; Dunson, D., Two-level stochastic search variable selection in glms with missing predictors, Int. J. Biostat., 6, 1, (2010)
[25] Mitra, R.; Reiter, J. P., Estimating propensity scores with missing covariate data using general location mixture models, Stat. Med., 30, 627-641, (2011)
[26] Raghunathan, T. E.; Lepkowski, J. M.; Hoewyk, J. V.; Solenberger, P., A multivariate technique for multiply imputing missing values using a sequence of regression models, Surv. Methodol., 27, 1, 85-95, (2001)
[27] Rashid, S.; Mitra, R.; Steele, R., Using mixtures of t densities to make inferences in the presence of missing data with a small number of multiply imputed data sets, Comput. Statist. Data Anal., (2015) · Zbl 1468.62167
[28] Rubin, D. B., Multiple imputation for nonresponse in surveys, (1987), Wiley-Interscience · Zbl 1070.62007
[29] Rubin, D. B., Statistical disclosure limitation, J. Off. Stat., 9, 2, 461-468, (1993)
[30] Rubin, D., Multiple imputation after 18+ years, J. Amer. Statist. Assoc., 91, 434, 473-489, (1996) · Zbl 0869.62014
[31] Rubin, D. B., Nested multiple imputation of NMES via partially incompatible MCMC, Stat. Neerl., 57, 1, 3-18, (2003)
[32] Rubin, D. B.; Barnard, J., Small-sample degrees of freedom with multiple imputation, Biometrika, 86, 4, 948-955, (1999) · Zbl 0942.62025
[33] Rubin, D.; Schenker, N., Multiple imputation for interval estimation from simple random samples with ignorable nonresponse, J. Amer. Statist. Assoc., 81, 366-374, (1986) · Zbl 0615.62011
[34] Schafer, J. L., Analysis of incomplete multivariate data, (1997), Chapman & Hall/CRC · Zbl 0997.62510
[35] Tanner, M. A.; Wong, W. H., The calculation of posterior distributions by data augmentation, J. Amer. Statist. Assoc., 82, 398, 528-540, (1987) · Zbl 0619.62029
[36] Van Buuren, S., Mice: multivariate imputation by chained equations, J. Stat. Softw., 45, 3, (2011)
[37] Van Buuren, S.; Boshuizen, H.; Knook, D., Multiple imputation of missing blood pressure covariates in survival analysis, Stat. Med., 18, 6, 681-694, (1999)
[38] Wallace, M. L.; Anderson, S. J.; Mazumdar, S., A stochastic multiple imputation algorithm for missing covariate data in tree-structured survival analysis, Stat. Med., 29, 29, 3004-3016, (2010)
[39] White, I. R.; Royston, P.; Wood, A. M., Multiple imputation using chained equations: issues and guidance for practice, Stat. Med., 30, 4, 377-399, (2011)
[40] Xu, L.; Zhang, J., Multiple imputation method for the semiparametric accelerated failure time mixture cure model, Comput. Statist. Data Anal., 54, 7, 1808-1816, (2010) · Zbl 1284.62634
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.