Multiple imputation: a review of practical and theoretical findings. (English) Zbl 1397.62052

Summary: Multiple imputation is a straightforward method for handling missing data in a principled fashion. This paper presents an overview of multiple imputation, including important theoretical results and their practical implications for generating and using multiple imputations. A review of strategies for generating imputations follows, including recent developments in flexible joint modeling and sequential regression/chained equations/fully conditional specification approaches. Finally, we compare and contrast different methods for generating imputations on a range of criteria before identifying promising avenues for future research.

MSC:

62D05 Sampling theory, sample surveys
62F15 Bayesian inference

Software:

BayesDA; MICE; mi
