Nonparametric imputation by data depth. (English) Zbl 1437.62071

Summary: We present a single imputation method for missing values which borrows the idea of data depth – a measure of centrality defined for an arbitrary point of a space with respect to a probability distribution or data cloud. The method consists in iteratively maximizing the depth of each observation with missing values, and can be employed with any properly defined statistical depth function. In each iteration, imputation reduces to the optimization of quadratic, linear, or quasiconcave functions, which are solved analytically, by linear programming, or by the Nelder-Mead method, respectively. As it accounts for the underlying data topology, the procedure is distribution free, allows imputation close to the data geometry, can make predictions in situations where local imputation (\(k\)-nearest neighbors, random forest) cannot, and has attractive robustness and asymptotic properties under elliptical symmetry. It is shown that a special case – when using the Mahalanobis depth – has a direct connection to well-known methods for the multivariate normal model, such as iterated regression and regularized PCA. The methodology is extended to multiple imputation for data stemming from an elliptically symmetric distribution. Simulation and real data studies show good results compared with existing popular alternatives. The method has been implemented as an R-package.
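The Mahalanobis special case mentioned in the summary can be illustrated with a minimal sketch. It is not the authors' R package, and the function names `mahalanobis_depth` and `impute_mahalanobis` are hypothetical. The sketch assumes the depth \(D(x) = 1/(1 + (x-\mu)^\top \Sigma^{-1} (x-\mu))\) with \(\mu\) and \(\Sigma\) estimated by the sample mean and sample covariance of the currently completed data. Under that assumption, maximizing the depth over the missing coordinates of a row is a quadratic problem whose closed-form solution is the regression of the missing coordinates on the observed ones, so the iteration coincides with iterated regression.

```python
import numpy as np


def mahalanobis_depth(x, mu, cov):
    """Mahalanobis depth of point x with respect to centre mu and scatter cov."""
    d = x - mu
    return 1.0 / (1.0 + d @ np.linalg.solve(cov, d))


def impute_mahalanobis(X, max_iter=100, tol=1e-8):
    """Single imputation by iteratively maximizing the Mahalanobis depth.

    Assumption of this sketch: mu and cov are plain moment estimates
    recomputed from the completed data at every outer iteration.
    """
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    Y = X.copy()

    # Start from column-mean imputation.
    col_means = np.nanmean(X, axis=0)
    Y[miss] = np.take(col_means, np.where(miss)[1])

    rows_with_missing = np.where(miss.any(axis=1))[0]
    for _ in range(max_iter):
        Y_old = Y.copy()
        mu = Y.mean(axis=0)
        cov = np.cov(Y, rowvar=False)

        for i in rows_with_missing:
            m = miss[i]   # missing coordinates of row i
            o = ~m        # observed coordinates of row i
            if not o.any():
                # Nothing observed: the deepest point is the centre itself.
                Y[i, m] = mu[m]
                continue
            # Depth maximizer given the observed part:
            # mu_m + cov_mo cov_oo^{-1} (x_o - mu_o).
            coef = np.linalg.solve(cov[np.ix_(o, o)], Y[i, o] - mu[o])
            Y[i, m] = mu[m] + cov[np.ix_(m, o)] @ coef

        # Stop once the imputed values no longer change.
        if np.max(np.abs(Y - Y_old)) < tol:
            break
    return Y
```

Calling `impute_mahalanobis(X)` on a numeric array in which missing entries are coded as `np.nan` returns a completed copy and leaves the observed entries unchanged. Other depth functions named in the summary would replace the closed-form step with linear programming or a Nelder-Mead search, which this sketch does not attempt.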

MSC:

62D10 Missing data
90C05 Linear programming
62-04 Software, source code, etc. for problems pertaining to statistics
90C25 Convex programming

Software:

R; missForest; impute
