
Nearest neighbours in least-squares data imputation algorithms with different missing patterns. (English) Zbl 1431.62044

Summary: Methods for imputation of missing data in the so-called least-squares approximation approach, a non-parametric computationally efficient multidimensional technique, are experimentally compared. Contributions are made to each of the three components of the experiment setting: (a) algorithms to be compared, (b) data generation, and (c) patterns of missing data. Specifically, “global” methods for least-squares data imputation are reviewed and extensions to them are proposed based on the nearest neighbours (NN) approach. A conventional generator of mixtures of Gaussian distributions is theoretically analysed and, then, modified to scale clusters differently. Patterns of missing data are defined in terms of rows and columns according to three different mechanisms that are referred to as Random missings, Restricted random missings, and Merged database. It appears that NN-based versions almost always outperform their global counterparts. With the Random missings pattern, the winner is always the authors’ two-stage method INI, which combines global and local imputation algorithms.
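
To make the contrast between "global" and NN-based least-squares imputation concrete, here is a minimal sketch under stated assumptions. It is not the authors' INI method or any algorithm taken from the paper. The function names `svd_impute` and `nn_impute`, the parameters `rank` and `k`, the column-mean starting values, and the distance (mean squared difference over jointly observed columns) are all illustrative choices. The global version alternates a low-rank least-squares (SVD) fit with re-imputation of the missing cells; the local version repeats the same fit on the target row stacked with its `k` nearest rows.

```python
import numpy as np


def svd_impute(X, rank=1, n_iter=100, tol=1e-6):
    """Global sketch: alternate a rank-`rank` SVD fit with re-imputation."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means of observed entries (0 if a column is empty).
    counts = (~missing).sum(axis=0)
    sums = np.where(missing, 0.0, X).sum(axis=0)
    col_means = sums / np.maximum(counts, 1)
    filled = np.where(missing, col_means, X)
    for _ in range(n_iter):
        # Best rank-`rank` least-squares approximation of the completed matrix.
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep observed values, overwrite only the missing cells.
        new = np.where(missing, approx, X)
        change = np.linalg.norm(new - filled)
        filled = new
        if change <= tol * max(1.0, np.linalg.norm(filled)):
            break
    return filled


def nn_impute(X, k=5, rank=1):
    """Local sketch: run svd_impute on each incomplete row plus its k nearest rows."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    out = X.copy()
    for i in np.flatnonzero(missing.any(axis=1)):
        # Mean squared difference over columns observed in both rows.
        both = ~missing & ~missing[i]
        diff = np.where(both, X - X[i], 0.0)
        shared = both.sum(axis=1)
        dist = np.where(shared > 0,
                        (diff ** 2).sum(axis=1) / np.maximum(shared, 1),
                        np.inf)
        dist[i] = np.inf  # a row is not its own neighbour
        neighbours = np.argsort(dist)[:k]
        # Target row first, then its neighbours; fit the model locally.
        local = np.vstack([X[i], X[neighbours]])
        completed = svd_impute(local, rank=rank)
        out[i, missing[i]] = completed[0, missing[i]]
    return out
```

With a matrix `X` that uses `np.nan` for missing cells, `svd_impute(X)` and `nn_impute(X, k=5)` both return completed copies that leave the observed entries unchanged. The local variant fits the low-rank model only to rows resembling the target, which is the general idea behind NN-based extensions of global methods.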

MSC:

62D10 Missing data
62H25 Factor analysis and principal components; correspondence analysis
62F10 Point estimation
62-08 Computational methods for problems pertaining to statistics

Software:

impute; MULTIMIX

References:

[1] Aha, D., Editorial, Artif. Intell. Rev., 11, 1-6 (1997)
[2] Atkeson, C. G.; Moore, A. W.; Schaal, S., Locally weighted learning, Artif. Intell. Rev., 11, 11-73 (1997)
[3] Benzecri, J. P., Analyse des Donnees (1973), Dunod: Dunod Paris · Zbl 0297.62038
[4] Berry, M.; Dumais, S.; Landauer, T.; O’Brien, G., Using linear algebra for intelligent information retrieval, SIAM Rev., 37, 573-595 (1995) · Zbl 0842.68026
[5] Christoffersson, A., The one component model with incomplete data, Ph.D. Thesis, Uppsala University (1970)
[6] Davies, P.; Smith, P., Model Quality Reports in Business Statistics (1999), ONS: ONS UK
[7] Dempster, A. P.; Laird, N. M.; Rubin, D. B., Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Statist. Soc., 39, 1-38 (1977) · Zbl 0364.62022
[8] EM based imputation software: http://www.stat.psu.edu/jls/misoftwa.html, http://methcenter.psu.edu/EMCOV.html
[9] Everitt, B. S.; Hand, D. J., Finite Mixture Distributions (1981), Chapman & Hall: Chapman & Hall London · Zbl 0466.62018
[10] Gabriel, K. R.; Zamir, S., Lower rank approximation of matrices by least squares with any choice of weights, Technometrics, 21, 489-498 (1979) · Zbl 0471.62004
[11] Generation of Gaussian mixture distributed data, NETLAB neural network software, http://www.ncrg.aston.ac.uk/netlab
[12] Golub, G. H.; Van Loan, C. F., Matrix Computations (1986), Johns Hopkins University Press
[13] Grung, B.; Manne, R., Missing values in principal component analysis, Chemometr. Intell. Lab. System, 42, 125-139 (1998)
[14] Hastie, T.; Tibshirani, R.; Sherlock, G.; Eisen, M.; Brown, P.; Botstein, D., Imputing missing data for gene expression arrays, Technical Report, Division of Biostatistics, Stanford University (1999)
[15] Heiser, W. J., Convergent computation by iterative majorization: theory and applications in multidimensional analysis, (Krzanowski, W. J., Recent Advances in Descriptive Multivariate Analysis (1995), Oxford University Press: Oxford University Press Oxford), 157-189
[16] Holter, N. S.; Maritan, A.; Cieplak, M.; Fedoroff, N. V.; Banavar, J. R., Dynamic modeling of gene expression data, Proc. Natl. Acad. Sci., 98, 1693-1698 (2001)
[17] Holzinger, K. J.; Harman, H. H., Factor Analysis (1941), University of Chicago Press: University of Chicago Press Chicago · Zbl 0060.31208
[18] Hunt, L.; Jorgensen, M., Mixture model clustering for mixed data with missing information, Comput. Statist. Data Anal., 41, 193-210 (2003)
[19] Jolliffe, I. T., Principal Component Analysis (1986), Springer: Springer New York
[20] Kamakashi, L.; Harp, S. A.; Samad, T.; Goldman, R. P., Imputation of missing data using machine learning techniques, (Simoudis, E.; Han, J.; Fayyad, U., Second International Conference on Knowledge Discovery and Data Mining (1996), Oregon), 140-145
[21] Kenney, N.; Macfarlane, A., Identifying problems with data collection at a local level: survey of NHS maternity units in England, Br. Med. J., 319, 619-622 (1999)
[22] Kiers, H. A.L., Weighted least squares fitting using ordinary least squares algorithms, Psychometrika, 62, 251-266 (1997) · Zbl 0873.62058
[23] Krzanowski, W. J., Missing value imputation in multivariate data using the singular value decomposition of a matrix, Biometr. Lett., 25, 31-39 (1988)
[24] Laaksonen, S., Regression-based nearest neighbour hot decking, Comput. Statist., 15, 65-71 (2000) · Zbl 0953.62002
[25] Little, R. J.A.; Rubin, D. B., Statistical Analysis with Missing Data (1987), Wiley: Wiley New York · Zbl 0665.62004
[26] Mirkin, B., Mathematical Classification and Clustering (1996), Kluwer Academic Publishers: Kluwer Academic Publishers Dordrecht · Zbl 0874.90198
[27] Myrtveit, I.; Stensrud, E.; Olsson, U. H., Analyzing data sets with missing data: an empirical evaluation of imputation methods and likelihood-based methods, IEEE Trans. Software Eng., 27, 999-1013 (2001)
[28] Quinlan, J. R., Unknown attribute values in induction, Sixth International Machine Learning Workshop, New York (1989)
[29] Roweis, S., EM algorithms for PCA and SPCA, (Jordan, M.; Kearns, M.; Solla, S., Advances in Neural Information Processing Systems, vol. 10 (1998), MIT Press: MIT Press Cambridge, MA), 626-632
[30] Rubin, D. B., Multiple Imputation for Nonresponse in Surveys (1987), Wiley: Wiley New York · Zbl 1070.62007
[31] Rubin, D. B., Multiple imputation after 18+ years, J. Amer. Statist. Assoc., 91, 473-489 (1996) · Zbl 0869.62014
[32] Schafer, J. L., Analysis of Incomplete Multivariate Data (1997), Chapman & Hall: Chapman & Hall London · Zbl 0997.62510
[33] Shum, H. Y.; Ikeuchi, K.; Reddy, R., PCA with missing data and its application to polyhedral object modelling, IEEE Trans. Pattern Anal. Mach. Intell., 17, 854-867 (1995)
[34] Strauss, R. E.; Atanassov, M. N.; De Oliveira, J. A., Evaluation of the principal-component and expectation-maximization methods for estimating missing data in morphometric studies, J. Vertebrate Paleontol., 23, 284-296 (2003)
[35] Tipping, M. E.; Bishop, C. M., Probabilistic principal component analysis, J. Roy. Statist. Soc. Ser. B, 61, 611-622 (1999) · Zbl 0924.62068
[36] Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Hastie, R.; Tibshirani, R.; Botstein, D.; Altman, R. B., Missing value estimation methods for DNA microarrays, Bioinformatics, 17, 520-525 (2001)
[37] Wasito, I.; Mirkin, B., Nearest neighbour approach in the least-squares data imputation algorithms, Inform. Sci., 169 (1) (2005) · Zbl 1084.62043
[38] Wold, H., Estimation of principal components and related models by iterative least square, (Krishnaiah, P. R., Multivariate Analysis Proceedings of International Symposium in Dayton (1966), Academic Press: Academic Press New York), 391-402
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.