
FragmGAN: generative adversarial nets for fragmentary data imputation and prediction. (English) Zbl 07834320

Summary: Modern scientific research and applications frequently encounter 'fragmentary data', which poses major challenges for imputation and prediction. By leveraging the structure of response patterns, we propose a unified and flexible framework based on Generative Adversarial Nets (GAN) that handles fragmentary data imputation and label prediction simultaneously. Unlike most other generative-model-based imputation methods, which either lack theoretical guarantees or only consider data Missing Completely At Random (MCAR), the proposed FragmGAN has theoretical guarantees for imputation under Missing At Random (MAR) data, and no hint mechanism is needed. FragmGAN trains a predictor jointly with the generator and discriminator. This linkage mechanism yields significant advantages in predictive performance across extensive experiments.
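The summary describes the architecture only at a high level. As a rough illustration of the general idea (a GAIN-style imputer trained without a hint vector, coupled with a predictor head whose loss is back-propagated through the imputed values), the following minimal PyTorch sketch may help. All network sizes, loss weights, the masking mechanism, and the synthetic data are assumptions made for the example only; they are not the authors' implementation.

```python
# Minimal sketch of joint imputation + prediction with a GAN (illustrative only).
import torch
import torch.nn as nn

torch.manual_seed(0)

d, n = 8, 256                                    # feature dimension, sample size (illustrative)

# Synthetic fully observed data, binary labels, and a random mask
# (purely for demonstration; the paper works with real fragmentary data).
X_full = torch.randn(n, d)
y = (X_full[:, 0] + 0.5 * X_full[:, 1] > 0).float().unsqueeze(1)
mask = (torch.rand(n, d) > 0.3).float()          # 1 = observed, 0 = missing
X_obs = X_full * mask                            # zero-fill missing entries

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

G = mlp(2 * d, d)   # generator: (zero-filled data, mask) -> candidate imputations
D = mlp(d, d)       # discriminator: imputed vector -> per-entry "observed" logits (no hint input)
P = mlp(d, 1)       # predictor: imputed vector -> label logit

bce = nn.BCEWithLogitsLoss()
opt_G = torch.optim.Adam(list(G.parameters()) + list(P.parameters()), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)

for step in range(2000):
    # Impute: keep observed entries, fill the missing ones with generator output.
    G_out = G(torch.cat([X_obs, mask], dim=1))
    X_hat = mask * X_obs + (1 - mask) * G_out

    # Discriminator: guess which entries were imputed.
    opt_D.zero_grad()
    loss_D = bce(D(X_hat.detach()), mask)
    loss_D.backward()
    opt_D.step()

    # Generator + predictor: fool D on missing entries, reconstruct observed
    # entries, and predict the label from the completed data vector.
    opt_G.zero_grad()
    D_logits = D(X_hat)
    adv_loss = bce(D_logits[mask == 0], torch.ones_like(D_logits[mask == 0]))
    rec_loss = ((mask * (G_out - X_obs)) ** 2).sum() / mask.sum()
    pred_loss = bce(P(X_hat), y)
    loss_G = adv_loss + 10.0 * rec_loss + 1.0 * pred_loss   # weights are arbitrary here
    loss_G.backward()
    opt_G.step()
```

The point reflected in this sketch is the linkage mechanism mentioned in the summary: because the prediction loss is back-propagated through the imputed values, the imputations are tuned toward the downstream label prediction task rather than fitted in isolation.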

MSC:

62-XX Statistics
