Extended missing data imputation via GANs for ranking applications. (English) Zbl 1507.68243

Summary: We propose Conditional Imputation GAN, an extended missing data imputation method based on Generative Adversarial Networks (GANs). The motivating use case is learning-to-rank, the cornerstone of modern search, recommendation system, and information retrieval applications. Empirical ranking datasets do not always follow standard Gaussian distributions or the Missing Completely At Random (MCAR) mechanism, both of which are standard assumptions of classic missing data imputation methods. Our methodology provides a simple solution that offers compatible imputation guarantees while relaxing assumptions on missingness mechanisms, and that sidesteps approximating intractable distributions to improve imputation quality. We prove that the optimal GAN imputation is achieved for Extended Missing At Random and Extended Always Missing At Random mechanisms, beyond the naive MCAR. Our method demonstrates the highest imputation quality on the open-source Microsoft Research Ranking Dataset and a synthetic ranking dataset compared to state-of-the-art benchmarks and across various feature distributions. Using a proprietary Amazon Search ranking dataset, we also demonstrate comparable ranking quality metrics for ranking models trained on GAN-imputed data compared to ground-truth data.

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62D10 Missing data

References:

[1] Arjovsky M, Chintala S, Bottou L (2017) Wasserstein generative adversarial networks. In: International conference on machine learning, PMLR, pp 214-223
[2] Burges, CJ, From RankNet to LambdaRank to LambdaMART: an overview, Learning, 11, 23-581, 81 (2010)
[3] Camino RD, Hammerschmidt CA, State R (2019) Improving missing data imputation with deep generative models. arXiv preprint arXiv:1902.10666
[4] Doretti, M.; Geneletti, S.; Stanghellini, E., Missing data: a unified taxonomy guided by conditional independence, Int Stat Rev, 86, 2, 189-204 (2018) · Zbl 07763590 · doi:10.1111/insr.12242
[5] Goodfellow I, Pouget-Abadie J, Mirza M, et al (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672-2680
[6] Guo, Z.; Wan, Y.; Ye, H., A data imputation method for multivariate time series based on generative adversarial network, Neurocomputing, 360, 185-197 (2019) · doi:10.1016/j.neucom.2019.06.007
[7] Heitjan, DF; Basu, S., Distinguishing missing at random and missing completely at random, Am Stat, 50, 3, 207-213 (1996)
[8] Ke G, Meng Q, Finley T, et al (2017) LightGBM: a highly efficient gradient boosting decision tree. In: Advances in neural information processing systems, pp 3146-3154
[9] Kim J, Tae D, Seok J (2020) A survey of missing data imputation using generative adversarial networks. In: 2020 International conference on artificial intelligence in information and communication (ICAIIC), IEEE, pp 454-456
[10] Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: ICLR (Poster)
[11] Lee D, Kim J, Moon WJ, et al (2019) CollaGAN: Collaborative GAN for missing image data imputation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2487-2496
[12] Li, H., A short introduction to learning to rank, IEICE Trans Inf Syst, 94, 10, 1854-1862 (2011) · doi:10.1587/transinf.E94.D.1854
[13] Li SCX, Jiang B, Marlin B (2018) MisGAN: learning from incomplete data with generative adversarial networks. In: International conference on learning representations
[14] Little RJ, Rubin DB (2019) Statistical analysis with missing data, vol 793. Wiley · Zbl 1411.62006
[15] Luo Y, Cai X, Zhang Y, et al (2018) Multivariate time series imputation with generative adversarial networks. In: Advances in neural information processing systems, pp 1596-1607
[16] Marlin BM, Zemel RS (2009) Collaborative prediction and ranking with non-random missing data. In: Proceedings of the third ACM conference on recommender systems, pp 5-12
[17] Mealli, F.; Rubin, DB, Clarifying missing at random and related definitions, and implications when coupled with exchangeability, Biometrika, 102, 4, 995-1000 (2015) · Zbl 1390.62042 · doi:10.1093/biomet/asv035
[18] Mirza M, Osindero S (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784
[19] Oza M, Vaghela H, Srivastava K (2020) Progressive generative adversarial binary networks for music generation. In: International conference on innovative computing and communications. Springer, pp 181-192
[20] Qin T, Liu T (2013) Introducing LETOR 4.0 datasets. CoRR abs/1306.2597. arXiv:1306.2597
[21] Radev DR, Qi H, Wu H, et al (2002) Evaluating web-based question answering systems. In: LREC, Citeseer
[22] Radford A, Metz L, Chintala S (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In: 4th International conference on learning representations, ICLR 2016, conference track proceedings, San Juan, Puerto Rico
[23] Salimans T, Goodfellow I, Zaremba W, et al (2016) Improved techniques for training GANs. In: Advances in neural information processing systems, pp 2234-2242
[24] Seaman S, Galati J, Jackson D, et al (2013) What is meant by “missing at random”? Stat Sci 28(2):257-268 · Zbl 1331.62036
[25] Sheng L, Pan J, Guo J, et al (2019) Unsupervised bi-directional flow-based video generation from one snapshot. arXiv preprint arXiv:1903.00913
[26] Stekhoven, DJ; Bühlmann, P., MissForest: non-parametric missing value imputation for mixed-type data, Bioinformatics, 28, 1, 112-118 (2012) · doi:10.1093/bioinformatics/btr597
[27] Tang, F.; Ishwaran, H., Random forest missing data algorithms, Stat Anal Data Min ASA Data Sci J, 10, 6, 363-377 (2017) · Zbl 07260721 · doi:10.1002/sam.11348
[28] Thanh-Tung H, Tran T (2020) Catastrophic forgetting and mode collapse in GANs. In: 2020 International joint conference on neural networks (IJCNN), IEEE, pp 1-10
[29] Valizadegan H, Jin R, Zhang R, et al (2009) Learning to rank by optimizing NDCG measure. In: Advances in neural information processing systems, pp 1883-1891
[30] Van Buuren S, Groothuis-Oudshoorn K (2011) mice: multivariate imputation by chained equations in R. J Stat Softw 45:1-67
[31] Van Buuren S (2018) Flexible imputation of missing data. CRC Press · Zbl 1416.62030
[32] Xu B, Wang N, Chen T, et al (2015) Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853
[33] Yoon S, Sull S (2020) GAMIN: generative adversarial multiple imputation network for highly missing data. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8456-8464
[34] Yoon J, Jordon J, Schaar M (2018) GAIN: missing data imputation using generative adversarial nets. In: International conference on machine learning, PMLR, pp 5689-5698
[35] Zhang, Y.; Zhou, B.; Cai, X., Missing value imputation in multivariate time series with end-to-end generative adversarial networks, Inf Sci, 551, 67-82 (2021) · Zbl 1484.62116 · doi:10.1016/j.ins.2020.11.035