
Counterfactual mean embeddings. (English) Zbl 07415105

Summary: Counterfactual inference has become a ubiquitous tool in online advertisement, recommendation systems, medical diagnosis, and econometrics. Accurate modelling of the outcome distributions associated with different interventions, known as counterfactual distributions, is crucial for the success of these applications. In this work, we propose to model counterfactual distributions using a novel Hilbert space representation called the counterfactual mean embedding (CME). The CME embeds the associated counterfactual distribution into a reproducing kernel Hilbert space (RKHS) endowed with a positive definite kernel, which allows us to perform causal inference over the entire landscape of the counterfactual distribution. Based on this representation, we propose a distributional treatment effect (DTE) that quantifies the causal effect over entire outcome distributions. Our approach is nonparametric: under the unconfoundedness assumption, the CME can be estimated from observational data without any parametric assumption about the underlying distributions. We also establish a rate of convergence for the proposed estimator which depends on the smoothness of the conditional mean and the Radon-Nikodym derivative of the underlying marginal distributions. Furthermore, our framework allows for more complex outcomes such as images, sequences, and graphs. Our experimental results on synthetic data and off-policy evaluation tasks demonstrate the advantages of the proposed estimator.
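The construction described in the summary can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation: it assumes a Gaussian RBF kernel throughout, and the function names (`gaussian_kernel`, `mmd2`, `cme_weights`, `dte2`) and hyperparameter defaults are illustrative choices. `cme_weights` computes kernel-ridge weights so that a weighted sum of outcome kernel features approximates the mean embedding of a counterfactual outcome distribution, and `dte2` is the squared RKHS distance between two such embeddings, which with uniform weights reduces to the standard (V-statistic) MMD estimate between the two outcome samples.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Gram matrix of the Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    if Y.ndim == 1:
        Y = Y[:, None]
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2(Y0, Y1, sigma=1.0):
    # Squared RKHS distance between the empirical mean embeddings of two samples
    # (the V-statistic estimate of MMD^2).  With outcome samples from the two
    # treatment arms, this is a plug-in distributional treatment effect.
    K00 = gaussian_kernel(Y0, Y0, sigma)
    K11 = gaussian_kernel(Y1, Y1, sigma)
    K01 = gaussian_kernel(Y0, Y1, sigma)
    return K00.mean() + K11.mean() - 2.0 * K01.mean()

def cme_weights(X_obs, X_target, lam=1e-3, sigma=1.0):
    # Kernel-ridge weights w such that sum_i w_i * l(Y_obs_i, .) approximates
    # the mean embedding of the outcome distribution that would arise if the
    # covariate distribution were that of X_target (a counterfactual CME sketch).
    n = len(X_obs)
    K = gaussian_kernel(X_obs, X_obs, sigma)
    Kt = gaussian_kernel(X_obs, X_target, sigma)
    W = np.linalg.solve(K + n * lam * np.eye(n), Kt)  # shape (n, m)
    return W.mean(axis=1)                             # average over target points

def dte2(w0, Y0, w1, Y1, sigma=1.0):
    # Squared RKHS distance between two weighted embeddings of outcome samples;
    # with uniform weights this coincides with mmd2(Y0, Y1, sigma).
    L00 = gaussian_kernel(Y0, Y0, sigma)
    L11 = gaussian_kernel(Y1, Y1, sigma)
    L01 = gaussian_kernel(Y0, Y1, sigma)
    return w0 @ L00 @ w0 + w1 @ L11 @ w1 - 2.0 * (w0 @ L01 @ w1)
```

On synthetic data, `mmd2` is near zero for two samples from the same distribution and grows with a mean shift between the arms, which is the qualitative behaviour a DTE estimate should exhibit; regularization (`lam`) and kernel bandwidth (`sigma`) would in practice be chosen by cross-validation or the median heuristic.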

MSC:

68T05 Learning and adaptive systems in artificial intelligence

Software:

MMD GAN
