
Testing conditional independence in supervised learning algorithms. (English) Zbl 07465666

Summary: We propose the conditional predictive impact (CPI), a consistent and unbiased estimator of the association between one or several features and a given outcome, conditional on a reduced feature set. Building on the knockoff framework of E. Candès et al. [J. R. Stat. Soc., Ser. B, Stat. Methodol. 80, No. 3, 551–577 (2018; Zbl 1398.62335)], we develop a novel testing procedure that works in conjunction with any valid knockoff sampler, supervised learning algorithm, and loss function. The CPI can be efficiently computed for high-dimensional data without any sparsity constraints. We demonstrate convergence criteria for the CPI and develop statistical inference procedures for evaluating its magnitude, significance, and precision. These tests aid in feature and model selection, extending traditional frequentist and Bayesian techniques to general supervised learning tasks. The CPI may also be applied in causal discovery to identify underlying multivariate graph structures. We test our method using various algorithms, including linear regression, neural networks, random forests, and support vector machines. Empirical results show that the CPI compares favorably to alternative variable importance measures and other nonparametric tests of conditional independence on a diverse array of real and synthetic datasets. Simulations confirm that our inference procedures successfully control Type I error with competitive power in a range of settings. Our method has been implemented in an R package, cpi, which can be downloaded from https://github.com/dswatson/cpi.
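
The core procedure lends itself to a compact illustration. The sketch below is not the authors' cpi package but a minimal toy version under strong simplifying assumptions: features are independent Gaussians (so redrawing a column from its marginal is a valid knockoff), the learner is ordinary least squares, the loss is squared error, and significance is assessed with a one-sided paired t-test on instance-wise loss differences, as in the paper's t-test variant.

```r
## Minimal CPI sketch (toy version; not the authors' implementation).
## Assumptions: independent Gaussian features, so an independent redraw
## of column j from its marginal is a valid knockoff; OLS learner;
## squared-error loss; one-sided paired t-test.

set.seed(1)
n <- 1000; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + rnorm(n)                  # only feature 1 is informative

train <- 1:500; test <- 501:1000
fit <- lm(y ~ ., data = data.frame(y = y, X)[train, ])

cpi_test <- function(j) {
  Xk <- X[test, ]
  Xk[, j] <- rnorm(length(test))        # knockoff copy of feature j
  loss_orig  <- (y[test] - predict(fit, data.frame(X[test, ])))^2
  loss_knock <- (y[test] - predict(fit, data.frame(Xk)))^2
  delta <- loss_knock - loss_orig       # instance-wise CPI contributions
  c(CPI = mean(delta),                  # > 0 suggests conditional dependence
    p.value = t.test(delta, alternative = "greater")$p.value)
}

t(sapply(1:p, cpi_test))                # feature 1 should stand out
```

The published package generalizes this template: the learner, loss function, knockoff sampler, and inference procedure (t-test, Fisher permutation, or Bayesian) are all interchangeable, which is the modularity the summary describes.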

MSC:

68T05 Learning and adaptive systems in artificial intelligence

Citations:

Zbl 1398.62335

References:

[1] Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, KR; Samek, W., On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PLoS ONE, 10, 7, 1-46 (2015)
[2] Barber, RF; Candès, EJ, Controlling the false discovery rate via knockoffs, Annals of Statistics, 43, 5, 2055-2085 (2015) · Zbl 1327.62082 · doi:10.1214/15-AOS1337
[3] Bates, S., Candès, E., Janson, L., & Wang, W. (2020). Metropolized knockoff sampling. Journal of the American Statistical Association, 1-15.
[4] Benjamini, Y.; Hochberg, Y., Controlling the false discovery rate: A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 57, 1, 289-300 (1995) · Zbl 0809.62014
[5] Benjamini, Y.; Yekutieli, D., The control of the false discovery rate in multiple testing under dependency, Annals of Statistics, 29, 4, 1165-1188 (2001) · Zbl 1041.62061 · doi:10.1214/aos/1013699998
[6] Berrett, TB; Wang, Y.; Barber, RF; Samworth, RJ, The conditional permutation test for independence while controlling for confounders, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82, 1, 175-197 (2020) · Zbl 1440.62223 · doi:10.1111/rssb.12340
[7] Bischl, B.; Lang, M.; Kotthoff, L.; Schiffner, J.; Richter, J.; Studerus, E., mlr: Machine learning in R, Journal of Machine Learning Research, 17, 170, 1-5 (2016) · Zbl 1392.68007
[8] Breiman, L., Random forests, Machine Learning, 45, 1, 1-33 (2001) · Zbl 1007.68152 · doi:10.1023/A:1010933404324
[9] Candès, E.; Fan, Y.; Janson, L.; Lv, J., Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80, 3, 551-577 (2018) · Zbl 1398.62335 · doi:10.1111/rssb.12265
[10] Doran, G., Muandet, K., Zhang, K., & Schölkopf, B. (2014). A permutation-based kernel conditional independence test. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (pp. 132-141).
[11] Dua, D.; Graff, C., UCI machine learning repository (2017), University of California, School of Information and Computer Science
[12] Feng, J., Williamson, B., Simon, N., & Carone, M. (2018). Nonparametric variable importance using an augmented neural network with multi-task learning. In Proceedings of the International Conference on Machine Learning (pp. 1496-1505).
[13] Fisher, RA, The design of experiments (1935), Oliver & Boyd
[14] Fisher, A.; Rudin, C.; Dominici, F., All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously, Journal of Machine Learning Research, 20, 177, 1-81 (2019) · Zbl 1436.62019
[15] Fleuret, F., Fast binary feature selection with conditional mutual information, Journal of Machine Learning Research, 5, 1531-1555 (2004) · Zbl 1222.68200
[16] Friedman, JH; Popescu, BE, Predictive learning via rule ensembles, The Annals of Applied Statistics, 2, 3, 916-954 (2008) · Zbl 1149.62051 · doi:10.1214/07-AOAS148
[17] Friedman, J.; Hastie, T.; Tibshirani, R., Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software, 33, 1, 1-41 (2010) · doi:10.18637/jss.v033.i01
[18] Fukumizu, K.; Gretton, A.; Sun, X.; Schölkopf, B., Kernel measures of conditional dependence, Advances in Neural Information Processing Systems, 20, 489-496 (2008)
[19] Gevrey, M.; Dimopoulos, I.; Lek, S., Review and comparison of methods to study the contribution of variables in artificial neural network models, Ecological Modelling, 160, 3, 249-264 (2003) · doi:10.1016/S0304-3800(02)00257-0
[20] Gregorutti, B.; Michel, B.; Saint-Pierre, P., Grouped variable importance with random forests and application to multiple functional data analysis, Computational Statistics & Data Analysis, 90, 15-35 (2015) · Zbl 1468.62069 · doi:10.1016/j.csda.2015.04.002
[21] Grömping, U., Estimators of relative importance in linear regression based on variance decomposition, The American Statistician, 61, 2, 139-147 (2007) · doi:10.1198/000313007X188252
[22] Guedj, B. (2019). A primer on PAC-Bayesian learning. arXiv preprint arXiv:1901.05353. · Zbl 1523.68057
[23] Guyon, I.; Elisseeff, A., An introduction to variable and feature selection, Journal of Machine Learning Research, 3, 7-8, 1157-1182 (2003) · Zbl 1102.68556
[24] Hansen, D., Manzo, B., & Regier, J. (2021). Normalizing flows for knockoff-free controlled feature selection. arXiv preprint arXiv:2106.01528.
[25] Harrison, D.; Rubinfeld, DL, Hedonic housing prices and the demand for clean air, Journal of Environmental Economics and Management, 5, 1, 81-102 (1978) · Zbl 0375.90023 · doi:10.1016/0095-0696(78)90006-2
[26] Herschkowitz, JI; Simin, K.; Weigman, VJ; Mikaelian, I.; Usary, J.; Hu, Z., Identification of conserved gene expression features between murine mammary carcinoma models and human breast tumors, Genome Biology, 8, 5, R76 (2007) · doi:10.1186/gb-2007-8-5-r76
[27] Holm, S., A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics, 6, 2, 65-70 (1979) · Zbl 0402.62058
[28] Hubbard, AE; Kennedy, CJ; van der Laan, MJ; van der Laan, MJ; Rose, S., Data-adaptive target parameters, Targeted learning in data science, 125-142 (2018), Springer
[29] Kalisch, M.; Mächler, M.; Colombo, D.; Maathuis, MH; Bühlmann, P., Causal inference using graphical models with the R package pcalg, Journal of Statistical Software, 47, 11, 1-26 (2012) · doi:10.18637/jss.v047.i11
[30] Koller, D.; Friedman, N., Probabilistic graphical models: Principles and techniques (2009), MIT Press · Zbl 1183.68483
[31] Korb, KB; Nicholson, AE, Bayesian artificial intelligence (2009), Chapman and Hall/CRC · Zbl 1080.68100
[32] Kruschke, JK, Bayesian estimation supersedes the t test, Journal of Experimental Psychology: General, 142, 2, 573-603 (2013) · doi:10.1037/a0029146
[33] Kuhn, M.; Johnson, K., Feature engineering and selection: A practical approach for predictive models (2019), Chapman and Hall/CRC · doi:10.1201/9781315108230
[34] Kursa, MB; Rudnicki, WR, Feature selection with the Boruta package, Journal of Statistical Software, 36, 11, 1-13 (2010) · doi:10.18637/jss.v036.i11
[35] Lei, J.; G’Sell, M.; Rinaldo, A.; Tibshirani, RJ; Wasserman, L., Distribution-free predictive inference for regression, Journal of the American Statistical Association, 113, 523, 1094-1111 (2018) · Zbl 1402.62155 · doi:10.1080/01621459.2017.1307116
[36] Lim, E.; Vaillant, F.; Wu, D.; Forrest, NC; Pal, B.; Hart, AH, Aberrant luminal progenitors as the candidate target population for basal tumor development in BRCA1 mutation carriers, Nature Medicine, 15, 907 (2009) · doi:10.1038/nm.2000
[37] Lindeman, RH; Merenda, PF; Gold, RZ, Introduction to bivariate and multivariate analysis (1980), Longman · Zbl 0455.62039
[38] Lundberg, SM; Lee, S-I, A unified approach to interpreting model predictions, Advances in Neural Information Processing Systems, 30, 4765-4774 (2017)
[39] Maathuis, MH; Kalisch, M.; Bühlmann, P., Estimating high-dimensional intervention effects from observational data, Annals of Statistics, 37, 6, 3133-3164 (2009) · Zbl 1191.62118 · doi:10.1214/09-AOS685
[40] Martínez Sotoca, J.; Pla, F., Supervised feature selection by clustering using conditional mutual information-based distances, Pattern Recognition, 43, 6, 2068-2081 (2010) · Zbl 1191.68514 · doi:10.1016/j.patcog.2009.12.013
[41] Meinshausen, N.; Bühlmann, P., Stability selection, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72, 4, 417-473 (2010) · Zbl 1411.62142 · doi:10.1111/j.1467-9868.2010.00740.x
[42] Mentch, L.; Hooker, G., Quantifying uncertainty in random forests via confidence intervals and hypothesis tests, Journal of Machine Learning Research, 17, 1, 841-881 (2016) · Zbl 1360.62095
[43] Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., & Leisch, F. (2018). e1071: Misc functions of the Department of Statistics, Probability Theory Group. CRAN. R package version 1.7-0.
[44] Nicodemus, KK; Malley, JD; Strobl, C.; Ziegler, A., The behaviour of random forest permutation-based variable importance measures under predictor correlation, BMC Bioinformatics, 11, 1, 110 (2010) · doi:10.1186/1471-2105-11-110
[45] Patterson, E., & Sesia, M. (2018). knockoff. CRAN. R package version 0.3.2.
[46] Pearl, J., Probabilistic reasoning in intelligent systems: Networks of plausible inference (1988), Morgan Kaufmann · Zbl 0746.68089
[47] Phipson, B., & Smyth, G. (2010). Permutation P-values should never be zero: Calculating exact P-values when permutations are randomly drawn. Statistical Applications in Genetics and Molecular Biology, 9(1). · Zbl 1304.92098
[48] Ramsey, J. D. (2014). A scalable conditional independence test for nonlinear, non-Gaussian data. arXiv preprint arXiv:1401.5031.
[49] Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144).
[50] Rinaldo, A.; Wasserman, L.; G’Sell, M., Bootstrapping and sample splitting for high-dimensional, assumption-lean inference, Annals of Statistics, 47, 6, 3438-3469 (2019) · Zbl 1436.62107 · doi:10.1214/18-AOS1784
[51] Romano, Y.; Sesia, M.; Candès, E., Deep Knockoffs, Journal of the American Statistical Association, 115, 532, 1861-1872 (2020) · Zbl 1452.62710 · doi:10.1080/01621459.2019.1660174
[52] Rouder, JN; Speckman, PL; Sun, D.; Morey, RD; Iverson, G., Bayesian t tests for accepting and rejecting the null hypothesis, Psychonomic Bulletin & Review, 16, 2, 225-237 (2009) · doi:10.3758/PBR.16.2.225
[53] Sauer, N., On the density of families of sets, Journal of Combinatorial Theory, Series A, 13, 1, 145-147 (1972) · Zbl 0248.05005 · doi:10.1016/0097-3165(72)90019-2
[54] Scutari, M., Learning Bayesian networks with the bnlearn R package, Journal of Statistical Software, 35, 3, 1-22 (2010) · doi:10.18637/jss.v035.i03
[55] Scutari, M.; Denis, J-B, Bayesian networks: With examples in R (2014), Chapman and Hall/CRC · Zbl 1341.62025 · doi:10.1201/b17065
[56] Sesia, M.; Sabatti, C.; Candès, EJ, Gene hunting with hidden Markov model knockoffs, Biometrika, 106, 1, 1-18 (2019) · Zbl 1506.62463 · doi:10.1093/biomet/asy033
[57] Shah, R.; Peters, J., The hardness of conditional independence testing and the generalised covariance measure, Annals of Statistics, 48, 3, 1514-1538 (2020) · Zbl 1451.62081 · doi:10.1214/19-AOS1857
[58] Shalev-Shwartz, S.; Ben-David, S., Understanding machine learning: From theory to algorithms (2014), Cambridge University Press · Zbl 1305.68005 · doi:10.1017/CBO9781107298019
[59] Shelah, S., A combinatorial problem: Stability and order for models and theories in infinitary languages, Pacific Journal of Mathematics, 41, 1, 247-261 (1972) · Zbl 0239.02024 · doi:10.2140/pjm.1972.41.247
[60] Shrikumar, A., Greenside, P., & Kundaje, A. (2017). Learning important features through propagating activation differences. In Proceedings of the International Conference on Machine Learning (Vol. 70, pp. 3145-3153).
[61] Sørlie, T.; Tibshirani, R.; Parker, J.; Hastie, T.; Marron, JS; Nobel, A., Repeated observation of breast tumor subtypes in independent gene expression data sets, Proceedings of the National Academy of Sciences, 100, 14, 8418-8423 (2003) · doi:10.1073/pnas.0932692100
[62] Spirtes, P.; Glymour, CN; Scheines, R., Causation, prediction, and search (2000), The MIT Press · Zbl 0806.62001
[63] Steinke, T., & Zakynthinou, L. (2020). Reasoning about generalization via conditional mutual information. In Proceedings of the International Conference on Learning Theory (pp. 3437-3452).
[64] Storey, JD, A direct approach to false discovery rates, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 3, 479-498 (2002) · Zbl 1090.62073 · doi:10.1111/1467-9868.00346
[65] Strobl, C.; Boulesteix, A-L; Kneib, T.; Augustin, T.; Zeileis, A., Conditional variable importance for random forests, BMC Bioinformatics, 9, 1, 307 (2008) · doi:10.1186/1471-2105-9-307
[66] Strobl, E. V., Zhang, K., & Visweswaran, S. (2018). Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. Journal of Causal Inference, 7(1), 20180017.
[67] Subramanian, A.; Tamayo, P.; Mootha, VK; Mukherjee, S.; Ebert, BL; Gillette, MA, Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles, Proceedings of the National Academy of Sciences, 102, 43, 15545-15550 (2005) · doi:10.1073/pnas.0506580102
[68] Tansey, W., Veitch, V., Zhang, H., Rabadan, R., & Blei, D.M. (2021). The holdout randomization test for feature selection in black box models. Journal of Computational and Graphical Statistics, 1-37.
[69] R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing.
[70] Tibshirani, R., Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society B, 58, 1, 267-288 (1996) · Zbl 0850.62538
[71] Turner, NC; Reis-Filho, JS, Basal-like breast cancer and the BRCA1 phenotype, Oncogene, 25, 5846 (2006) · doi:10.1038/sj.onc.1209876
[72] van der Laan, M. J. (2006). Statistical inference for variable importance. The International Journal of Biostatistics, 2(1).
[73] van der Laan, MJ; Rose, S., Targeted learning in data science: Causal inference for complex longitudinal studies (2018), Springer · Zbl 1408.62005
[74] Vapnik, V.; Chervonenkis, A., On the uniform convergence of relative frequencies to their probabilities, Theory of Probability & Its Applications, 16, 2, 264-280 (1971) · Zbl 0247.60005 · doi:10.1137/1116025
[75] Vejmelka, M.; Paluš, M., Inferring the directionality of coupling with conditional mutual information, Physical Review E, 77, 2, 026214 (2008) · doi:10.1103/PhysRevE.77.026214
[76] Venables, WN; Ripley, BD, Modern applied statistics with S (2002), Springer · Zbl 1006.62003 · doi:10.1007/978-0-387-21706-2
[77] Verma, T., & Pearl, J. (1991). Equivalence and synthesis of causal models. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (pp. 255-270).
[78] Wachter, S.; Mittelstadt, B.; Russell, C., Counterfactual explanations without opening the black box: Automated decisions and the GDPR, Harvard Journal of Law & Technology, 31, 2, 841-887 (2018)
[79] Wetzels, R.; Raaijmakers, JGW; Jakab, E.; Wagenmakers, E-J, How to quantify support for and against the null hypothesis: A flexible WinBUGS implementation of a default Bayesian t test, Psychonomic Bulletin & Review, 16, 4, 752-760 (2009) · doi:10.3758/PBR.16.4.752
[80] Williamson, BD; Gilbert, PB; Carone, M.; Simon, N., Nonparametric variable importance assessment using machine learning techniques, Biometrics, 77, 1, 9-22 (2021) · Zbl 1520.62369 · doi:10.1111/biom.13392
[81] Wolpert, DH; Macready, WG, No free lunch theorems for optimization, IEEE Transactions on Evolutionary Computation, 1, 1, 67-82 (1997) · doi:10.1109/4235.585893
[82] Wright, M. N., & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1).
[83] Wu, D.; Smyth, GK, Camera: A competitive gene set test accounting for inter-gene correlation, Nucleic Acids Research, 40, 17, e133 (2012) · doi:10.1093/nar/gks461
[84] Zhang, K., Peters, J., Janzing, D., & Schölkopf, B. (2011). Kernel-based conditional independence test and application in causal discovery. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (pp. 804-813).