
Understanding complex predictive models with ghost variables. (English) Zbl 1516.62094

Summary: Framed within the literature on Interpretable Machine Learning, we propose a new procedure to assign a measure of relevance to each explanatory variable in a complex predictive model. We assume that we have a training set to fit the model and a test set to check its out-of-sample performance. We propose to measure the individual relevance of each variable by comparing the predictions of the model on the test set with those obtained when the variable of interest is substituted (in the test set) by its ghost variable, defined as the prediction of this variable from the remaining explanatory variables. In linear models it is shown that, on the one hand, the proposed measure gives results similar to leave-one-covariate-out (loco) at a lower computational cost and outperforms random permutations, and, on the other hand, it is strongly related to the usual \(F\)-statistic measuring the significance of a variable. In nonlinear predictive models (such as neural networks or random forests) the proposed measure identifies the relevance of the variables efficiently, as demonstrated by a simulation study comparing ghost variables with alternative methods (including loco, random permutations, knockoff variables, and estimated conditional distributions). Finally, we study the joint relevance of the variables by defining the relevance matrix as the covariance matrix of the vectors of effects on predictions obtained when each variable is replaced by its ghost variable. Our proposal is illustrated with simulated examples and the analysis of a large real data set.
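A minimal sketch of the ghost-variable idea described above, assuming a fitted scikit-learn-style model exposing a `predict` method; the linear ghost regression, the mean-squared-effect normalization, and the toy data are illustrative assumptions, not the paper's exact definitions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

def ghost_effects(model, X_test):
    """Effect on predictions of replacing each column of X_test by its
    ghost variable (its prediction from the remaining columns, here
    fitted by linear regression on the test set)."""
    n, p = X_test.shape
    base = model.predict(X_test)               # predictions, original data
    effects = np.empty((n, p))
    for j in range(p):
        others = np.delete(np.arange(p), j)
        ghost = LinearRegression().fit(X_test[:, others], X_test[:, j])
        X_ghost = X_test.copy()
        X_ghost[:, j] = ghost.predict(X_test[:, others])
        effects[:, j] = base - model.predict(X_ghost)
    return effects

def ghost_relevance(model, X_test):
    """Individual relevance: mean squared effect of ghosting each
    variable (an illustrative normalization)."""
    return (ghost_effects(model, X_test) ** 2).mean(axis=0)

def relevance_matrix(model, X_test):
    """Joint relevance: covariance matrix of the per-variable effect
    vectors (the paper's exact centering/scaling may differ)."""
    return np.cov(ghost_effects(model, X_test), rowvar=False)

# Toy illustration: correlated covariates, only x1 and x2 matter.
rng = np.random.default_rng(0)
Sigma = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
X = rng.multivariate_normal(np.zeros(3), Sigma, size=600)
y = X[:, 0] + X[:, 0] * X[:, 1] + rng.normal(scale=0.5, size=600)
X_train, X_test, y_train = X[:400], X[400:], y[:400]
rf = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print(ghost_relevance(rf, X_test))  # relevance of x3 should be near zero
```

Note that only the cheap ghost regressions are refitted, once per variable, while the predictive model itself is left untouched; this is what makes the measure less costly than loco, which requires refitting the full model for each omitted covariate.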

MSC:

62R07 Statistical aspects of big data and data science
68T09 Computational aspects of data analysis and big data
62G08 Nonparametric regression and quantile regression
