
High-dimensional influence measure. (English) Zbl 1360.62411

Summary: Influence diagnosis is important because the presence of influential observations can lead to distorted analyses and misleading interpretations. This is particularly true for high-dimensional data, where the increased dimensionality and complexity may amplify both the chance of an observation being influential and its potential impact on the analysis. In this article, we propose a novel high-dimensional influence measure for regressions in which the number of predictors far exceeds the sample size. Our proposal can be viewed as a high-dimensional counterpart of the classical Cook’s distance. However, whereas Cook’s distance quantifies an individual observation’s influence on the least squares coefficient estimate, our new diagnostic measure captures its influence on the marginal correlations, which in turn affect downstream analyses including coefficient estimation, variable selection, and screening. Moreover, we establish the asymptotic distribution of the proposed influence measure by letting the predictor dimension go to infinity. This asymptotic distribution yields a principled rule for determining the critical value used to detect influential observations. Both simulations and a real data analysis demonstrate the usefulness of the new influence diagnostic.
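
The summary describes the measure only in words; the following is a minimal Python sketch of the underlying idea, assuming a leave-one-out statistic of the form \(D_k = p^{-1}\sum_{j=1}^{p}(\hat\rho_j - \hat\rho_j^{(k)})^2\), i.e., the average squared change in the \(p\) marginal correlations when observation \(k\) is deleted. The function names, the unscaled form, and the toy data are illustrative assumptions; the paper’s exact scaling and the asymptotic critical value are not reproduced here.

```python
import numpy as np

def marginal_correlations(X, y):
    """Sample Pearson correlations between each column of X and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return num / den

def influence_measure(X, y):
    """Leave-one-out influence on marginal correlations:
    D_k = average squared change in the p marginal correlations
    when observation k is deleted (a sketch of the idea, not the
    paper's exact statistic or scaling)."""
    n, p = X.shape
    rho_full = marginal_correlations(X, y)
    D = np.empty(n)
    for k in range(n):
        keep = np.arange(n) != k
        rho_k = marginal_correlations(X[keep], y[keep])
        D[k] = np.mean((rho_full - rho_k) ** 2)
    return D

# Purely synthetic illustration with one contaminated observation.
rng = np.random.default_rng(0)
n, p = 50, 500
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 2.0
y = X @ beta + rng.standard_normal(n)
y[0] += 25.0                      # corrupt the first response
D = influence_measure(X, y)
print(D.argmax())                  # the contaminated case should stand out
```

In the paper, a formal cutoff for flagging influential observations is obtained from the asymptotic distribution of the measure as \(p \to \infty\); the sketch above only ranks observations by their leave-one-out impact.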

MSC:

62J20 Diagnostics, and linear inference and regression
62E20 Asymptotic distribution theory in statistics

References:

[1] Andersen, E. B. (1992). Diagnostics in categorical data analysis. J. R. Stat. Soc. Ser. B Stat. Methodol. 54 781-791.
[2] Banerjee, M. (1998). Cook’s distance in linear longitudinal models. Comm. Statist. Theory Methods 27 2973-2983. · Zbl 0956.62054 · doi:10.1080/03610929808832267
[3] Banerjee, M. and Frees, E. W. (1997). Influence diagnostics for linear longitudinal models. J. Amer. Statist. Assoc. 92 999-1005. · Zbl 0889.62063 · doi:10.2307/2965564
[4] Belloni, A. and Chernozhukov, V. (2011). \(\ell _{1}\)-penalized quantile regression in high-dimensional sparse models. Ann. Statist. 39 82-130. · Zbl 1209.62064 · doi:10.1214/10-AOS827
[5] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol. 57 289-300. · Zbl 0809.62014
[6] Benjamini, Y. and Hochberg, Y. (2000). On the adaptive control of the false discovery rate in multiple testing with independent statistics. J. Educ. Behav. Stat. 25 60-83.
[7] Bickel, P. J. and Levina, E. (2008). Covariance regularization by thresholding. Ann. Statist. 36 2577-2604. · Zbl 1196.62062 · doi:10.1214/08-AOS600
[8] Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when \(p\) is much larger than \(n\). Ann. Statist. 35 2313-2351. · Zbl 1139.62019 · doi:10.1214/009053606000001523
[9] Chatterjee, S. and Hadi, A. S. (1988). Sensitivity Analysis in Linear Regression . Wiley, New York. · Zbl 0648.62066
[10] Chiang, A. P., Beck, J. S., Yen, H. J., Tayeh, M. K., Scheetz, T. E., Swiderski, R., Nishimura, D., Braun, T. A., Kim, K. Y., Huang, J., Elbedour, K., Carmi, R., Slusarski, D. C., Casavant, T. L., Stone, E. M. and Sheffield, V. C. (2006). Homozygosity mapping with SNP arrays identifies a novel gene for Bardet-Biedl syndrome (BBS11). Proc. Natl. Acad. Sci. USA 103 6287-6292.
[11] Christensen, R., Pearson, L. M. and Johnson, W. (1992). Case-deletion diagnostics for mixed models. Technometrics 34 38-45. · Zbl 0761.62098 · doi:10.2307/1269550
[12] Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics 19 15-18. · Zbl 0371.62096 · doi:10.2307/1268249
[13] Cook, R. D. (1979). Influential observations in linear regression. J. Amer. Statist. Assoc. 74 169-174. · Zbl 0398.62057 · doi:10.2307/2286747
[14] Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression . Chapman & Hall, London. · Zbl 0564.62054
[15] Critchley, F., Atkinson, R. A., Lu, G. and Biazi, E. (2001). Influence analysis based on the case sensitivity function. J. R. Stat. Soc. Ser. B Stat. Methodol. 63 307-323. · Zbl 0979.62050 · doi:10.1111/1467-9868.00287
[16] Davison, A. C. and Tsai, C. L. (1992). Regression model diagnostics. Int. Stat. Rev. 60 337-353. · Zbl 0775.62201 · doi:10.2307/1403682
[17] Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. Technical report, Stanford Univ.
[18] Draper, N. R. and Smith, H. (1998). Applied Regression Analysis , 3rd ed. Wiley, New York. · Zbl 0895.62073
[19] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407-499. · Zbl 1091.62054 · doi:10.1214/009053604000000067
[20] Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96 1151-1160. · Zbl 1073.62511 · doi:10.1198/016214501753382129
[21] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348-1360. · Zbl 1073.62547 · doi:10.1198/016214501753382273
[22] Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol. 70 849-911. · doi:10.1111/j.1467-9868.2008.00674.x
[23] Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Ann. Statist. 38 3567-3604. · Zbl 1206.68157 · doi:10.1214/10-AOS798
[24] Fu, W. J. (1998). Penalized regressions: The bridge versus the Lasso. J. Comput. Graph. Statist. 7 397-416.
[25] Fung, W.-K., Zhu, Z.-Y., Wei, B.-C. and He, X. (2002). Influence diagnostics and outlier tests for semiparametric mixed models. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 565-579. · Zbl 1090.62039 · doi:10.1111/1467-9868.00351
[26] Huang, J., Horowitz, J. L. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Statist. 36 587-613. · Zbl 1133.62048 · doi:10.1214/009053607000000875
[27] Huang, J., Ma, S. and Zhang, C.-H. (2008). Adaptive Lasso for sparse high-dimensional regression models. Statist. Sinica 18 1603-1618. · Zbl 1255.62198
[28] Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29 295-327. · Zbl 1016.62078 · doi:10.1214/aos/1009210544
[29] Pan, J.-X. and Fang, K.-T. (2002). Growth Curve Models and Statistical Diagnostics . Springer, New York. · Zbl 1024.62025
[30] Preisser, J. S. and Qaqish, B. F. (1996). Deletion diagnostics for generalised estimating equations. Biometrika 83 551-562. · Zbl 0866.62041 · doi:10.1093/biomet/83.3.551
[31] Scheetz, T., Kim, K., Swiderski, R., Philp, A., Braun, T., Knudtson, K., Dorrance, A., DiBona, G., Huang, J., Casavant, T. et al. (2006). Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc. Natl. Acad. Sci. USA 103 14429-14434.
[32] Storey, J. D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 479-498. · Zbl 1090.62073 · doi:10.1111/1467-9868.00346
[33] Thomas, W. and Cook, R. D. (1989). Assessing influence on regression coefficients in generalized linear models. Biometrika 76 741-749. · Zbl 0681.62056 · doi:10.1093/biomet/76.4.741
[34] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58 267-288. · Zbl 0850.62538
[35] Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. J. Amer. Statist. Assoc. 104 1512-1524. · Zbl 1205.62103 · doi:10.1198/jasa.2008.tm08516
[36] Wang, H. and Leng, C. (2007). Unified Lasso estimation by least squares approximation. J. Amer. Statist. Assoc. 102 1039-1048. · Zbl 1306.62167 · doi:10.1198/016214507000000509
[37] Wang, H., Li, G. and Jiang, G. (2007). Robust regression shrinkage and consistent variable selection through the LAD-Lasso. J. Bus. Econom. Statist. 25 347-355.
[38] Williams, D. A. (1987). Generalized linear model diagnostics using the deviance and single case deletions. J. R. Stat. Soc. Ser. C. Appl. Stat. 36 181-191. · Zbl 0646.62062 · doi:10.2307/2347550
[39] Xiang, L., Tse, S.-K. and Lee, A. H. (2002). Influence diagnostics for generalized linear mixed models: Applications to clustered data. Comput. Statist. Data Anal. 40 759-774. · Zbl 1103.62356 · doi:10.1016/S0167-9473(02)00075-0
[40] Zhang, H. H. and Lu, W. (2007). Adaptive Lasso for Cox’s proportional hazards model. Biometrika 94 691-703. · Zbl 1135.62083 · doi:10.1093/biomet/asm037
[41] Zhao, J., Leng, C., Li, L. and Wang, H. (2013). Supplement to “High-dimensional influence measure.” · Zbl 1360.62411
[42] Zhu, H., Ibrahim, J. G. and Cho, H. (2012). Perturbation and scaled Cook’s distance. Ann. Statist. 40 785-811. · Zbl 1273.62180 · doi:10.1214/12-AOS978
[43] Zhu, H., Ibrahim, J. G., Lee, S. and Zhang, H. (2007). Perturbation selection and influence measures in local influence analysis. Ann. Statist. 35 2565-2588. · Zbl 1129.62068 · doi:10.1214/009053607000000343
[44] Zhu, H., Lee, S.-Y., Wei, B.-C. and Zhou, J. (2001). Case-deletion measures for models with incomplete data. Biometrika 88 727-737. · Zbl 1006.62021 · doi:10.1093/biomet/88.3.727
[45] Zou, H. (2006). The adaptive LASSO and its oracle properties. J. Amer. Statist. Assoc. 101 1418-1429. · Zbl 1171.62326 · doi:10.1198/016214506000000735