
A tradeoff between false discovery and true positive proportions for sparse high-dimensional logistic regression. (English) Zbl 07823217

Summary: The logistic regression model is a simple and classic approach to binary classification. In sparse high-dimensional settings, one believes that only a small proportion of the predictive variables are relevant to the response variable, i.e., have nonnull regression coefficients. We focus on regularized logistic regression models; the analysis is valid for a large class of regularizers, including folded-concave regularizers such as the MCP and SCAD. For finite samples, the discrepancy between the estimated and true nonnull coefficients is evaluated by the false discovery and true positive rates. We show that, asymptotically, the false discovery rate can be described as a nonlinear tradeoff function of the power via a system of equations in six parameters. The analysis is conducted in an “average-over-components” fashion for the unknown parameter and follows the conventional assumptions of the relevant literature. More specifically, we assume a linear growth rate \(n/p \to \delta > 0\), covering not only the typical high-dimensional settings where \(p \geq n\) but also the case \(n > p\). Further, we propose two applications of this tradeoff function that improve the reproducibility of variable selection: (1) a sample size calculation procedure that achieves a prescribed power under a prespecified level of the false discovery rate; (2) a calibration of the false discovery rate for variable selection that takes power into consideration. A similar asymptotic analysis is carried out for the model-X knockoff, which yields a selection with a controlled false discovery rate, showing how two selection methods can be compared through their tradeoff curves. We illustrate the tradeoff analysis and its applications using simulated and real data.
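The false discovery and true positive proportions discussed above can be illustrated with a small simulation. The sketch below is not the paper's method: the problem sizes, the regularization level, and the plain ISTA (proximal gradient) solver are all illustrative assumptions. It fits an L1-regularized logistic regression on data generated in the linear growth regime \(n/p = \delta\) and reports the empirical FDP and TPP of the selected support:

```python
# Illustrative sketch only: empirical FDP/TPP of an L1-regularized
# logistic regression fit by proximal gradient descent (ISTA).
# All sizes and tuning constants below are assumptions, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 400, 200, 20              # linear growth regime: n/p = delta = 2
beta = np.zeros(p)
beta[:k] = 1.0                      # k nonnull (relevant) coefficients
X = rng.standard_normal((n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30))))

def sigmoid(z):
    # Clipping avoids overflow in exp for large |z|.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def l1_logistic(X, y, lam, steps=2000, lr=0.5):
    """ISTA: gradient step on the logistic loss, then soft-thresholding."""
    b = np.zeros(X.shape[1])
    for _ in range(steps):
        b = b - lr * X.T @ (sigmoid(X @ b) - y) / len(y)
        b = np.sign(b) * np.maximum(np.abs(b) - lr * lam, 0.0)
    return b

b_hat = l1_logistic(X, y, lam=0.05)
selected = b_hat != 0.0
nonnull = beta != 0.0
fdp = (selected & ~nonnull).sum() / max(selected.sum(), 1)  # false discoveries
tpp = (selected & nonnull).sum() / nonnull.sum()            # power (TPP)
print(f"FDP = {fdp:.3f}, TPP = {tpp:.3f}")
```

Sweeping the hypothetical penalty level `lam` traces out an empirical FDP-versus-TPP curve, the finite-sample analogue of the asymptotic tradeoff function; replacing the soft-thresholding step with an MCP or SCAD thresholding operator would give the folded-concave variants mentioned in the summary.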

MSC:

62J12 Generalized linear models (logistic models)
62F99 Parametric inference

References:

[1] ABBASI, E. (2020). Universality Laws and Performance Analysis of the Generalized Linear Models, PhD thesis, California Institute of Technology. MathSciNet: MR4639752
[2] BARBER, R. F. and CANDÈS, E. J. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics 43 2055-2085. MathSciNet: MR3375876 · Zbl 1327.62082
[3] BAYATI, M. and MONTANARI, A. (2011). The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory 57 764-785. MathSciNet: MR2810285 · Zbl 1366.94079
[4] BAYATI, M. and MONTANARI, A. (2012). The LASSO risk for Gaussian matrices. IEEE Transactions on Information Theory 58 1997-2017. MathSciNet: MR2951312 · Zbl 1365.62196
[5] BENJAMINI, Y. and HOCHBERG, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57 289-300. MathSciNet: MR1325392 · Zbl 0809.62014
[6] BLANCHARD, G. and ROQUAIN, É. (2009). Adaptive false discovery rate control under independence and dependence. Journal of Machine Learning Research 10. MathSciNet: MR2579914 · Zbl 1235.62093
[7] BOGDAN, M., VAN DEN BERG, E., SABATTI, C., SU, W. and CANDÈS, E. J. (2015). SLOPE—adaptive variable selection via convex optimization. The Annals of Applied Statistics 9 1103. MathSciNet: MR3418717 · Zbl 1454.62212
[8] BRADIC, J. (2016). Robustness in sparse high-dimensional linear models: Relative efficiency and robust approximate message passing. Electronic Journal of Statistics 10 3894-3944. MathSciNet: MR3581957 · Zbl 1357.62215
[9] BU, Z., KLUSOWSKI, J., RUSH, C. and SU, W. (2019). Algorithmic analysis and statistical estimation of slope via approximate message passing. Advances in Neural Information Processing Systems 32 9366-9376. MathSciNet: MR4231969
[10] CAI, Z., LI, R. and ZHANG, Y. (2022). A distribution free conditional independence test with applications to causal discovery. Journal of Machine Learning Research 23 1-41. MathSciNet: MR4576670
[11] CANDÈS, E., FAN, Y., JANSON, L. and LV, J. (2018). Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80 551-577. MathSciNet: MR3798878 · Zbl 1398.62335
[12] CELENTANO, M. and MONTANARI, A. (2021). CAD: Debiasing the Lasso with inaccurate covariate model. arXiv preprint arXiv:2107.14172.
[13] CELENTANO, M. and MONTANARI, A. (2022). Fundamental barriers to high-dimensional regression with convex penalties. The Annals of Statistics 50 170-196. MathSciNet: MR4382013 · Zbl 1486.62198
[14] DONOHO, D. and MONTANARI, A. (2016). High dimensional robust M-estimation: Asymptotic variance via approximate message passing. Probability Theory and Related Fields 166 935-969. MathSciNet: MR3568043 · Zbl 1357.62220
[15] DONOHO, D. L., MALEKI, A. and MONTANARI, A. (2009). Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences 106 18914-18919. MathSciNet: MR4158199
[16] FAN, J. and LI, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 1348-1360. MathSciNet: MR1946581 · Zbl 1073.62547
[17] FAN, J. and LV, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 849-911. MathSciNet: MR2530322 · Zbl 1411.62187
[18] FAN, J., MA, Y. and DAI, W. (2014). Nonparametric independence screening in sparse ultra-high-dimensional varying coefficient models. Journal of the American Statistical Association 109 1270-1284. MathSciNet: MR3265696 · Zbl 1368.62095
[19] FAN, J. and PENG, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics 32 928-961. MathSciNet: MR2065194 · Zbl 1092.62031
[20] FAN, J. and SONG, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics 38 3567-3604. MathSciNet: MR2766861 · Zbl 1206.68157 · doi:10.1214/10-AOS798
[21] FAN, J., XUE, L. and ZOU, H. (2014). Strong oracle optimality of folded concave penalized estimation. Annals of Statistics 42 819. MathSciNet: MR3210988 · Zbl 1305.62252
[22] FAN, Z. (2022). Approximate Message Passing algorithms for rotationally invariant matrices. The Annals of Statistics 50 197-224. MathSciNet: MR4382014 · Zbl 1486.94026
[23] FARCOMENI, A. (2006). More powerful control of the false discovery rate under dependence. Statistical Methods and Applications 15 43-73. MathSciNet: MR2281214 · Zbl 1187.62130
[24] FENG, O. Y., VENKATARAMANAN, R., RUSH, C. and SAMWORTH, R. J. (2022). A unifying tutorial on approximate message passing. Foundations and Trends® in Machine Learning 15 335-536. · Zbl 1491.68152
[25] FITHIAN, W. and LEI, L. (2020). Conditional calibration for false discovery rate control under dependence. arXiv preprint arXiv:2007.10438. MathSciNet: MR4524490
[26] GENOVESE, C. and WASSERMAN, L. (2004). A stochastic process approach to false discovery control. The Annals of Statistics 32 1035-1061. MathSciNet: MR2065197 · Zbl 1092.62065
[27] GORDON, Y. (1985). Some inequalities for Gaussian processes and applications. Israel Journal of Mathematics 50 265-289. MathSciNet: MR0800188 · Zbl 0663.60034
[28] GORDON, Y. (1988). On Milman’s inequality and random subspaces which escape through a mesh in \(R^n\). In Geometric Aspects of Functional Analysis 84-106. Springer. MathSciNet: MR0950977 · Zbl 0651.46021
[29] HE, X., WANG, L. and HONG, H. G. (2013). Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. The Annals of Statistics 41 342-369. MathSciNet: MR3059421 · Zbl 1295.62053 · doi:10.1214/13-AOS1087
[30] JANSON, L. and SU, W. (2016). Familywise error rate control via knockoffs. Electronic Journal of Statistics 10 960-975. MathSciNet: MR3486422 · Zbl 1341.62245
[31] KELNER, J. A., KOEHLER, F., MEKA, R. and ROHATGI, D. (2022). On the power of preconditioning in sparse linear regression. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS) 550-561. IEEE. MathSciNet: MR4399714
[32] LEE, J. D., SUN, Y. and TAYLOR, J. E. (2015). On model selection consistency of regularized M-estimators. Electronic Journal of Statistics 9 608-642. MathSciNet: MR3331852 · Zbl 1309.62044
[33] LI, R., ZHONG, W. and ZHU, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association 107 1129-1139. MathSciNet: MR3010900 · Zbl 1443.62184
[34] LIU, W., KE, Y., LIU, J. and LI, R. (2022). Model-free feature screening and FDR control with knockoff features. Journal of the American Statistical Association 117 428-443. MathSciNet: MR4399096 · Zbl 1506.62303
[35] MAI, Q. and ZOU, H. (2015). The fused Kolmogorov filter: A nonparametric model-free screening method. The Annals of Statistics 43 1471-1497. MathSciNet: MR3357868 · Zbl 1431.62216
[36] MEINSHAUSEN, N. and BÜHLMANN, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 34 1436-1462. MathSciNet: MR2278363 · Zbl 1113.62082 · doi:10.1214/009053606000000281
[37] PAN, W., WANG, X., XIAO, W. and ZHU, H. (2018). A generic sure independence screening procedure. Journal of the American Statistical Association. MathSciNet: MR3963192
[38] RANGAN, S., SCHNITER, P., FLETCHER, A. K. and SARKAR, S. (2019). On the convergence of approximate message passing with arbitrary matrices. IEEE Transactions on Information Theory 65 5339-5351. MathSciNet: MR4009237 · Zbl 1432.94037
[39] RANGAN, S., SCHNITER, P., RIEGLER, E., FLETCHER, A. K. and CEVHER, V. (2016). Fixed points of generalized approximate message passing with arbitrary matrices. IEEE Transactions on Information Theory 62 7464-7474. MathSciNet: MR3599094 · Zbl 1359.94158
[40] SALEHI, F., ABBASI, E. and HASSIBI, B. (2019). The impact of regularization on high-dimensional logistic regression. Advances in Neural Information Processing Systems 32.
[41] SU, W., BOGDAN, M. and CANDÈS, E. (2017). False discoveries occur early on the lasso path. The Annals of Statistics 45 2133-2150. MathSciNet: MR3718164 · Zbl 1459.62142
[42] SUR, P. and CANDÈS, E. J. (2019). A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences 116 14516-14525. MathSciNet: MR3984492 · Zbl 1431.62084
[43] SUR, P. and CANDÈS, E. J. (2019). A modern maximum-likelihood theory for high-dimensional logistic regression, PhD thesis, Stanford University. MathSciNet: MR4197622
[44] SUR, P., CHEN, Y. and CANDÈS, E. J. (2019). The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square. Probability Theory and Related Fields 175 487-558. MathSciNet: MR4009715 · Zbl 1431.62319
[45] THRAMPOULIDIS, C., ABBASI, E. and HASSIBI, B. (2018). Precise error analysis of regularized M-estimators in high dimensions. IEEE Transactions on Information Theory 64 5592-5628. MathSciNet: MR3832326 · Zbl 1401.94051
[46] THRAMPOULIDIS, C. and HASSIBI, B. (2015). Isotropically random orthogonal matrices: Performance of lasso and minimum conic singular values. In 2015 IEEE International Symposium on Information Theory (ISIT) 556-560. IEEE.
[47] TONG, Z., CAI, Z., YANG, S. and LI, R. (2022). Model-free conditional feature screening with FDR control. Journal of the American Statistical Association 1-13. MathSciNet: MR4681605
[48] WANG, S., WENG, H. and MALEKI, A. (2020). Which bridge estimator is the best for variable selection? The Annals of Statistics 48 2791-2823. MathSciNet: MR4152121 · Zbl 1456.62147 · doi:10.1214/19-AOS1906
[49] WEINSTEIN, A., BARBER, R. and CANDÈS, E. (2017). A power and prediction analysis for knockoffs with lasso statistics. arXiv preprint arXiv:1712.06465.
[50] WEINSTEIN, A., SU, W. J., BOGDAN, M., BARBER, R. F. and CANDÈS, E. J. (2020). A power analysis for knockoffs with the lasso coefficient-difference statistic. arXiv preprint arXiv:2007.15346.
[51] WU, Y. and YIN, G. (2015). Conditional quantile screening in ultrahigh-dimensional heterogeneous data. Biometrika 102 65-76. MathSciNet: MR3335096 · Zbl 1345.62097 · doi:10.1093/biomet/asu068
[52] XU, J., MALEKI, A., RAD, K. R. and HSU, D. (2021). Consistent risk estimation in moderately high-dimensional linear regression. IEEE Transactions on Information Theory 67 5997-6030. MathSciNet: MR4345048 · Zbl 1486.62208
[53] YANG, G., YU, Y., LI, R. and BUU, A. (2016). Feature screening in ultrahigh dimensional Cox’s model. Statistica Sinica 26 881. MathSciNet: MR3559935 · Zbl 1356.62175
[54] ZHANG, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38 894-942. MathSciNet: MR2604701 · Zbl 1183.62120 · doi:10.1214/09-AOS729
[55] ZHAO, P. and YU, B. (2006). On model selection consistency of Lasso. The Journal of Machine Learning Research 7 2541-2563. MathSciNet: MR2274449 · Zbl 1222.62008
[56] ZHAO, Q., SUR, P. and CANDÈS, E. J. (2023). The asymptotic distribution of the MLE in high-dimensional logistic models: Arbitrary covariance. Bernoulli 28. MathSciNet: MR4411513
[57] ZHOU, J., CLAESKENS, G. and BRADIC, J. (2020). Detangling robustness in high dimensions: composite versus model-averaged estimation. Electronic Journal of Statistics 14 2551-2599. MathSciNet: MR4122516 · Zbl 1450.62087
[58] ZOU, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101 1418-1429. MathSciNet: MR2279469 · Zbl 1171.62326