Statistical inference for high-dimensional generalized linear models with binary outcomes. (English) Zbl 07707243

Summary: This article develops a unified statistical inference framework for high-dimensional binary generalized linear models (GLMs) with general link functions. Both unknown and known design distribution settings are considered. A two-step weighted bias-correction method is proposed for constructing confidence intervals (CIs) and simultaneous hypothesis tests for individual components of the regression vector. A minimax lower bound for the expected length is established, and the proposed CIs are shown to be rate-optimal up to a logarithmic factor. The numerical performance of the proposed procedure is demonstrated through simulation studies and an analysis of a single-cell RNA-seq dataset, which yields biological insights consistent with the current literature on cellular immune response mechanisms as characterized by single-cell transcriptomics. The theoretical analysis provides important insights into the adaptivity of optimal CIs with respect to the sparsity of the regression vector. New lower bound techniques are introduced that may be of independent interest for solving other inference problems in high-dimensional binary GLMs.
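
For orientation, a generic one-step weighted bias correction for a binary GLM can be written as below. This is only a sketch of the general debiasing template, not the article's exact construction: the summary does not specify the weights or the projection direction, so $f$ (inverse link), $\hat\beta$ (initial sparse estimator), $w_i$ (weights), and $\hat u_j$ (projection direction) are assumed notation.

```latex
% Assumed notation (not taken from the summary): f is the inverse link,
% \hat\beta an initial sparse estimator (e.g. an l1-penalized fit),
% w_i generic weights, \hat u_j a projection direction for coordinate j.
\begin{align*}
\mathbb{P}(y_i = 1 \mid x_i) &= f(x_i^\top \beta), \qquad i = 1, \dots, n, \\
\tilde\beta_j &= \hat\beta_j + \hat u_j^\top \frac{1}{n} \sum_{i=1}^n w_i\, x_i \bigl( y_i - f(x_i^\top \hat\beta) \bigr), \\
\hat V_j &= \frac{1}{n^2} \sum_{i=1}^n w_i^2\, (\hat u_j^\top x_i)^2\, f(x_i^\top \hat\beta) \bigl( 1 - f(x_i^\top \hat\beta) \bigr), \\
\mathrm{CI}_j &= \Bigl[ \tilde\beta_j - z_{\alpha/2}\, \hat V_j^{1/2},\ \tilde\beta_j + z_{\alpha/2}\, \hat V_j^{1/2} \Bigr].
\end{align*}
```

In this template, $\hat u_j$ would be chosen so that $\frac{1}{n} \sum_{i=1}^n w_i f'(x_i^\top \hat\beta)\, x_i x_i^\top \hat u_j$ is close to the $j$th standard basis vector, which cancels the leading bias of $\hat\beta_j$; $z_{\alpha/2}$ denotes the upper $\alpha/2$ quantile of the standard normal distribution.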

MSC:

62-XX Statistics

Software:

hdi; SIHR; RSEM
