
An effective procedure for feature subset selection in logistic regression based on information criteria. (English) Zbl 1472.62120

Summary: In this paper, the problem of best subset selection in logistic regression is addressed. In particular, we consider formulations of the problem that adopt information criteria, such as AIC or BIC, as goodness-of-fit measures. Various methods exist to tackle this problem. Heuristic methods are computationally cheap but usually find only low-quality solutions, and methods based on local optimization suffer from similar limitations. Methods based on mixed-integer reformulations of the problem, on the other hand, are much more effective, at the cost of higher computational requirements that become unsustainable as the problem size grows. We therefore propose a new approach that combines mixed-integer programming and decomposition techniques to overcome these scalability issues, and we provide a theoretical characterization of the properties of the proposed algorithm. The results of an extensive numerical experiment on widely available datasets show that the proposed method outperforms state-of-the-art techniques.
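To make the role of the information criterion concrete, the sketch below shows the baseline that the summary contrasts against: a plain exhaustive best-subset search in NumPy that fits a logistic model on every feature subset and keeps the one minimizing AIC = 2k − 2 log L̂ (BIC would replace the 2k penalty with k log n). This is not the paper's algorithm; the helper names `fit_logistic` and `best_subset_aic` are hypothetical, and the Newton fit is a minimal illustration, not a robust solver.

```python
import itertools
import numpy as np

def fit_logistic(X, y, iters=25):
    """Fit logistic regression (with intercept) by Newton's method;
    return the maximized log-likelihood."""
    Z = np.column_stack([np.ones(len(y)), X])
    w = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(Z @ w, -30.0, 30.0)))
        g = Z.T @ (y - p)                         # gradient of log-likelihood
        H = Z.T @ (Z * (p * (1.0 - p))[:, None])  # (negated) Hessian
        w += np.linalg.solve(H + 1e-8 * np.eye(len(w)), g)  # Newton ascent step
    p = 1.0 / (1.0 + np.exp(-np.clip(Z @ w, -30.0, 30.0)))
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def best_subset_aic(X, y):
    """Exhaustive search: minimize AIC = 2k - 2*logL over all feature subsets."""
    best_aic, best_S = np.inf, ()
    for r in range(X.shape[1] + 1):
        for S in itertools.combinations(range(X.shape[1]), r):
            ll = fit_logistic(X[:, list(S)], y)
            k = len(S) + 1  # selected coefficients plus the intercept
            aic = 2 * k - 2 * ll
            if aic < best_aic:
                best_aic, best_S = aic, S
    return best_aic, best_S
```

The loop visits all 2^p subsets, which is exactly why this brute-force baseline stops being viable beyond a few dozen features and why the mixed-integer and decomposition machinery discussed in the summary is needed at scale.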

MSC:

62J12 Generalized linear models (logistic models)
62-08 Computational methods for problems pertaining to statistics
90C11 Mixed integer programming
90C05 Linear programming
