
Best subset selection via a modern optimization lens. (English) Zbl 1335.62115

Summary: In the period 1991-2015, algorithmic advances in Mixed Integer Optimization (MIO), coupled with hardware improvements, have resulted in an astonishing 450-billion-fold speedup in solving MIO problems. We present an MIO approach for solving the classical best subset selection problem of choosing \(k\) out of \(p\) features in linear regression given \(n\) observations. We develop a discrete extension of modern first-order continuous optimization methods to find high-quality feasible solutions that we use as warm starts to an MIO solver that finds provably optimal solutions. The resulting algorithm (a) provides a solution with a guarantee on its suboptimality even if we terminate the algorithm early, (b) can accommodate side constraints on the coefficients of the linear regression, and (c) extends to finding best subset solutions for the least absolute deviation loss function. Using a wide variety of synthetic and real datasets, we demonstrate that our approach solves problems with \(n\) in the 1000s and \(p\) in the 100s in minutes to provable optimality, and finds near-optimal solutions for \(n\) in the 100s and \(p\) in the 1000s in minutes. We also establish via numerical experiments that the MIO approach performs better than the Lasso and other commonly used sparse learning procedures in terms of achieving sparse solutions with good predictive power.
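For orientation, the problem treated in the summary is the cardinality-constrained least squares problem \(\min_{\beta} \frac{1}{2}\|y - X\beta\|_2^2\) subject to \(\|\beta\|_0 \le k\), which the authors recast as a mixed integer quadratic program (for instance with binary indicators \(z_i \in \{0,1\}\), \(\sum_i z_i \le k\), and bounds of the form \(-\mathcal{M} z_i \le \beta_i \le \mathcal{M} z_i\)). The sketch below is a minimal, illustrative implementation of a discrete first-order (projected-gradient / hard-thresholding) warm-start heuristic of the kind described in the summary; it is not the authors' code, the function names are invented for this illustration, and it assumes NumPy arrays X (n × p) and y (length n). The returned \(k\)-sparse vector would then be handed to an MIO solver such as Gurobi as a warm start.

```python
import numpy as np

def hard_threshold(beta, k):
    # Keep the k largest-magnitude entries of beta and zero out the rest.
    out = np.zeros_like(beta)
    if k > 0:
        keep = np.argsort(np.abs(beta))[-k:]
        out[keep] = beta[keep]
    return out

def discrete_first_order(X, y, k, max_iter=500, tol=1e-7):
    """Hard-thresholding (projected gradient) heuristic for
    min_beta 0.5 * ||y - X beta||^2  subject to  ||beta||_0 <= k."""
    n, p = X.shape
    # Step size 1/L, where L bounds the largest eigenvalue of X^T X.
    L = np.linalg.norm(X, 2) ** 2
    beta = hard_threshold(X.T @ y / L, k)           # a simple k-sparse start
    for _ in range(max_iter):
        grad = X.T @ (X @ beta - y)                 # gradient of the LS loss
        beta_new = hard_threshold(beta - grad / L, k)
        if np.linalg.norm(beta_new - beta) <= tol * max(np.linalg.norm(beta), 1.0):
            beta = beta_new
            break
        beta = beta_new
    # Polish: refit least squares restricted to the selected support.
    support = np.flatnonzero(beta)
    if support.size:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta = np.zeros(p)
        beta[support] = coef
    return beta

# Example on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 100))
beta_true = np.zeros(100)
beta_true[:5] = 1.0
y = X @ beta_true + 0.1 * rng.standard_normal(1000)
beta_hat = discrete_first_order(X, y, k=5)
```

The final refit on the selected support follows the spirit of the paper's polishing step; any guarantee of optimality, however, comes only from the subsequent MIO solve, which this sketch does not reproduce.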

MSC:

62J05 Linear regression; mixed models
62J07 Ridge regression; shrinkage estimators (Lasso)
62G35 Nonparametric robustness
90C11 Mixed integer programming
90C26 Nonconvex programming, global optimization
90C27 Combinatorial optimization

References:

[1] Bandeira, A. S., Dobriban, E., Mixon, D. G. and Sawin, W. F. (2013). Certifying the restricted isometry property is hard. IEEE Trans. Inform. Theory 59 3448-3450. · Zbl 1364.94109 · doi:10.1109/TIT.2013.2248414
[2] Bertsimas, D., King, A. and Mazumder, R. (2015). Supplement to “Best subset selection via a modern optimization lens.” · Zbl 1335.62115 · doi:10.1214/15-AOS1388
[3] Bertsimas, D. and Mazumder, R. (2014). Least quantile regression via modern optimization. Ann. Statist. 42 2494-2525. · Zbl 1302.62154 · doi:10.1214/14-AOS1223
[4] Bertsimas, D. and Shioda, R. (2009). Algorithm for cardinality-constrained quadratic optimization. Comput. Optim. Appl. 43 1-22. · Zbl 1178.90262 · doi:10.1007/s10589-007-9126-9
[5] Bertsimas, D. and Weismantel, R. (2005). Optimization Over Integers. Dynamic Ideas, Belmont.
[6] Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705-1732. · Zbl 1173.62022 · doi:10.1214/08-AOS620
[7] Bienstock, D. (1996). Computational study of a family of mixed-integer quadratic programming problems. Math. Programming 74 121-140. · Zbl 0855.90090 · doi:10.1007/BF02592208
[8] Bixby, R. E. (2012). A brief history of linear and mixed-integer programming computation. Doc. Math. Extra Volume: Optimization Stories 107-121. · Zbl 1270.90003
[9] Blumensath, T. and Davies, M. E. (2008). Iterative thresholding for sparse approximations. J. Fourier Anal. Appl. 14 629-654. · Zbl 1175.94060 · doi:10.1007/s00041-008-9035-z
[10] Blumensath, T. and Davies, M. E. (2009). Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal. 27 265-274. · Zbl 1174.94008 · doi:10.1016/j.acha.2009.04.002
[11] Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, Heidelberg. · Zbl 1273.62015 · doi:10.1007/978-3-642-20192-9
[12] Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674-1697. · Zbl 1209.62065 · doi:10.1214/009053606000001587
[13] Candès, E. J. (2008). The restricted isometry property and its implications for compressed sensing. C. R. Math. Acad. Sci. Paris 346 589-592. · Zbl 1153.94002 · doi:10.1016/j.crma.2008.03.014
[14] Candès, E. J. and Plan, Y. (2009). Near-ideal model selection by \(\ell_{1}\) minimization. Ann. Statist. 37 2145-2177. · Zbl 1173.62053 · doi:10.1214/08-AOS653
[15] Candès, E. J. and Tao, T. (2006). Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inform. Theory 52 5406-5425. · Zbl 1309.94033 · doi:10.1109/TIT.2006.885507
[16] Candès, E. J., Wakin, M. B. and Boyd, S. P. (2008). Enhancing sparsity by reweighted \(l_{1}\) minimization. J. Fourier Anal. Appl. 14 877-905. · Zbl 1176.94014 · doi:10.1007/s00041-008-9045-x
[17] Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 33-61. · Zbl 0919.94002 · doi:10.1137/S1064827596304010
[18] Dettling, M. (2004). BagBoosting for tumor classification with gene expression data. Bioinformatics 20 3583-3593.
[19] Donoho, D. L. (2006). For most large underdetermined systems of linear equations the minimal \(l_{1}\)-norm solution is also the sparsest solution. Comm. Pure Appl. Math. 59 797-829. · Zbl 1113.15004 · doi:10.1002/cpa.20132
[20] Donoho, D. L. and Elad, M. (2003). Optimally sparse representation in general (nonorthogonal) dictionaries via \(l^{1}\) minimization. Proc. Natl. Acad. Sci. USA 100 2197-2202 (electronic). · Zbl 1064.94011 · doi:10.1073/pnas.0437847100
[21] Donoho, D. L. and Huo, X. (2001). Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inform. Theory 47 2845-2862. · Zbl 1019.94503 · doi:10.1109/18.959265
[22] Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 425-455. · Zbl 0815.62019 · doi:10.1093/biomet/81.3.425
[23] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407-499. · Zbl 1091.62054 · doi:10.1214/009053604000000067
[24] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348-1360. · Zbl 1073.62547 · doi:10.1198/016214501753382273
[25] Fan, J. and Lv, J. (2011). Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans. Inform. Theory 57 5467-5484. · Zbl 1365.62277 · doi:10.1109/TIT.2011.2158486
[26] Fan, Y. and Lv, J. (2013). Asymptotic equivalence of regularization methods in thresholded parameter space. J. Amer. Statist. Assoc. 108 1044-1061. · Zbl 06224986 · doi:10.1080/01621459.2013.803972
[27] Frank, I. and Friedman, J. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics 35 109-148. · Zbl 0775.62288 · doi:10.2307/1269656
[28] Friedman, J. (2008). Fast sparse regression and classification. Technical report, Dept. Statistics, Stanford Univ., Stanford, CA.
[29] Friedman, J., Hastie, T., Höfling, H. and Tibshirani, R. (2007). Pathwise coordinate optimization. Ann. Appl. Stat. 1 302-332. · Zbl 1378.90064 · doi:10.1214/07-AOAS131
[30] Furnival, G. and Wilson, R. (1974). Regression by leaps and bounds. Technometrics 16 499-511. · Zbl 0285.05110
[31] Greenshtein, E. (2006). Best subset selection, persistence in high-dimensional statistical learning and optimization under \(l_{1}\) constraint. Ann. Statist. 34 2367-2386. · Zbl 1106.62022 · doi:10.1214/009053606000000768
[32] Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10 971-988. · Zbl 1055.62078 · doi:10.3150/bj/1106314846
[33] Gurobi Optimization, Inc. (2013). Gurobi optimizer reference manual. Available at .
[34] Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer Series in Statistics. Springer, New York. · Zbl 1273.62005 · doi:10.1007/978-0-387-84858-7
[35] Knight, K. and Fu, W. (2000). Asymptotics for lasso-type estimators. Ann. Statist. 28 1356-1378. · Zbl 1105.62357 · doi:10.1214/aos/1015957397
[36] Loh, P.-L. and Wainwright, M. (2013). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In Advances in Neural Information Processing Systems 476-484. Curran Associates, Red Hook, NY.
[37] Lv, J. and Fan, Y. (2009). A unified approach to model selection and sparse recovery using regularized least squares. Ann. Statist. 37 3498-3528. · Zbl 1369.62156 · doi:10.1214/09-AOS683
[38] Mazumder, R., Friedman, J. H. and Hastie, T. (2011). SparseNet: Coordinate descent with nonconvex penalties. J. Amer. Statist. Assoc. 106 1125-1138. · Zbl 1229.62091 · doi:10.1198/jasa.2011.tm09738
[39] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436-1462. · Zbl 1113.62082 · doi:10.1214/009053606000000281
[40] Miller, A. (2002). Subset Selection in Regression, 2nd ed. Monographs on Statistics and Applied Probability 95. Chapman & Hall/CRC, Boca Raton, FL. · Zbl 1051.62060
[41] Natarajan, B. K. (1995). Sparse approximate solutions to linear systems. SIAM J. Comput. 24 227-234. · Zbl 0827.68054 · doi:10.1137/S0097539792240406
[42] Nemhauser, G. (2013). Integer programming: The global impact. Presented at EURO, INFORMS, Rome, Italy, 2013. Accessed 2013-12-04. Available at .
[43] Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization 87. Kluwer Academic, Boston, MA. · Zbl 1086.90045
[44] Nesterov, Yu. (2005). Smooth minimization of non-smooth functions. Math. Program. 103 127-152. · Zbl 1079.90102 · doi:10.1007/s10107-004-0552-5
[45] Nesterov, Yu. (2013). Gradient methods for minimizing composite functions. Math. Program. 140 125-161. · Zbl 1287.90067 · doi:10.1007/s10107-012-0629-5
[46] Gurobi Optimization, Inc. (2015). Gurobi 6.0 performance benchmarks. Available at . Accessed 5 September 2015.
[47] Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over \(\ell_{q}\)-balls. IEEE Trans. Inform. Theory 57 6976-6994. · Zbl 1365.62276 · doi:10.1109/TIT.2011.2165799
[48] Shen, X., Pan, W., Zhu, Y. and Zhou, H. (2013). On constrained and regularized high-dimensional regression. Ann. Inst. Statist. Math. 65 807-832. · Zbl 1329.62307 · doi:10.1007/s10463-012-0396-3
[49] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267-288. · Zbl 0850.62538
[50] Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: A retrospective. J. R. Stat. Soc. Ser. B Stat. Methodol. 73 273-282. · doi:10.1111/j.1467-9868.2011.00771.x
[51] Top500 Supercomputer Sites (2015). Directory page for Top500 lists: Result for each list since June 1993. Accessed 2015-09-15. Available at .
[52] Tropp, J. A. (2006). Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans. Inform. Theory 52 1030-1051. · Zbl 1288.94025 · doi:10.1109/TIT.2008.2009806
[53] van de Geer, S., Bühlmann, P. and Zhou, S. (2011). The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso). Electron. J. Stat. 5 688-749. · Zbl 1274.62471 · doi:10.1214/11-EJS624
[54] Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using \(\ell_{1}\)-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory 55 2183-2202. · Zbl 1367.62220 · doi:10.1109/TIT.2009.2016018
[55] Zhang, C.-H. (2010a). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38 894-942. · Zbl 1183.62120 · doi:10.1214/09-AOS729
[56] Zhang, T. (2010b). Analysis of multi-stage convex relaxation for sparse regularization. J. Mach. Learn. Res. 11 1081-1107. · Zbl 1242.68262
[57] Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the LASSO selection in high-dimensional linear regression. Ann. Statist. 36 1567-1594. · Zbl 1142.62044 · doi:10.1214/07-AOS520
[58] Zhang, Y., Wainwright, M. and Jordan, M. I. (2014). Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. Preprint. Available at arXiv:1402.1918.
[59] Zhang, C.-H. and Zhang, T. (2012). A general theory of concave regularization for high-dimensional sparse estimation problems. Statist. Sci. 27 576-593. · Zbl 1331.62353 · doi:10.1214/12-STS399
[60] Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541-2563. · Zbl 1222.62008
[61] Zheng, Z., Fan, Y. and Lv, J. (2014). High dimensional thresholded regression and shrinkage effect. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 627-649. · doi:10.1111/rssb.12037
[62] Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418-1429. · Zbl 1171.62326 · doi:10.1198/016214506000000735
[63] Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Ann. Statist. 36 1509-1533. · Zbl 1142.62027 · doi:10.1214/009053607000000802
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.