Least squares after model selection in high-dimensional sparse models. (English) Zbl 1456.62066

Summary: In this article we study post-model selection estimators that apply ordinary least squares (OLS) to the model selected by first-step penalized estimators, typically Lasso. It is well known that Lasso can estimate the nonparametric regression function at nearly the oracle rate, and is thus hard to improve upon. We show that the OLS post-Lasso estimator performs at least as well as Lasso in terms of the rate of convergence, and has the advantage of a smaller bias. Remarkably, this performance occurs even if the Lasso-based model selection “fails” in the sense of missing some components of the “true” regression model. By the “true” model, we mean the best \(s\)-dimensional approximation to the nonparametric regression function chosen by the oracle. Furthermore, the OLS post-Lasso estimator can perform strictly better than Lasso, in the sense of a strictly faster rate of convergence, if the Lasso-based model selection correctly includes all components of the “true” model as a subset and also achieves sufficient sparsity. In the extreme case, when Lasso perfectly selects the “true” model, the OLS post-Lasso estimator becomes the oracle estimator. An important ingredient in our analysis is a new sparsity bound on the dimension of the model selected by Lasso, which guarantees that this dimension is at most of the same order as the dimension of the “true” model. Our rate results are nonasymptotic and hold in both parametric and nonparametric models. Moreover, our analysis is not limited to the Lasso estimator acting as a selector in the first step, but also applies to any other estimator, for example, various forms of thresholded Lasso, with good rates and good sparsity properties. Our analysis covers both traditional thresholding and a new practical, data-driven thresholding scheme that induces additional sparsity subject to maintaining a certain goodness of fit. The latter scheme has theoretical guarantees similar to those of Lasso or OLS post-Lasso, but it dominates those procedures as well as traditional thresholding in a wide variety of experiments.
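The two-step procedure described above can be sketched in a few lines; the following is a minimal illustration only, assuming a scikit-learn-style Lasso as the first-step selector with a hypothetical penalty level, and omitting the data-driven choice of the penalty and the thresholding refinements studied in the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p, s = 100, 500, 5                      # n observations, p regressors, s-sparse "true" model
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 1.0                             # first s coefficients form the oracle model
y = X @ beta + 0.5 * rng.standard_normal(n)

# Step 1: Lasso as a model selector (alpha = 0.1 is an illustrative stand-in for the penalty level).
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)     # estimated support of the Lasso solution

# Step 2: OLS post-Lasso -- refit by ordinary least squares on the selected columns only.
post = LinearRegression().fit(X[:, selected], y)

beta_post = np.zeros(p)
beta_post[selected] = post.coef_           # post-Lasso coefficient vector, zero off the support
print("selected model size:", selected.size)
```

The refit removes the shrinkage bias that the Lasso penalty imposes on the selected coefficients, which is the source of the improvement discussed in the summary; if the first step selects no variables at all, the refit step is vacuous and one would simply return the zero vector.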

MSC:

62G08 Nonparametric regression and quantile regression
62G20 Asymptotic properties of nonparametric inference
62J07 Ridge regression; shrinkage estimators (Lasso)
