Best subset binary prediction. (English) Zbl 1398.62362

Summary: We consider a variable selection problem for the prediction of binary outcomes. We study the best subset selection procedure by which the covariates are chosen by maximizing C. F. Manski’s [J. Econom. 3, 205–228 (1975; Zbl 0307.62068); ibid. 27, 313–333 (1985; Zbl 0567.62096)] maximum score objective function subject to a constraint on the maximal number of selected variables. We show that this procedure can be equivalently reformulated as solving a mixed integer optimization problem, which enables computation of the exact or an approximate solution with a definite approximation error bound. In terms of theoretical results, we obtain non-asymptotic upper and lower risk bounds when the dimension of potential covariates is possibly much larger than the sample size. Our upper and lower risk bounds are minimax rate-optimal when the maximal number of selected variables is fixed and does not increase with the sample size. We illustrate the usefulness of the best subset binary prediction approach via Monte Carlo simulations and an empirical application to work-trip transportation mode choice.
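
To make the selection problem in the summary concrete, the following is a minimal sketch of best subset maximum score prediction on toy data. It is not the mixed integer optimization approach of the paper: it swaps in a brute-force enumeration over supports and a small coefficient grid, which is only feasible for tiny problems. The function names `score` and `best_subset_max_score`, the coefficient grid, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations, product


def score(beta, X, y):
    """Maximum score objective: share of observations with 1{x'beta >= 0} == y."""
    hits = 0
    for x_i, y_i in zip(X, y):
        index = sum(b * v for b, v in zip(beta, x_i))
        hits += int((index >= 0) == bool(y_i))
    return hits / len(y)


def best_subset_max_score(X, y, q, grid=(-1.0, -0.5, 0.5, 1.0)):
    """Brute-force stand-in for the exact solver (illustrative assumption).

    Searches all coefficient vectors with at most q nonzero entries whose
    nonzero values are taken from `grid`, and keeps the first maximizer found.
    """
    p = len(X[0])
    best_beta = [0.0] * p
    best_val = score(best_beta, X, y)
    for k in range(1, q + 1):
        for support in combinations(range(p), k):
            for values in product(grid, repeat=k):
                beta = [0.0] * p
                for j, v in zip(support, values):
                    beta[j] = v
                val = score(beta, X, y)
                if val > best_val:
                    best_val, best_beta = val, beta
    return best_beta, best_val


# Hypothetical toy data: the outcome equals 1 exactly when the second covariate is positive.
X = [
    [1.0, 2.0, 0.3],
    [1.0, -1.0, 0.8],
    [1.0, 0.5, -0.2],
    [1.0, -2.0, 0.1],
    [1.0, 1.5, -0.9],
    [1.0, -0.5, 0.4],
]
y = [1, 0, 1, 0, 1, 0]

beta_hat, s = best_subset_max_score(X, y, q=1)
print(beta_hat, s)
```

With the constraint `q=1`, the search can keep only one covariate, and on this toy sample it selects the second one. The enumeration grows combinatorially in the number of covariates, which is the computational obstacle that a mixed integer formulation is meant to address.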

MSC:

62P20 Applications of statistics to economics
62G05 Nonparametric estimation
62G20 Asymptotic properties of nonparametric inference

References:

[1] Abrevaya, J., Rank estimation of a generalized fixed-effects regression model, J. Econometrics, 95, 1, 1-23, (2000) · Zbl 0970.62045
[2] Abrevaya, J.; Huang, J., On the bootstrap of the maximum score estimator, Econometrica, 73, 4, 1175-1204, (2005) · Zbl 1152.62337
[3] Benoit, D. F.; Van den Poel, D., Binary quantile regression: a Bayesian approach based on the asymmetric Laplace distribution, J. Appl. Econometrics, 27, 7, 1174-1188, (2012)
[4] Bertsimas, D.; King, A.; Mazumder, R., Best subset selection via a modern optimization lens, Ann. Statist., 44, 2, 813-852, (2016) · Zbl 1335.62115
[5] Blevins, J. R., Non-standard rates of convergence of criterion-function-based set estimators, Econom. J., 18, 172-199, (2015) · Zbl 1521.62037
[6] Blevins, J. R.; Khan, S., Distribution-free estimation of heteroskedastic binary response models in Stata, Stata Journal, 13, 588-602, (2013)
[7] Blevins, J. R.; Khan, S., Local NLLS estimation of semiparametric binary choice models, Econom. J., 16, 135-160, (2013) · Zbl 1521.62105
[8] Charlier, E.; Melenberg, B.; van Soest, A. H.O., A smoothed maximum score estimator for the binary choice panel data model with an application to labour force participation, Stat. Neerl., 49, 3, 324-342, (1995) · Zbl 0845.62029
[9] Chen, X., Large sample sieve estimation of semi-nonparametric models, (Handbook of Econometrics, vol. 6, (2007), Elsevier), 5549-5632, (chapter 76)
[10] Chen, S., An integrated maximum score estimator for a generalized censored quantile regression model, J. Econometrics, 155, 1, 90-98, (2010) · Zbl 1431.62597
[11] Chen, L.-Y., Lee, S., 2015. Breaking the curse of dimensionality in conditional moment inequalities for discrete choice models. Cemmap Working Paper CWP26/15.
[12] Chen, L.-Y.; Lee, S.; Sung, M. J., Maximum score estimation with nonparametrically generated regressors, Econom. J., 17, 3, 271-300, (2014) · Zbl 1521.62038
[13] Chen, S.; Zhang, H., Binary quantile regression with local polynomial smoothing, J. Econometrics, 189, 1, 24-40, (2015) · Zbl 1337.62352
[14] Danilov, D.; Magnus, J. R., On the harm that ignoring pretesting can cause, J. Econometrics, 122, 1, 27-46, (2004) · Zbl 1282.91257
[15] de Jong, R.; Woutersen, T., Dynamic time series binary choice, Econometric Theory, 1-30, (2011)
[16] Delgado, M. A.; Rodríguez-Poo, J. M.; Wolf, M., Subsampling inference in cube root asymptotics with an application to Manski’s maximum score estimator, Econom. Lett., 73, 2, 241-250, (2001) · Zbl 1056.91546
[17] Devroye, L.; Györfi, L.; Lugosi, G., Probabilistic theory of pattern recognition, (1996), Springer · Zbl 0853.68150
[18] Elliott, G.; Lieli, R., Predicting binary outcomes, J. Econometrics, 174, 1, 15-26, (2013) · Zbl 1277.62043
[19] Florios, K.; Skouras, S., Exact computation of max weighted score estimators, J. Econometrics, 146, 1, 86-91, (2008) · Zbl 1418.62450
[20] Fox, J. T., Semiparametric estimation of multinomial discrete-choice models using a subset of choices, Rand J. Econ., 1002-1019, (2007)
[21] Friedman, J.; Hastie, T.; Tibshirani, R., Regularization paths for generalized linear models via coordinate descent, J. Stat. Softw., 33, 1, 1-22, (2010)
[22] Graham, B.S., 2016. Homophily and transitivity in dynamic network formation. Cemmap Working Paper CWP16/16.
[23] Greenshtein, E., Best subset selection, persistence in high-dimensional statistical learning and optimization under \(L_1\) constraint, Ann. Statist., 34, 5, 2367-2386, (2006) · Zbl 1106.62022
[24] Greenshtein, E.; Ritov, Y., Persistence in high-dimensional linear predictor selection and the virtue of overparametrization, Bernoulli, 10, 6, 971-988, (2004) · Zbl 1055.62078
[25] Guerre, E.; Moon, H., A study of a semiparametric binary choice model with integrated covariates, Econometric Theory, 22, 4, 721-742, (2006) · Zbl 1125.62025
[26] Hastie, T.; Tibshirani, R.; Friedman, J., (The Elements of Statistical Learning: Prediction, Inference and Data Mining, Springer Series in Statistics New York, vol. 2, (2009)) · Zbl 1273.62005
[27] Horowitz, J. L., A smoothed maximum score estimator for the binary response model, Econometrica, 60, 3, 505-531, (1992) · Zbl 0761.62166
[28] Horowitz, J. L., Semiparametric estimation of a work-trip mode choice model, J. Econometrics, 58, 1, 49-70, (1993) · Zbl 0772.62066
[29] Horowitz, J. L., Semiparametric methods in econometrics, (1998), Springer · Zbl 0897.62128
[30] Horowitz, J. L., Bootstrap critical values for tests based on the smoothed maximum score estimator, J. Econometrics, 111, 2, 141-167, (2002) · Zbl 1020.62035
[31] Jiang, W.; Tanner, M. A., Risk minimization for time series binary choice with variable selection, Econometric Theory, 26, 5, 1437-1452, (2010) · Zbl 1197.62129
[32] Johnson, D.; Preparata, F., The densest hemisphere problem, Theoret. Comput. Sci., 6, 1, 93-107, (1978) · Zbl 0368.68053
[33] Jun, S. J.; Pinkse, J.; Wan, Y., Classical Laplace estimation for \(\sqrt[3]{n}\)-consistent estimators: improved convergence rates and rate-adaptive inference, J. Econometrics, 187, 1, 201-216, (2015) · Zbl 1337.62102
[34] Jun, S. J.; Pinkse, J.; Wan, Y., Integrated score estimation, Econometric Theory, 33, 6, 1418-1456, (2017) · Zbl 1396.62089
[35] Khan, S., Distribution free estimation of heteroskedastic binary response models using probit/logit criterion functions, J. Econometrics, 172, 1, 168-182, (2013) · Zbl 1443.62477
[36] Kim, J.; Pollard, D., Cube root asymptotics, Ann. Statist., 18, 1, 191-219, (1990) · Zbl 0703.62063
[37] Kitagawa, T.; Tetenov, A., Who should be treated? Empirical welfare maximization methods for treatment choice, Econometrica, 86, 2, 591-616, (2018) · Zbl 1419.91280
[38] Komarova, T., Binary choice models with discrete regressors: identification and misspecification, J. Econometrics, 177, 1, 14-33, (2013) · Zbl 1285.62053
[39] Lee, S. M.S.; Pun, M. C., On m out of n bootstrapping for nonstandard M-estimation with nuisance parameters, J. Amer. Statist. Assoc., 101, 475, 1185-1197, (2006) · Zbl 1120.62310
[40] Lee, S.; Seo, M. H., Semiparametric estimation of a binary response model with a change-point due to a covariate threshold, J. Econometrics, 144, 2, 492-499, (2008) · Zbl 1418.62504
[41] Lee, S.; Seo, M. H.; Shin, Y., Testing for threshold effects in regression models, J. Amer. Statist. Assoc., 106, 493, 220-231, (2011) · Zbl 1396.62025
[42] Lieli, R. P.; Nieto-Barthaburu, A., Optimal binary prediction for group decision making, J. Bus. Econom. Statist., 28, 2, 308-319, (2010) · Zbl 1198.62125
[43] Lieli, R. P.; Springborn, M., Closing the gap between risk estimation and decision making: efficient management of trade-related invasive species risk, Rev. Econom. Stat., 95, 2, 632-645, (2013)
[44] Lieli, R. P.; White, H., The construction of empirical credit scoring rules based on maximization principles, J. Econometrics, 157, 1, 110-119, (2010) · Zbl 1431.62647
[45] Lugosi, G., Pattern classification and learning theory, (Györfi, L., Principles of Nonparametric Learning, (2002), Springer), 1-56
[46] Magnac, T.; Maurin, E., Partial identification in monotone binary models: discrete regressors and interval data, Rev. Econom. Stud., 75, 3, 835-864, (2008) · Zbl 1141.91642
[47] Magnus, J. R.; Durbin, J., Estimation of regression coefficients of interest when other regression coefficients are of no interest, Econometrica, 67, 3, 639-643, (1999) · Zbl 1056.62525
[48] Mammen, E.; Tsybakov, A. B., Smooth discrimination analysis, Ann. Statist., 27, 6, 1808-1829, (1999) · Zbl 0961.62058
[49] Manski, C. F., Maximum score estimation of the stochastic utility model of choice, J. Econometrics, 3, 3, 205-228, (1975) · Zbl 0307.62068
[50] Manski, C. F., Semiparametric analysis of discrete response: asymptotic properties of the maximum score estimator, J. Econometrics, 27, 3, 313-333, (1985) · Zbl 0567.62096
[51] Manski, C. F., Semiparametric analysis of random effects linear models from binary panel data, Econometrica, 55, 2, 357-362, (1987) · Zbl 0655.62106
[52] Manski, C. F., Identification of binary response models, J. Amer. Statist. Assoc., 83, 403, 729-738, (1988) · Zbl 0684.62049
[53] Manski, C. F.; Tamer, E., Inference on regressions with interval data on a regressor or outcome, Econometrica, 70, 2, 519-546, (2002) · Zbl 1121.62544
[54] Manski, C. F.; Thompson, T. S., Operational characteristics of maximum score estimation, J. Econometrics, 32, 1, 85-108, (1986)
[55] Manski, C. F.; Thompson, T. S., Estimation of best predictors of binary response, J. Econometrics, 40, 1, 97-123, (1989) · Zbl 0684.62050
[56] Massart, P.; Nédélec, E., Risk bounds for statistical learning, Ann. Statist., 34, 5, 2326-2366, (2006) · Zbl 1108.62007
[57] Matzkin, R. L., Nonparametric identification and estimation of polychotomous choice models, J. Econometrics, 58, 1-2, 137-168, (1993) · Zbl 0780.62030
[58] Moon, H., Maximum score estimation of a nonstationary binary choice model, J. Econometrics, 122, 2, 385-403, (2004) · Zbl 1328.62216
[59] Natarajan, B. K., Sparse approximate solutions to linear systems, SIAM J. Comput., 24, 2, 227-234, (1995) · Zbl 0827.68054
[60] Patra, R.K., Seijo, E., Sen, B., 2015. A consistent bootstrap procedure for the maximum score estimator. J. Econometrics, forthcoming, https://doi.org/10.1016/j.jeconom.2018.04.001 · Zbl 1452.62254
[61] Pinkse, C., On the computation of semiparametric estimates in limited dependent variable models, J. Econometrics, 58, 1, 185-205, (1993) · Zbl 0775.62342
[62] Raskutti, G.; Wainwright, M. J.; Yu, B., Minimax rates of estimation for high-dimensional linear regression over \(\ell_q\)-balls, IEEE Trans. Inform. Theory, 57, 10, 6976-6994, (2011) · Zbl 1365.62276
[63] Seo, M. H.; Otsu, T., Local M-estimation with discontinuous criterion for dependent and limited observations, Ann. Statist., 46, 1, 344-369, (2018) · Zbl 1394.62058
[64] Tsybakov, A. B., Optimal aggregation of classifiers in statistical learning, Ann. Statist., 32, 1, 135-166, (2004) · Zbl 1105.62353
[65] van de Geer, S. A.; Bühlmann, P., On the conditions used to prove oracle results for the lasso, Electron. J. Stat., 3, 1360-1392, (2009) · Zbl 1327.62425