Overfitting, generalization, and MSE in class probability estimation with high-dimensional data. (English) Zbl 1441.62399

Summary: Accurate class probability estimation is important for medical decision making but is challenging, particularly when the number of candidate features exceeds the number of cases. Special methods have been developed for nonprobabilistic classification, but relatively little attention has been given to class probability estimation with numerous candidate variables. In this paper, we investigate overfitting in the development of regularized class probability estimators. We investigate the relation between overfitting and accurate class probability estimation in terms of mean square error. Using simulation studies based on real datasets, we found that some degree of overfitting can be desirable for reducing mean square error. We also introduce a mean square error decomposition for class probability estimation that helps clarify the relationship between overfitting and prediction accuracy.
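One way to see why mean square error against the true class probabilities is a natural target is the standard identity relating it to the expected Brier score: for a binary outcome Y with true probability p and estimate q, E[(Y - q)^2] = (p - q)^2 + p(1 - p). The sketch below checks this numerically. It is not the decomposition introduced in the paper, which the summary does not spell out; the function names and the toy probabilities are hypothetical and chosen only for illustration.

```python
def mse(true_probs, est_probs):
    """Mean squared error of estimated probabilities against the true ones."""
    n = len(true_probs)
    return sum((p - q) ** 2 for p, q in zip(true_probs, est_probs)) / n


def expected_brier(true_probs, est_probs):
    """Exact expected Brier score when each outcome Y_i ~ Bernoulli(p_i)."""
    total = 0.0
    for p, q in zip(true_probs, est_probs):
        # Y = 1 with probability p, Y = 0 with probability 1 - p.
        total += p * (1 - q) ** 2 + (1 - p) * q ** 2
    return total / len(true_probs)


def irreducible_noise(true_probs):
    """Average Bernoulli variance p(1 - p); no estimator can remove this term."""
    return sum(p * (1 - p) for p in true_probs) / len(true_probs)


# Hypothetical toy values (assumptions, not data from the paper).
true_p = [0.1, 0.4, 0.7, 0.9]
est_p = [0.2, 0.5, 0.6, 0.95]

lhs = expected_brier(true_p, est_p)
rhs = mse(true_p, est_p) + irreducible_noise(true_p)
print(abs(lhs - rhs) < 1e-12)
```

Because the noise term does not depend on the estimator, ranking estimators by expected Brier score and by MSE gives the same order. With real data the true probabilities are unknown, so MSE itself can only be computed in simulations where they are specified.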

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis

Software:

R; glmnet; boost
