Abstract
Data splitting divides data into two parts. One part is reserved for model selection. In some applications, the second part is used for model validation, but we use this part for estimating the parameters of the chosen model. We focus on the problem of constructing reliable predictive distributions for future observed values. We judge the predictive performance using log scoring. We compare the full data strategy with the data splitting strategy for prediction. We show how the full data score can be decomposed into model selection, parameter estimation and data reuse costs. Data splitting is preferred when data reuse costs are high. We investigate the relative performance of the strategies in four simulation scenarios. We introduce a hybrid estimator that uses one part for model selection but both parts for estimation. We argue that a split data analysis is preferred to a full data analysis for prediction, with some exceptions.
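The three strategies named in the abstract (full data, split data, and the hybrid that selects on one part but estimates on both) can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the article's actual procedure: the selection rule (pick the single predictor most correlated with the response), the plug-in normal predictive distribution, the sign convention for the log score (mean log predictive density, larger is better), the half-and-half split, and all function names are illustrative choices made here.

```python
import numpy as np

def select_predictor(X, y):
    """Assumed selection rule: the column of X with largest |correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return int(np.argmax(np.abs(corr)))

def fit_simple_regression(x, y):
    """Least-squares fit of y on one predictor: (intercept, slope, residual sd)."""
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sigma = np.sqrt(resid @ resid / (len(y) - 2))
    return coef[0], coef[1], sigma

def mean_log_score(params, x_new, y_new):
    """Mean log density of y_new under a plug-in normal predictive distribution."""
    b0, b1, sigma = params
    z = (y_new - b0 - b1 * x_new) / sigma
    return float(np.mean(-0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * z**2))

def compare_strategies(X, y, X_new, y_new):
    """Score the full, split and hybrid strategies on fresh data."""
    n = len(y)
    half = n // 2
    sel, est = slice(0, half), slice(half, n)   # assumed 50/50 split
    j_full = select_predictor(X, y)             # select on all data
    j_split = select_predictor(X[sel], y[sel])  # select on first part only
    fits = {
        # full: select and estimate on all data (data are reused)
        "full": (j_full, fit_simple_regression(X[:, j_full], y)),
        # split: select on first part, estimate on second part
        "split": (j_split, fit_simple_regression(X[est, j_split], y[est])),
        # hybrid: select on first part, estimate on both parts
        "hybrid": (j_split, fit_simple_regression(X[:, j_split], y)),
    }
    return {name: mean_log_score(p, X_new[:, j], y_new)
            for name, (j, p) in fits.items()}

# Hypothetical scenario: 10 candidate predictors, only the first one matters.
rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = 1.0 + 0.5 * X[:, 0] + rng.normal(size=n)
X_new = rng.normal(size=(1000, p))
y_new = 1.0 + 0.5 * X_new[:, 0] + rng.normal(size=1000)
scores = compare_strategies(X, y, X_new, y_new)
print(scores)
```

A single run like this says little about which strategy is better; a fair comparison would average the scores over many simulated data sets and over scenarios that differ in how costly data reuse is.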
Cite this article
Faraway, J.J. Does data splitting improve prediction? Stat Comput 26, 49–60 (2016). https://doi.org/10.1007/s11222-014-9522-9
DOI: https://doi.org/10.1007/s11222-014-9522-9