Does data splitting improve prediction?

Julian J. Faraway¹

Abstract

Data splitting divides data into two parts. One part is reserved for model selection. In some applications, the second part is used for model validation, but we use this part for estimating the parameters of the chosen model. We focus on the problem of constructing reliable predictive distributions for future observed values. We judge the predictive performance using log scoring. We compare the full data strategy with the data splitting strategy for prediction. We show how the full data score can be decomposed into model selection, parameter estimation and data reuse costs. Data splitting is preferred when data reuse costs are high. We investigate the relative performance of the strategies in four simulation scenarios. We introduce a hybrid estimator that uses one part for model selection but both parts for estimation. We argue that a split data analysis is preferred to a full data analysis for prediction, with some exceptions.
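
The three strategies named in the abstract (full data, split data, hybrid) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the article: the nested linear models, the BIC selection rule, the plug-in normal predictive distribution, the simulated data, and the function names `fit_ols`, `select_size`, `log_score` and `compare_strategies`. The log score is reported as a mean negative log predictive density, so lower values are better under this convention.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit; returns coefficients and the ML estimate of sigma."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma = np.sqrt(np.mean(resid ** 2))
    return beta, sigma

def select_size(X, y):
    """Choose how many leading columns to keep by minimising BIC (assumed rule)."""
    n, p = X.shape
    best_k, best_bic = 1, np.inf
    for k in range(1, p + 1):
        _, sigma = fit_ols(X[:, :k], y)
        bic = n * np.log(sigma ** 2) + (k + 1) * np.log(n)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k

def log_score(X, y, k, beta, sigma):
    """Mean negative log density of y under a plug-in normal predictive."""
    mu = X[:, :k] @ beta
    return np.mean(0.5 * np.log(2 * np.pi * sigma ** 2)
                   + (y - mu) ** 2 / (2 * sigma ** 2))

def compare_strategies(n=60, p=8, n_test=2000, seed=0):
    """Score the three strategies on fresh data from a hypothetical linear model."""
    rng = np.random.default_rng(seed)
    beta_true = np.array([1.0, 0.5, 0.25] + [0.0] * (p - 3))

    def draw(m):
        X = rng.normal(size=(m, p))
        return X, X @ beta_true + rng.normal(size=m)

    X, y = draw(n)
    X_new, y_new = draw(n_test)
    half = n // 2
    X1, y1, X2, y2 = X[:half], y[:half], X[half:], y[half:]

    scores = {}
    # Full data: select and estimate on all observations.
    k = select_size(X, y)
    scores["full"] = log_score(X_new, y_new, k, *fit_ols(X[:, :k], y))
    # Split data: select on part one, estimate on part two only.
    k = select_size(X1, y1)
    scores["split"] = log_score(X_new, y_new, k, *fit_ols(X2[:, :k], y2))
    # Hybrid: select on part one, estimate on both parts.
    scores["hybrid"] = log_score(X_new, y_new, k, *fit_ols(X[:, :k], y))
    return scores
```

Calling `compare_strategies()` returns a dictionary with one score per strategy. A single simulated data set says little about which strategy is better; averaging over many seeds and varying the data-generating model would be needed before drawing any conclusion.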

Author information

Authors and Affiliations

Department of Mathematical Sciences, University of Bath, Bath, BA2 7AY, UK
Julian J. Faraway

Corresponding author

Correspondence to Julian J. Faraway.

About this article

Cite this article

Faraway, J.J. Does data splitting improve prediction? Stat Comput 26, 49–60 (2016). https://doi.org/10.1007/s11222-014-9522-9

Received: 05 April 2013
Accepted: 04 October 2014
Published: 29 October 2014
Issue Date: January 2016
DOI: https://doi.org/10.1007/s11222-014-9522-9
