Abstract
Significance testing has become a mainstay in machine learning, with the p value being firmly embedded in the current research practice. Significance tests are widely believed to lend scientific rigor to the interpretation of empirical findings; however, their problems have received only scant attention in the machine learning literature so far. Here, we investigate one particular problem, the Jeffreys–Lindley paradox. This paradox describes a statistical conundrum: the p value can be close to zero, convincing us that there is overwhelming evidence against the null hypothesis. At the same time, however, the posterior probability of the null hypothesis being true can be close to 1, convincing us of the exact opposite. In experiments with synthetic data sets and a subsequent thought experiment, we demonstrate that this paradox can have severe repercussions for the comparison of multiple classifiers over multiple benchmark data sets. Our main result suggests that significance tests should not be used in such comparative studies. We caution that the reliance on significance tests might lead to a situation that is similar to the reproducibility crisis in other fields of science. We offer for debate four avenues that might alleviate the looming crisis.
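The paradox can be reproduced numerically. The following sketch is illustrative only and is not the authors' experimental setup: it assumes a two-sided z-test of a point null H0: μ = 0 with known σ, a N(0, τ²) prior on μ under H1, and equal prior odds P(H0) = P(H1) = 0.5; the sample mean is held fixed at z standard errors from zero, so the p value is identical for every sample size n while the posterior probability of H0 grows with n.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def lindley_demo(n, z=2.5, sigma=1.0, tau=1.0):
    """p value and posterior P(H0 | data) for a mean held at z standard errors."""
    xbar = z * sigma / math.sqrt(n)          # observed mean, fixed at z SEs from 0
    p_value = math.erfc(z / math.sqrt(2.0))  # two-sided p value; independent of n
    se2 = sigma ** 2 / n                     # sampling variance of the mean
    # Bayes factor BF01: marginal likelihood of xbar under H0 vs. under H1,
    # where xbar ~ N(0, se2) under H0 and xbar ~ N(0, se2 + tau^2) under H1
    bf01 = normal_pdf(xbar, se2) / normal_pdf(xbar, se2 + tau ** 2)
    posterior_h0 = bf01 / (1.0 + bf01)       # equal prior odds P(H0) = 0.5
    return p_value, posterior_h0

for n in (10, 10_000, 1_000_000):
    p, post = lindley_demo(n)
    print(f"n = {n:>9}: p = {p:.4f}, P(H0 | data) = {post:.3f}")
```

With these (assumed) defaults, the p value stays near 0.012 at every n, seemingly strong evidence against H0, while P(H0 | data) climbs from roughly 0.16 at n = 10 to roughly 0.98 at n = 10⁶: the same data that "significantly" reject the null make it almost certainly true.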
References
Baker, M.: Is there a reproducibility crisis? Nature 533, 452–454 (2016)
Bartlett, M.: A comment on D.V. Lindley’s statistical paradox. Biometrika 44, 533–534 (1957)
Bayarri, M., Berger, J.: \(P\) values for composite null models. J. Am. Stat. Assoc. 95(452), 1127–1142 (2000)
Begley, C., Ioannidis, J.: Reproducibility in science: improving the standard for basic and preclinical research. Circ. Res. 116(1), 116–126 (2015)
Benavoli, A., Corani, G., Demšar, J., Zaffalon, M.: Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. J. Mach. Learn. Res. 18(77), 1–36 (2017)
Benavoli, A., Corani, G., Mangili, F.: Should we really use post-hoc tests based on mean-ranks? J. Mach. Learn. Res. 17(5), 1–10 (2016)
Berger, J., Berry, D.: Statistical analysis and the illusion of objectivity. Am. Sci. 76, 159–165 (1988)
Berger, J., Delampady, M.: Testing precise hypotheses. Stat. Sci. 2(3), 317–352 (1987)
Berrar, D.: Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers. Mach. Learn. 106(6), 911–949 (2017)
Berrar, D., Dubitzky, W.: Jeffreys–Lindley Paradox in Machine Learning (2017). http://doi.org/10.17605/OSF.IO/SNXWJ. Accessed 23 July 2018
Berrar, D., Dubitzky, W.: On the Jeffreys–Lindley paradox and the looming reproducibility crisis in machine learning. In: Proceedings of the 2017 IEEE International Conference on Data Science and Advanced Analytics, pp. 334–340 (2017)
Berrar, D., Lopes, P., Dubitzky, W.: Caveats and pitfalls in crowdsourcing research: the case of soccer referee bias. Int. J. Data Sci. Anal. 4(2), 143–151 (2017)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Cohen, J.: The earth is round (\(p < .05\)). Am. Psychol. 49(12), 997–1003 (1994)
Cousins, R.D.: The Jeffreys–Lindley paradox and discovery criteria in high energy physics. Synthese 194(2), 395–432 (2017)
Cox, D., Hinkley, D.: Theoretical Statistics. Chapman and Hall/CRC, London (1974)
Cumming, G.: Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Routledge, New York (2012)
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
Fisher, R.: Statistical methods and scientific induction. J. R. Stat. Soc. Ser. B 17(1), 69–78 (1955)
Foster, E., Deardorff, A.: Open Science Framework (OSF). J. Med. Libr. Assoc. JMLA 105(2), 203–206 (2017). https://doi.org/10.5195/jmla.2017.88. Accessed 23 July 2018
Gelman, A., Loken, E.: The garden of forking paths: why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time (2013). http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf. Accessed 23 July 2018
Gigerenzer, G.: Mindless statistics. J. Socio-Econ. 33, 587–606 (2004)
Goodman, S.: Toward evidence-based medical statistics. 1: the \(P\) value fallacy. Ann. Intern. Med. 130(12), 995–1004 (1999)
Goodman, S.: A dirty dozen: twelve \(P\)-value misconceptions. Semin. Hematol. 45(3), 135–140 (2008)
Goodman, S., Royall, R.: Evidence and scientific research. Am. J. Public Health 78(12), 1568–1574 (1988)
Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.B., Poole, C., Goodman, S.N., Altman, D.G.: Statistical tests, \(p\) values, confidence intervals, and power: a guide to misinterpretations. Eur. J. Epidemiol. 31(4), 337–350 (2016)
Hays, W.: Statistics for the Social Sciences. Holt, Rinehart & Winston, New York (1973)
Hubbard, R.: Alphabet soup—blurring the distinctions between \(p\)’s and \(\alpha \)’s in psychological research. Theory Psychol. 14(3), 295–327 (2004)
Hubbard, R., Armstrong, J.: Why we don’t really know what “statistical significance” means: a major educational failure. J. Mark. Educ. 28(2), 114–120 (2006)
Hubbard, R., Lindsay, R.: Why \(p\) values are not a useful measure of evidence in statistical significance testing. Theory Psychol. 18(1), 69–88 (2008)
Ioannidis, J.: Why most published research findings are false. PLoS Med. 2(8), e124 (2005)
Jeffreys, H.: Theory of Probability, 3rd edn. Clarendon Press, Oxford (1961). (Reprinted 2003)
Leek, J., McShane, B., Gelman, A., Colquhoun, D., Nuijten, M., Goodman, S.: Five ways to fix statistics. Nature 551, 557–559 (2017)
Levin, J.: What if there were no more bickering about statistical significance tests? Res. Sch. 5(2), 43–53 (1998)
Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18–22 (2002). http://CRAN.R-project.org/doc/Rnews/. Accessed 23 July 2018
Lindley, D.: A statistical paradox. Biometrika 44, 187–192 (1957)
Lu, M., Ishwaran, H.: A prediction-based alternative to \(P\) values in regression models. J. Thorac. Cardiovasc. Surg. 155(3), 1130–1136.e4 (2018)
Matthews, R., Wasserstein, R., Spiegelhalter, D.: The ASA’s \(p\)-value statement, one year on. Significance 14(2), 38–41 (2017)
McShane, B.B., Gal, D., Gelman, A., Robert, C., Tackett, J.L.: Abandon statistical significance. arXiv preprint arXiv:1709.07588 (2017)
Nuzzo, R.: Statistical errors. Nature 506, 150–152 (2014)
Poole, C.: Beyond the confidence interval. Am. J. Public Health 77(2), 195–199 (1987)
R Core Team: R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2017). https://www.R-project.org/. Accessed 23 July 2018
Rosenthal, R.: The file drawer problem and tolerance for null results. Psychol. Bull. 86(3), 638–641 (1979)
Rothman, K.: Writing for epidemiology. Epidemiology 9(3), 333–337 (1998)
Rothman, K., Greenland, S., Lash, T.: Modern Epidemiology, 3rd edn. Wolters Kluwer, Alphen aan den Rijn (2008)
Savalei, V., Dunn, E.: Is the call to abandon \(p\)-values the red herring of the replicability crisis? Front. Psychol. 6, Article 245, 1–4 (2015)
Schervish, M.: \(P\) values: what they are and what they are not. Am. Stat. 50(3), 203–206 (1996)
Schmidt, F.: Statistical significance testing and cumulative knowledge in psychology: implications for training of researchers. Psychol. Methods 1(2), 115–129 (1996)
Schmidt, F., Hunter, J.: Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In: Harlow, L., Mulaik, S., Steiger, J. (eds.) What If There were No Significance Tests?, pp. 37–64. Psychology Press, Hove (1997)
Sellke, T., Bayarri, M., Berger, J.: Calibration of \(p\) values for testing precise null hypotheses. Am. Stat. 55(1), 62–71 (2001)
Senn, S.: Two cheers for \(p\)-values? J. Epidemiol. Biostat. 6, 193–204 (2001)
Simmons, J., Nelson, L., Simonsohn, U.: False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22(11), 1359–1366 (2011)
Trafimow, D., Marks, M.: Editorial. Basic Appl. Soc. Psychol. 37, 1–2 (2015)
Wasserstein, R., Lazar, N.: The ASA’s statement on \(p\)-values: context, process, and purpose (editorial). Am. Stat. 70(2), 129–133 (2016)
Webb, G.I., Boughton, J.R., Zheng, F., Ting, K.M., Salem, H.: Learning by extrapolation from marginal to full-multivariate probability distributions: decreasingly naive Bayesian classification. Mach. Learn. 86(2), 233–272 (2012)
Additional information
This paper is an extended version of the DSAA2017 Research Track paper titled “On the Jeffreys–Lindley paradox and the looming reproducibility crisis in machine learning” [11].
Cite this article
Berrar, D., Dubitzky, W.: Should significance testing be abandoned in machine learning? Int. J. Data Sci. Anal. 7, 247–257 (2019). https://doi.org/10.1007/s41060-018-0148-4