Abstract
Significance testing has become a mainstay in machine learning, with the p value being firmly embedded in the current research practice. Significance tests are widely believed to lend scientific rigor to the interpretation of empirical findings; however, their problems have received only scant attention in the machine learning literature so far. Here, we investigate one particular problem, the Jeffreys–Lindley paradox. This paradox describes a statistical conundrum: the p value can be close to zero, convincing us that there is overwhelming evidence against the null hypothesis. At the same time, however, the posterior probability of the null hypothesis being true can be close to 1, convincing us of the exact opposite. In experiments with synthetic data sets and a subsequent thought experiment, we demonstrate that this paradox can have severe repercussions for the comparison of multiple classifiers over multiple benchmark data sets. Our main result suggests that significance tests should not be used in such comparative studies. We caution that the reliance on significance tests might lead to a situation that is similar to the reproducibility crisis in other fields of science. We offer for debate four avenues that might alleviate the looming crisis.
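The paradox can be reproduced numerically. The following sketch is illustrative only and is not the authors' experimental setup: it assumes a two-sided z-test of a point null H0: μ = 0 with known σ, a N(0, τ²) prior on μ under H1, and equal prior odds P(H0) = P(H1) = 0.5; the sample mean is held fixed at z standard errors from zero, so the p value is identical for every sample size n while the posterior probability of H0 grows with n.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def lindley_demo(n, z=2.5, sigma=1.0, tau=1.0):
    """p value and posterior P(H0 | data) for a mean held at z standard errors."""
    xbar = z * sigma / math.sqrt(n)          # observed mean, fixed at z SEs from 0
    p_value = math.erfc(z / math.sqrt(2.0))  # two-sided p value; independent of n
    se2 = sigma ** 2 / n                     # sampling variance of the mean
    # Bayes factor BF01: marginal likelihood of xbar under H0 vs. under H1,
    # where xbar ~ N(0, se2) under H0 and xbar ~ N(0, se2 + tau^2) under H1
    bf01 = normal_pdf(xbar, se2) / normal_pdf(xbar, se2 + tau ** 2)
    posterior_h0 = bf01 / (1.0 + bf01)       # equal prior odds P(H0) = 0.5
    return p_value, posterior_h0

for n in (10, 10_000, 1_000_000):
    p, post = lindley_demo(n)
    print(f"n = {n:>9}: p = {p:.4f}, P(H0 | data) = {post:.3f}")
```

With these (assumed) defaults, the p value stays near 0.012 at every n, seemingly strong evidence against H0, while P(H0 | data) climbs from roughly 0.16 at n = 10 to roughly 0.98 at n = 10⁶: the same data that "significantly" reject the null make it almost certainly true.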
References
Baker, M.: Is there a reproducibility crisis? Nature 533, 452–454 (2016)
Bartlett, M.: A comment on D.V. Lindley’s statistical paradox. Biometrika 44, 533–534 (1957)
Bayarri, M., Berger, J.: \(P\) values for composite null models. J. Am. Stat. Assoc. 95(452), 1127–1142 (2000)
Begley, C., Ioannidis, J.: Reproducibility in science: improving the standard for basic and preclinical research. Circ. Res. 116(1), 116–126 (2015)
Benavoli, A., Corani, G., Demšar, J., Zaffalon, M.: Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. J. Mach. Learn. Res. 18(77), 1–36 (2017)
Benavoli, A., Corani, G., Mangili, F.: Should we really use post-hoc tests based on mean-ranks? J. Mach. Learn. Res. 17(5), 1–10 (2016)
Berger, J., Berry, D.: Statistical analysis and the illusion of objectivity. Am. Sci. 76, 159–165 (1988)
Berger, J., Delampady, M.: Testing precise hypotheses. Stat. Sci. 2(3), 317–352 (1987)
Berrar, D.: Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers. Mach. Learn. 106(6), 911–949 (2017)
Berrar, D., Dubitzky, W.: Jeffreys–Lindley Paradox in Machine Learning (2017). http://doi.org/10.17605/OSF.IO/SNXWJ. Accessed 23 July 2018
Berrar, D., Dubitzky, W.: On the Jeffreys–Lindley paradox and the looming reproducibility crisis in machine learning. In: Proceedings of the 2017 IEEE International Conference on Data Science and Advanced Analytics, pp. 334–340 (2017)
Berrar, D., Lopes, P., Dubitzky, W.: Caveats and pitfalls in crowdsourcing research: the case of soccer referee bias. Int. J. Data Sci. Anal. 4(2), 143–151 (2017)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Cohen, J.: The earth is round (\(p < .05\)). Am. Psychol. 49(12), 997–1003 (1994)
Cousins, R.D.: The Jeffreys–Lindley paradox and discovery criteria in high energy physics. Synthese 194(2), 395–432 (2017)
Cox, D., Hinkley, D.: Theoretical Statistics. Chapman and Hall/CRC, London (1974)
Cumming, G.: Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Routledge, New York (2012)
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
Fisher, R.: Statistical methods and scientific induction. J. R. Stat. Soc. Ser. B 17(1), 69–78 (1955)
Foster, E., Deardorff, A.: Open Science Framework (OSF). J. Med. Libr. Assoc. JMLA 105(2), 203–206 (2017). https://doi.org/10.5195/jmla.2017.88. Accessed 23 July 2018
Gelman, A., Loken, E.: The garden of forking paths: why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time (2013). http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf. Accessed 23 July 2018
Gigerenzer, G.: Mindless statistics. J. Socio-Econ. 33, 587–606 (2004)
Goodman, S.: Toward evidence-based medical statistics. 1: the \(P\) value fallacy. Ann. Intern. Med. 130(12), 995–1004 (1999)
Goodman, S.: A dirty dozen: twelve \(P\)-value misconceptions. Semin. Hematol. 45(3), 135–140 (2008)
Goodman, S., Royall, R.: Evidence and scientific research. Am. J. Public Health 78(12), 1568–1574 (1988)
Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.B., Poole, C., Goodman, S.N., Altman, D.G.: Statistical tests, \(p\) values, confidence intervals, and power: a guide to misinterpretations. Eur. J. Epidemiol. 31(4), 337–350 (2016)
Hays, W.: Statistics for the Social Sciences. Holt, Rinehart & Winston, New York (1973)
Hubbard, R.: Alphabet soup—blurring the distinctions between \(p\)’s and \(\alpha \)’s in psychological research. Theory Psychol. 14(3), 295–327 (2004)
Hubbard, R., Armstrong, J.: Why we don’t really know what “statistical significance” means: a major educational failure. J. Mark. Educ. 28(2), 114–120 (2006)
Hubbard, R., Lindsay, R.: Why \(p\) values are not a useful measure of evidence in statistical significance testing. Theory Psychol. 18(1), 69–88 (2008)
Ioannidis, J.: Why most published research findings are false. PLoS Med. 2(8), e124 (2005)
Jeffreys, H.: Theory of Probability, 3rd edn. Clarendon Press, Oxford (1961). (Reprinted 2003)
Leek, J., McShane, B., Gelman, A., Colquhoun, D., Nuijten, M., Goodman, S.: Five ways to fix statistics. Nature 551, 557–559 (2017)
Levin, J.: What if there were no more bickering about statistical significance tests? Res. Sch. 5(2), 43–53 (1998)
Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18–22 (2002). http://CRAN.R-project.org/doc/Rnews/. Accessed 23 July 2018
Lindley, D.: A statistical paradox. Biometrika 44, 187–192 (1957)
Lu, M., Ishwaran, H.: A prediction-based alternative to \(P\) values in regression models. J. Thorac. Cardiovasc. Surg. 155(3), 1130–1136.e4 (2018)
Matthews, R., Wasserstein, R., Spiegelhalter, D.: The ASA’s \(p\)-value statement, one year on. Significance 14(2), 38–41 (2017)
McShane, B.B., Gal, D., Gelman, A., Robert, C., Tackett, J.L.: Abandon statistical significance. arXiv preprint arXiv:1709.07588 (2017)
Nuzzo, R.: Statistical errors. Nature 506, 150–152 (2014)
Poole, C.: Beyond the confidence interval. Am. J. Public Health 77(2), 195–199 (1987)
R Core Team: R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2017). https://www.R-project.org/. Accessed 23 July 2018
Rosenthal, R.: The file drawer problem and tolerance for null results. Psychol. Bull. 86(3), 638–641 (1979)
Rothman, K.: Writing for epidemiology. Epidemiology 9(3), 333–337 (1998)
Rothman, K., Greenland, S., Lash, T.: Modern Epidemiology, 3rd edn. Wolters Kluwer, Alphen aan den Rijn (2008)
Savalei, V., Dunn, E.: Is the call to abandon \(p\)-values the red herring of the replicability crisis? Front. Psychol. 6, Article 245, 1–4 (2015)
Schervish, M.: \(P\) values: what they are and what they are not. Am. Stat. 50(3), 203–206 (1996)
Schmidt, F.: Statistical significance testing and cumulative knowledge in psychology: implications for training of researchers. Psychol. Methods 1(2), 115–129 (1996)
Schmidt, F., Hunter, J.: Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In: Harlow, L., Mulaik, S., Steiger, J. (eds.) What If There were No Significance Tests?, pp. 37–64. Psychology Press, Hove (1997)
Sellke, T., Bayarri, M., Berger, J.: Calibration of \(p\) values for testing precise null hypotheses. Am. Stat. 55(1), 62–71 (2001)
Senn, S.: Two cheers for \(p\)-values? J. Epidemiol. Biostat. 6, 193–204 (2001)
Simmons, J., Nelson, L., Simonsohn, U.: False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22(11), 1359–1366 (2011)
Trafimow, D., Marks, M.: Editorial. Basic Appl. Soc. Psychol. 37, 1–2 (2015)
Wasserstein, R., Lazar, N.: The ASA’s statement on \(p\)-values: context, process, and purpose (editorial). Am. Stat. 70(2), 129–133 (2016)
Webb, G.I., Boughton, J.R., Zheng, F., Ting, K.M., Salem, H.: Learning by extrapolation from marginal to full-multivariate probability distributions: decreasingly naive Bayesian classification. Mach. Learn. 86(2), 233–272 (2012)
Additional information
This paper is an extended version of the DSAA2017 Research Track paper titled “On the Jeffreys–Lindley paradox and the looming reproducibility crisis in machine learning” [11].
Cite this article
Berrar, D., Dubitzky, W.: Should significance testing be abandoned in machine learning? Int. J. Data Sci. Anal. 7, 247–257 (2019). https://doi.org/10.1007/s41060-018-0148-4