Divergence versus decision \(P\)-values: a distinction worth making in theory and keeping in practice: or, how divergence \(P\)-values measure evidence even when decision \(P\)-values do not. (English) Zbl 07677029

Scand. J. Stat. 50, No. 1, 54-88 (2023); corrigendum ibid. 51, No. 1, 425 (2024).
Summary: There are two distinct definitions of “\(P\)-value” for evaluating a proposed hypothesis or model for the process generating an observed dataset. The original definition starts with a measure of the divergence of the dataset from what was expected under the model, such as a sum of squares or a deviance statistic. A \(P\)-value is then the ordinal location of the measure in a reference distribution computed from the model and the data, and is treated as a unit-scaled index of compatibility between the data and the model. In the other definition, a \(P\)-value is a random variable on the unit interval whose realizations can be compared to a cutoff \(\alpha\) to generate a decision rule with known error rates under the model and specific alternatives. It is commonly assumed that realizations of such decision \(P\)-values always correspond to divergence \(P\)-values. But this need not be so: Decision \(P\)-values can violate intuitive single-sample coherence criteria where divergence \(P\)-values do not. It is thus argued that divergence and decision \(P\)-values should be carefully distinguished in teaching, and that divergence \(P\)-values are the relevant choice when the analysis goal is to summarize evidence rather than implement a decision rule.
{© 2023 Board of the Foundation of the Scandinavian Journal of Statistics.}
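
Illustration (not part of the original record): the summary describes a divergence \(P\)-value as the ordinal location of a divergence measure in a reference distribution computed from the model. A minimal sketch of that idea follows, under assumptions that are not from the paper: the Pearson sum of squares is the divergence measure, the model is a fully specified multinomial, and the reference distribution is approximated by Monte Carlo simulation. The function names `pearson_divergence` and `divergence_p_value` are hypothetical.

```python
import random


def pearson_divergence(observed, expected):
    """Pearson sum-of-squares divergence of observed counts from expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def divergence_p_value(observed, probs, n_sim=5000, seed=0):
    """Locate the observed divergence in a simulated reference distribution.

    Assumption for illustration: the model is a multinomial with known cell
    probabilities `probs`, so the reference distribution can be simulated.
    """
    rng = random.Random(seed)
    n = sum(observed)
    expected = [n * p for p in probs]
    d_obs = pearson_divergence(observed, expected)

    categories = range(len(probs))
    at_least_as_divergent = 0
    for _ in range(n_sim):
        # Draw one dataset of size n from the model and tabulate its counts.
        counts = [0] * len(probs)
        for c in rng.choices(categories, weights=probs, k=n):
            counts[c] += 1
        if pearson_divergence(counts, expected) >= d_obs:
            at_least_as_divergent += 1

    # Upper-tail proportion; the add-one correction keeps the estimate above 0.
    return (at_least_as_divergent + 1) / (n_sim + 1)


if __name__ == "__main__":
    # Hypothetical data: 60 observations in 4 cells, model says cells are equiprobable.
    print(divergence_p_value([22, 14, 13, 11], [0.25] * 4))
```

A value near 1 indicates data close to what the model expected, and a value near 0 indicates data far from it, which is the unit-scaled compatibility reading given in the summary. No cutoff \(\alpha\) is used, since the sketch summarizes compatibility rather than implementing a decision rule.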

MSC:

62-XX Statistics
