
Abstract

For real-valued parameters, significance tests can be motivated as three-decision methods, in which we either assert that the parameter lies above or below a specified null value, or say nothing either way. Tukey viewed this as a “sensible formulation” of tests, unlike the widely taught null hypothesis significance testing (NHST) system that is today's default. We review the three-decision framework, collecting the substantial literature on how other statistical tools can be usefully motivated in this way. These tools include close Bayesian analogs of frequentist power calculations, p-values, confidence intervals, and multiple testing corrections. We also show how three-decision arguments can straightforwardly resolve some well-known difficulties in the interpretation and criticism of testing results. Explicit results are shown for simple conjugate analyses, but the methods discussed apply generally to real-valued parameters.
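To make the three-decision idea concrete, the following is a minimal sketch (illustrative only, not taken from the article) of such a rule for a single real-valued parameter. It assumes a Gaussian approximation for the estimator and bases the decision on a conventional two-sided confidence interval; the function name, labels, and numerical example are hypothetical.

    from statistics import NormalDist

    def three_decision(estimate, std_error, null_value=0.0, alpha=0.05):
        """Illustrative three-decision rule for a real-valued parameter.

        Declare the parameter above or below the null value when the
        two-sided (1 - alpha) confidence interval excludes it; otherwise
        make no directional claim.
        """
        z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
        lower = estimate - z * std_error
        upper = estimate + z * std_error
        if lower > null_value:
            return "above"      # assert: parameter exceeds the null value
        if upper < null_value:
            return "below"      # assert: parameter is less than the null value
        return "no claim"       # say nothing either way

    # Hypothetical example: estimate 1.8, standard error 0.7, null value 0
    print(three_decision(1.8, 0.7))  # "above"; the 95% CI is roughly (0.43, 3.17)

Under this formulation, a conventional level-alpha two-sided test that rejects the null corresponds to asserting a sign, and a non-rejection corresponds to saying nothing either way.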

Literature Cited

  1. Ahuja A. 2019. Scientists strike back on statistical tyranny. Financial Times, March 27. https://www.ft.com/content/36f9374c-5075-11e9-8f44-fe4a86c48b33
  2. Altman DG, Gore SM, Gardner MJ, Pocock SJ. 1983. Statistical guidelines for contributors to medical journals. Br. Med. J. 286(6376):1489
  3. Amrhein V, Greenland S, McShane B. 2019. Scientists rise up against statistical significance. Nature 567:305–7
  4. Bababekov YJ, Chang DC. 2019. Post hoc power: a surgeon's first assistant in interpreting “negative” studies. Ann. Surg. 269(1):e11–12
  5. Bahadur RR. 1952. A property of the t-statistic. Sankhyā 12(1/2):79–88
  6. Bansal NK, Sheng R. 2010. Bayesian decision theoretic approach to hypothesis problems with skewed alternatives. J. Stat. Plan. Inference 140(10):2894–903
  7. Barnett V. 1999. Comparative Statistical Inference. New York: Wiley, 3rd ed.
  8. Bayarri M, Benjamin DJ, Berger JO, Sellke TM. 2016. Rejection odds and rejection ratios: a proposal for statistical practice in testing hypotheses. J. Math. Psychol. 72:90–103
  9. Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers EJ et al. 2018. Redefine statistical significance. Nat. Hum. Behav. 2(1):6–10
  10. Benjamini Y, De Veaux RD, Efron B, Evans S, Glickman M et al. 2021. The ASA president's task force statement on statistical significance and replicability. Ann. Appl. Stat. 15(3):1084–85
  11. Berg N. 2004. No-decision classification: an alternative to testing for statistical significance. J. Socio-Econ. 33(5):631–50
  12. Berger JO, Sellke T. 1987. Testing a point null hypothesis: the irreconcilability of P values and evidence. J. Am. Stat. Assoc. 82(397):112–22
  13. Bernardo JM, Smith AF. 2009. Bayesian Theory. New York: Wiley
  14. Bland JM, Altman DG. 1994. Statistics notes: one and two sided tests of significance. BMJ 309(6949):248
  15. Bland JM, Altman DG. 1995. Multiple significance tests: the Bonferroni method. BMJ 310(6973):170
  16. Bohrer R. 1979. Multiple three-decision rules for parametric signs. J. Am. Stat. Assoc. 74(366a):432–37
  17. Casella G, Berger RL. 2021. Statistical Inference. Independence, KY: Cengage
  18. Cohen J. 1994. The earth is round (p < .05). Am. Psychol. 49(12):997–1003
  19. Cox DR. 1958. Some problems connected with statistical inference. Ann. Math. Stat. 29(2):357–72
  20. Cox DR. 2006. Principles of Statistical Inference. Cambridge, UK: Cambridge Univ. Press
  21. Cox DR, Hinkley D. 1974. Theoretical Statistics. Boca Raton, FL: Chapman and Hall/CRC
  22. Cox DR, Spjøtvoll E, Johansen S, van Zwet WR, Bithell J et al. 1977. The role of significance tests [with discussion and reply]. Scand. J. Stat. 4(2):49–70
  23. Duncan DB. 1965. A Bayesian approach to multiple comparisons. Technometrics 7(2):171–222
  24. Esteves LG, Izbicki R, Stern JM, Stern RB. 2016. The logical consistency of simultaneous agnostic hypothesis tests. Entropy 18(7):256
  25. Evans M, Moshonov H. 2006. Checking for prior-data conflict. Bayesian Anal. 1(4):893–914
  26. Fisher R. 1935a. The Design of Experiments. Edinburgh, UK: Oliver & Boyd
  27. Fisher R. 1935b. The logic of inductive inference (with discussion). J. R. Stat. Soc. 98:39–82
  28. Forstmeier W, Wagenmakers EJ, Parker TH. 2017. Detecting and avoiding likely false-positive findings—a practical guide. Biol. Rev. 92(4):1941–68
  29. Gabriel KR. 1969. Simultaneous test procedures–some theory of multiple comparisons. Ann. Math. Stat. 40(1):224–50
  30. Gelman A. 2016. The problems with p-values are not just with p-values. Supplemental material to the ASA statement on statistical significance and p-values. Am. Stat. 70(Suppl.):1–2
  31. Gelman A. 2019. Comment on “Post-hoc power using observed estimate of effect size is too noisy to be useful.” Ann. Surg. 270(2):e64
  32. Gelman A, Carlin J. 2014. Beyond power calculations: assessing type S (sign) and type M (magnitude) errors. Perspect. Psychol. Sci. 9(6):641–51
  33. Gelman A, Loken E. 2014. The statistical crisis in science: data-dependent analysis—a “garden of forking paths”—explains why many statistically significant comparisons don't hold up. Am. Sci. 102(6):460
  34. Gelman A, Tuerlinckx F. 2000. Type S error rates for classical and Bayesian single and multiple comparison procedures. Comput. Stat. 15(3):373–90
  35. Ghosh BK, Sen PK. 1991. Handbook of Sequential Analysis. Boca Raton, FL: Chapman and Hall/CRC
  36. Good I. 1950. Probability and the Weighing of Evidence. London: Charles Griffin
  37. Goodman S. 2008. A dirty dozen: twelve p-value misconceptions. Semin. Hematol. 45(3):135–40
  38. Goodman SN. 2016. Aligning statistical and scientific reasoning. Science 352(6290):1180–81
  39. Greenwald A, Gonzalez R, Harris RJ, Guthrie D. 1996. Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology 33(2):175–83
  40. Hand DJ. 2022. Trustworthiness of statistical inference. J. R. Stat. Soc. Ser. A 185:329–47
  41. Hannig J, Iyer H, Lai RC, Lee TC. 2016. Generalized fiducial inference: a review and new results. J. Am. Stat. Assoc. 111(515):1346–61
  42. Hansen S, Rice K. 2022. Coherent tests for interval null hypotheses. Am. Stat. In press. https://doi.org/10.1080/00031305.2022.2050299
  43. Hardwicke TE, Ioannidis JP. 2019. Petitions in scientific argumentation: dissecting the request to retire statistical significance. Eur. J. Clin. Investig. 49(10):e13162
  44. Harris RJ. 1997. Reforming significance testing via three-valued logic. In What If There Were No Significance Tests?, ed. LL Harlow, SA Mulaik, JH Steiger, pp. 145–74. London: Routledge
  45. Hartigan J. 1966. Note on the confidence-prior of Welch and Peers. J. R. Stat. Soc. Ser. B 28(1):55–56
  46. Heinsberg LW, Weeks DE. 2022. Post hoc power is not informative. Genet. Epidemiol. 46(7):390–94
  47. Held L, Matthews R, Ott M, Pawel S. 2021. Reverse-Bayes methods for evidence assessment and research synthesis. Res. Synthesis Methods 13:295–314
  48. Hernández G, Cavalcanti AB, Ospina-Tascón G, Dubin A, Hurtado FJ et al. 2018. Statistical analysis plan for early goal-directed therapy using a physiological holistic view—the ANDROMEDA-SHOCK: a randomized controlled trial. Rev. Bras. Ter. Intensiva 30(3):253
  49. Hernández G, Ospina-Tascón GA, Damiani LP, Estenssoro E, Dubin A et al. 2019. Effect of a resuscitation strategy targeting peripheral perfusion status vs serum lactate levels on 28-day mortality among patients with septic shock: the ANDROMEDA-SHOCK randomized clinical trial. JAMA 321(7):654–64
  50. Hoenig JM, Heisey DM. 2001. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am. Stat. 55(1):19–24
  51. Hubbard R, Bayarri MJ. 2003. Confusion over measures of evidence (p's) versus errors (α's) in classical statistical testing. Am. Stat. 57(3):171–78
  52. Hunter JE. 1997. Needed: a ban on the significance test. Psychol. Sci. 8(1):3–7
  53. Hurlbert SH, Lombardi CM. 2009. Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Ann. Zool. Fennici 46(5):311–49
  54. Jeffreys H. 1935. Some tests of significance, treated by the theory of probability. Math. Proc. Camb. Philos. Soc. 31:203–22
  55. Jeffreys H. 1980. Some general points in probability theory. In Bayesian Analysis in Econometrics and Statistics: Essays in Honor of Harold Jeffreys, ed. A Zellner, J Kadane, pp. 451–53. Amsterdam: North-Holland
  56. Johnson VE. 2013. Revised standards for statistical evidence. PNAS 110(48):19313–17
  57. Jones LV, Tukey JW. 2000. A sensible formulation of the significance test. Psychol. Methods 5(4):411–14
  58. Jonsson F. 2013. Characterizing optimality among three-decision procedures for directional conclusions. J. Stat. Plan. Inference 143(2):392–99
  59. Kaiser HF. 1960. Directional statistical decisions. Psychol. Rev. 67(3):160–67
  60. Krakauer C, Rice K. 2021. Discussion of “Testing by betting: a strategy for statistical and scientific communication” by Glenn Shafer. J. R. Stat. Soc. Ser. A 184(2):452–53
  61. Lehmann EL. 1950. Some principles of the theory of testing hypotheses. Ann. Math. Stat. 21(1):1–26
  62. Lehmann EL. 1957a. A theory of some multiple decision problems, I. Ann. Math. Stat. 28:1–25
  63. Lehmann EL. 1957b. A theory of some multiple decision problems, II. Ann. Math. Stat. 28:547–72
  64. Lewis C, Thayer DT. 2004. A loss function related to the FDR for random effects multiple comparisons. J. Stat. Plan. Inference 125(1–2):49–58
  65. Lewis C, Thayer DT. 2009. Bayesian decision theory for multiple comparisons. In Optimality: The Third Erich L. Lehmann Symposium, ed. J Rojo, pp. 326–32. N.p.: Inst. Math. Stat.
  66. Lewis C, Thayer DT. 2013. Undesirable optimality results in multiple testing? Stat. Model. 13(5–6):541–51
  67. Lindley DV. 1957. A statistical paradox. Biometrika 44(1/2):187–92
  68. Lindley DV. 1997. The choice of sample size. J. R. Stat. Soc. Ser. D 46(2):129–38
  69. Longford N. 2020. Discussion on the meeting on ‘Signs and sizes: understanding and replicating statistical findings.’ J. R. Stat. Soc. Ser. A 183(2):451
  70. Matthews R, Wasserstein R, Spiegelhalter D. 2017. The ASA's p-value statement, one year on. Significance 14(2):38–41
  71. Matthews RA. 2001. Methods for assessing the credibility of clinical trial outcomes. Drug Inform. J. 35(4):1469–78
  72. Matthews RA. 2018. Beyond ‘significance’: principles and practice of the analysis of credibility. R. Soc. Open Sci. 5(1):171047
  73. Mayo DG, Spanos A. 2006. Severe testing as a basic concept in a Neyman–Pearson philosophy of induction. Br. J. Philos. Sci. 57(2):323–57
  74. McShane BB, Gal D. 2017. Statistical significance and the dichotomization of evidence. J. Am. Stat. Assoc. 112(519):885–95
  75. McShane BB, Gal D, Gelman A, Robert C, Tackett JL. 2019. Abandon statistical significance. Am. Stat. 73(Suppl.):235–45
  76. Mosteller F. 1948. A k-sample slippage test for an extreme population. Ann. Math. Stat. 19:58–65
  77. Neyman J. 1952. Lectures and Conferences on Mathematical Statistics and Probability. Washington, DC: USDA
  78. Neyman J, Pearson ES. 1933. IX. On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. R. Soc. Lond. Ser. A 231(694–706):289–337
  79. Nuzzo R. 2014. Scientific method: statistical errors. Nat. News 506(7487):150
  80. O'Hagan A, Stevens JW. 2001. Bayesian assessment of sample size for clinical trials of cost-effectiveness. Med. Decis. Making 21(3):219–30
  81. Perlman MD, Wu L. 1999. The emperor's new tests. Stat. Sci. 14(4):355–69
  82. Rafi Z, Greenland S. 2020. Semantic and cognitive tools to aid statistical science: replace confidence and significance by compatibility and surprise. BMC Med. Res. Methodol. 20(1):244
  83. Reid N, Cox DR. 2015. On some principles of statistical inference. Int. Stat. Rev. 83(2):293–308
  84. Rice K. 2010. A decision-theoretic formulation of Fisher's approach to testing. Am. Stat. 64(4):345–49
  85. Rice K, Bonnett T, Krakauer C. 2020. Knowing the signs: a direct and generalizable motivation of two-sided tests. J. R. Stat. Soc. Ser. A 183(2):411–30
  86. Rice K, Ye L. 2022. Expressing regret: a unified view of credible intervals. Am. Stat. 76:248–56
  87. Robbins H. 1951. Asymptotically subminimax solutions of compound statistical decision problems. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Vol. 2, ed. J Neyman, pp. 131–49. Berkeley: Univ. Calif. Press
  88. Robert C. 2007. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. New York: Springer
  89. Rothman KJ. 1990. No adjustments are needed for multiple comparisons. Epidemiology 1:43–46
  90. Royall RM. 1986. The effect of sample size on the meaning of significance tests. Am. Stat. 40(4):313–15
  91. Rukhin AL. 1988. Loss functions for loss estimation. Ann. Stat. 16:1262–69
  92. Sarkar SK, Zhou T. 2008. Controlling Bayes directional false discovery rate in random effects model. J. Stat. Plan. Inference 138(3):682–93
  93. Schervish MJ. 1995. Theory of Statistics. New York: Springer
  94. Schervish MJ. 1996. P values: what they are and what they are not. Am. Stat. 50(3):203–6
  95. Sekhon H, Ennew C, Kharouf H, Devlin J. 2014. Trustworthiness and trust: influences and implications. J. Mark. Manag. 30(3–4):409–30
  96. Senn S. 2001. Two cheers for p-values? J. Epidemiol. Biostat. 6(2):193–204
  97. Shafer G. 2021. Testing by betting: a strategy for statistical and scientific communication. J. R. Stat. Soc. Ser. A 184(2):407–31
  98. Shaffer JP. 2002. Multiplicity, directional (type III) errors, and the null hypothesis. Psychol. Methods 7(3):356–69
  99. Spiegelhalter DJ. 2020. Andromeda and ‘appalling science’: a response to Hardwicke and Ioannidis. Winton Centre Blog, Jan. 25. https://medium.com/wintoncentre/andromeda-and-appalling-science-a-response-to-hardwicke-and-ioannidis-a79458efdba1
  100. Spiegelhalter DJ, Abrams KR, Myles JP. 2004. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. New York: Wiley
  101. Stephens M. 2017. False discovery rates: a new deal. Biostatistics 18(2):275–94
  102. Sun W, Cai TT. 2007. Oracle and adaptive compound decision rules for false discovery rate control. J. Am. Stat. Assoc. 102(479):901–12
  103. Thulin M. 2014. Decision-theoretic justifications for Bayesian hypothesis testing using credible sets. J. Stat. Plan. Inference 146:133–38
  104. Tsiatis AA. 1981. The asymptotic joint distribution of the efficient scores test for the proportional hazards model calculated over time. Biometrika 68(1):311–15
  105. Van der Vaart AW. 2000. Asymptotic Statistics, Vol. 3. Cambridge, UK: Cambridge Univ. Press
  106. Wakefield J. 2009. Bayes factors for genome-wide association studies: comparison with p-values. Genet. Epidemiol. 33(1):79–86
  107. Wasserstein RL, Lazar NA. 2016. The ASA statement on p-values: context, process, and purpose. Am. Stat. 70(2):129–33
  108. Williams VS, Jones LV, Tukey JW. 1999. Controlling error in multiple comparisons, with examples from state-to-state differences in educational achievement. J. Educ. Behav. Stat. 24(1):42–69
  109. Wood J, Freemantle N, King M, Nazareth I. 2014. Trap of trends to statistical significance: likelihood of near significant P value becoming more significant with extra data. BMJ 348:g2215
  110. Ye L, Rice K. 2021. Bayesian optimality and intervals for Stein-type estimates. Stat 11:e445
  111. Zampieri FG, Damiani LP, Bakker J, Ospina-Tascón GA, Castro R et al. 2020. Effects of a resuscitation strategy targeting peripheral perfusion status versus serum lactate levels among patients with septic shock. A Bayesian reanalysis of the ANDROMEDA-SHOCK trial. Am. J. Respir. Crit. Care Med. 201(4):423–29