
Abstract

For real-valued parameters, significance tests can be motivated as three-decision methods, in which we either assert that the parameter lies above or below a specified null value, or say nothing either way. Tukey viewed this as a “sensible formulation” of tests, unlike the widely taught null hypothesis significance testing (NHST) system that is today's default. We review the three-decision framework, collecting the substantial literature on how other statistical tools can be usefully motivated in this way. These tools include close Bayesian analogs of frequentist power calculations, p-values, confidence intervals, and multiple testing corrections. We also show how three-decision arguments can straightforwardly resolve some well-known difficulties in the interpretation and criticism of testing results. Explicit results are shown for simple conjugate analyses, but the methods discussed apply generally to real-valued parameters.
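To make the three-decision idea concrete, the following is a minimal sketch (illustrative only, not taken from the article) of such a rule for a single real-valued parameter. It assumes a Gaussian approximation for the estimator and bases the decision on a conventional two-sided confidence interval; the function name, labels, and numerical example are hypothetical.

    from statistics import NormalDist

    def three_decision(estimate, std_error, null_value=0.0, alpha=0.05):
        """Illustrative three-decision rule for a real-valued parameter.

        Declare the parameter above or below the null value when the
        two-sided (1 - alpha) confidence interval excludes it; otherwise
        make no directional claim.
        """
        z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
        lower = estimate - z * std_error
        upper = estimate + z * std_error
        if lower > null_value:
            return "above"      # assert: parameter exceeds the null value
        if upper < null_value:
            return "below"      # assert: parameter is less than the null value
        return "no claim"       # say nothing either way

    # Hypothetical example: estimate 1.8, standard error 0.7, null value 0
    print(three_decision(1.8, 0.7))  # "above"; the 95% CI is roughly (0.43, 3.17)

Under this formulation, a conventional level-alpha two-sided test that rejects the null corresponds to asserting a sign, and a non-rejection corresponds to saying nothing either way.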

Literature Cited

  1. Ahuja A. 2019. Scientists strike back on statistical tyranny. Financial Times, March 27. https://www.ft.com/content/36f9374c-5075-11e9-8f44-fe4a86c48b33
  2. Altman DG, Gore SM, Gardner MJ, Pocock SJ. 1983. Statistical guidelines for contributors to medical journals. Br. Med. J. 286(6376):1489
  3. Amrhein V, Greenland S, McShane B. 2019. Scientists rise up against statistical significance. Nature 567:305–7
  4. Bababekov YJ, Chang DC. 2019. Post hoc power: a surgeon's first assistant in interpreting “negative” studies. Ann. Surg. 269(1):e11–12
  5. Bahadur RR. 1952. A property of the t-statistic. Sankhyā 12(1/2):79–88
  6. Bansal NK, Sheng R. 2010. Bayesian decision theoretic approach to hypothesis problems with skewed alternatives. J. Stat. Plan. Inference 140(10):2894–903
  7. Barnett V. 1999. Comparative Statistical Inference. New York: Wiley, 3rd ed.
  8. Bayarri M, Benjamin DJ, Berger JO, Sellke TM. 2016. Rejection odds and rejection ratios: a proposal for statistical practice in testing hypotheses. J. Math. Psychol. 72:90–103
  9. Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers EJ et al. 2018. Redefine statistical significance. Nat. Hum. Behav. 2(1):6–10
  10. Benjamini Y, De Veaux RD, Efron B, Evans S, Glickman M et al. 2021. The ASA president's task force statement on statistical significance and replicability. Ann. Appl. Stat. 15(3):1084–85
  11. Berg N. 2004. No-decision classification: an alternative to testing for statistical significance. J. Socio-Econ. 33(5):631–50
  12. Berger JO, Sellke T. 1987. Testing a point null hypothesis: the irreconcilability of P values and evidence. J. Am. Stat. Assoc. 82(397):112–22
  13. Bernardo JM, Smith AF. 2009. Bayesian Theory. New York: Wiley
  14. Bland JM, Altman DG. 1994. Statistics notes: one and two sided tests of significance. BMJ 309(6949):248
  15. Bland JM, Altman DG. 1995. Multiple significance tests: the Bonferroni method. BMJ 310(6973):170
  16. Bohrer R. 1979. Multiple three-decision rules for parametric signs. J. Am. Stat. Assoc. 74(366a):432–37
  17. Casella G, Berger RL. 2021. Statistical Inference. Independence, KY: Cengage
  18. Cohen J. 1994. The earth is round (p < .05). Am. Psychol. 49(12):997–1003
  19. Cox DR. 1958. Some problems connected with statistical inference. Ann. Math. Stat. 29(2):357–72
  20. Cox DR. 2006. Principles of Statistical Inference. Cambridge, UK: Cambridge Univ. Press
  21. Cox DR, Hinkley D. 1974. Theoretical Statistics. Boca Raton, FL: Chapman and Hall/CRC
  22. Cox DR, Spjøtvoll E, Johansen S, van Zwet WR, Bithell J et al. 1977. The role of significance tests [with discussion and reply]. Scand. J. Stat. 4(2):49–70
  23. Duncan DB. 1965. A Bayesian approach to multiple comparisons. Technometrics 7(2):171–222
  24. Esteves LG, Izbicki R, Stern JM, Stern RB. 2016. The logical consistency of simultaneous agnostic hypothesis tests. Entropy 18(7):256
  25. Evans M, Moshonov H. 2006. Checking for prior-data conflict. Bayesian Anal. 1(4):893–914
  26. Fisher R. 1935a. The Design of Experiments. Edinburgh, UK: Oliver & Boyd
  27. Fisher R. 1935b. The logic of inductive inference (with discussion). J. R. Stat. Soc. 98:39–82
  28. Forstmeier W, Wagenmakers EJ, Parker TH. 2017. Detecting and avoiding likely false-positive findings—a practical guide. Biol. Rev. 92(4):1941–68
  29. Gabriel KR. 1969. Simultaneous test procedures–some theory of multiple comparisons. Ann. Math. Stat. 40(1):224–50
  30. Gelman A. 2016. The problems with p-values are not just with p-values. Supplemental material to the ASA statement on statistical significance and p-values. Am. Stat. 70(Suppl.):1–2
  31. Gelman A. 2019. Comment on “Post-hoc power using observed estimate of effect size is too noisy to be useful.” Ann. Surg. 270(2):e64
  32. Gelman A, Carlin J. 2014. Beyond power calculations: assessing type S (sign) and type M (magnitude) errors. Perspect. Psychol. Sci. 9(6):641–51
  33. Gelman A, Loken E. 2014. The statistical crisis in science: data-dependent analysis—a “garden of forking paths”—explains why many statistically significant comparisons don't hold up. Am. Sci. 102(6):460
  34. Gelman A, Tuerlinckx F. 2000. Type S error rates for classical and Bayesian single and multiple comparison procedures. Comput. Stat. 15(3):373–90
  35. Ghosh BK, Sen PK. 1991. Handbook of Sequential Analysis. Boca Raton, FL: Chapman and Hall/CRC
  36. Good I. 1950. Probability and the Weighing of Evidence. London: Charles Griffin
  37. Goodman S. 2008. A dirty dozen: twelve p-value misconceptions. Semin. Hematol. 45(3):135–40
  38. Goodman SN. 2016. Aligning statistical and scientific reasoning. Science 352(6290):1180–81
  39. Greenwald A, Gonzalez R, Harris RJ, Guthrie D. 1996. Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology 33(2):175–83
  40. Hand DJ. 2022. Trustworthiness of statistical inference. J. R. Stat. Soc. Ser. A 185:329–47
  41. Hannig J, Iyer H, Lai RC, Lee TC. 2016. Generalized fiducial inference: a review and new results. J. Am. Stat. Assoc. 111(515):1346–61
  42. Hansen S, Rice K. 2022. Coherent tests for interval null hypotheses. Am. Stat. In press. https://doi.org/10.1080/00031305.2022.2050299
  43. Hardwicke TE, Ioannidis JP. 2019. Petitions in scientific argumentation: dissecting the request to retire statistical significance. Eur. J. Clin. Investig. 49(10):e13162
  44. Harris RJ. 1997. Reforming significance testing via three-valued logic. In What If There Were No Significance Tests?, ed. LL Harlow, SA Mulaik, JH Steiger, pp. 145–74. London: Routledge
  45. Hartigan J. 1966. Note on the confidence-prior of Welch and Peers. J. R. Stat. Soc. Ser. B 28(1):55–56
  46. Heinsberg LW, Weeks DE. 2022. Post hoc power is not informative. Genet. Epidemiol. 46(7):390–94
  47. Held L, Matthews R, Ott M, Pawel S. 2021. Reverse-Bayes methods for evidence assessment and research synthesis. Res. Synthesis Methods 13:295–314
  48. Hernández G, Cavalcanti AB, Ospina-Tascón G, Dubin A, Hurtado FJ et al. 2018. Statistical analysis plan for early goal-directed therapy using a physiological holistic view—the ANDROMEDA-SHOCK: a randomized controlled trial. Rev. Bras. Ter. Intensiva 30(3):253
  49. Hernández G, Ospina-Tascón GA, Damiani LP, Estenssoro E, Dubin A et al. 2019. Effect of a resuscitation strategy targeting peripheral perfusion status vs serum lactate levels on 28-day mortality among patients with septic shock: the ANDROMEDA-SHOCK randomized clinical trial. JAMA 321(7):654–64
  50. Hoenig JM, Heisey DM. 2001. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am. Stat. 55(1):19–24
  51. Hubbard R, Bayarri MJ. 2003. Confusion over measures of evidence (p's) versus errors (α's) in classical statistical testing. Am. Stat. 57(3):171–78
  52. Hunter JE. 1997. Needed: a ban on the significance test. Psychol. Sci. 8(1):3–7
  53. Hurlbert SH, Lombardi CM. 2009. Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Ann. Zool. Fennici 46(5):311–49
  54. Jeffreys H. 1935. Some tests of significance, treated by the theory of probability. Math. Proc. Camb. Philos. Soc. 31:203–22
  55. Jeffreys H. 1980. Some general points in probability theory. In Bayesian Analysis in Econometrics and Statistics: Essays in Honor of Harold Jeffreys, ed. A Zellner, J Kadane, pp. 451–53. Amsterdam: North-Holland
  56. Johnson VE. 2013. Revised standards for statistical evidence. PNAS 110(48):19313–17
  57. Jones LV, Tukey JW. 2000. A sensible formulation of the significance test. Psychol. Methods 5(4):411–14
  58. Jonsson F. 2013. Characterizing optimality among three-decision procedures for directional conclusions. J. Stat. Plan. Inference 143(2):392–99
  59. Kaiser HF. 1960. Directional statistical decisions. Psychol. Rev. 67(3):160–67
  60. Krakauer C, Rice K. 2021. Discussion of “Testing by betting: a strategy for statistical and scientific communication” by Glenn Shafer. J. R. Stat. Soc. Ser. A 184(2):452–53
  61. Lehmann EL. 1950. Some principles of the theory of testing hypotheses. Ann. Math. Stat. 21(1):1–26
  62. Lehmann EL. 1957a. A theory of some multiple decision problems, I. Ann. Math. Stat. 28:1–25
  63. Lehmann EL. 1957b. A theory of some multiple decision problems, II. Ann. Math. Stat. 28:547–72
  64. Lewis C, Thayer DT. 2004. A loss function related to the FDR for random effects multiple comparisons. J. Stat. Plan. Inference 125(1–2):49–58
  65. Lewis C, Thayer DT. 2009. Bayesian decision theory for multiple comparisons. In Optimality: The Third Erich L. Lehmann Symposium, ed. J Rojo, pp. 326–32. N.p.: Inst. Math. Stat.
  66. Lewis C, Thayer DT. 2013. Undesirable optimality results in multiple testing? Stat. Model. 13(5–6):541–51
  67. Lindley DV. 1957. A statistical paradox. Biometrika 44(1/2):187–92
  68. Lindley DV. 1997. The choice of sample size. J. R. Stat. Soc. Ser. D 46(2):129–38
  69. Longford N. 2020. Discussion on the meeting on ‘Signs and sizes: understanding and replicating statistical findings.’ J. R. Stat. Soc. Ser. A 183(2):451
  70. Matthews R, Wasserstein R, Spiegelhalter D. 2017. The ASA's p-value statement, one year on. Significance 14(2):38–41
  71. Matthews RA. 2001. Methods for assessing the credibility of clinical trial outcomes. Drug Inform. J. 35(4):1469–78
  72. Matthews RA. 2018. Beyond ‘significance’: principles and practice of the analysis of credibility. R. Soc. Open Sci. 5(1):171047
  73. Mayo DG, Spanos A. 2006. Severe testing as a basic concept in a Neyman–Pearson philosophy of induction. Br. J. Philos. Sci. 57(2):323–57
  74. McShane BB, Gal D. 2017. Statistical significance and the dichotomization of evidence. J. Am. Stat. Assoc. 112(519):885–95
  75. McShane BB, Gal D, Gelman A, Robert C, Tackett JL. 2019. Abandon statistical significance. Am. Stat. 73(Suppl.):235–45
  76. Mosteller F. 1948. A k-sample slippage test for an extreme population. Ann. Math. Stat. 19:58–65
  77. Neyman J. 1952. Lectures and Conferences on Mathematical Statistics and Probability. Washington, DC: USDA
  78. Neyman J, Pearson ES. 1933. IX. On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. R. Soc. Lond. Ser. A 231(694–706):289–337
  79. Nuzzo R. 2014. Scientific method: statistical errors. Nat. News 506(7487):150
  80. O'Hagan A, Stevens JW. 2001. Bayesian assessment of sample size for clinical trials of cost-effectiveness. Med. Decis. Making 21(3):219–30
  81. Perlman MD, Wu L. 1999. The emperor's new tests. Stat. Sci. 14(4):355–69
  82. Rafi Z, Greenland S. 2020. Semantic and cognitive tools to aid statistical science: replace confidence and significance by compatibility and surprise. BMC Med. Res. Methodol. 20(1):244
  83. Reid N, Cox DR. 2015. On some principles of statistical inference. Int. Stat. Rev. 83(2):293–308
  84. Rice K. 2010. A decision-theoretic formulation of Fisher's approach to testing. Am. Stat. 64(4):345–49
  85. Rice K, Bonnett T, Krakauer C. 2020. Knowing the signs: a direct and generalizable motivation of two-sided tests. J. R. Stat. Soc. Ser. A 183(2):411–30
  86. Rice K, Ye L. 2022. Expressing regret: a unified view of credible intervals. Am. Stat. 76:248–56
  87. Robbins H. 1951. Asymptotically subminimax solutions of compound statistical decision problems. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Vol. 2, ed. J Neyman, pp. 131–49. Berkeley: Univ. Calif. Press
  88. Robert C. 2007. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. New York: Springer
  89. Rothman KJ. 1990. No adjustments are needed for multiple comparisons. Epidemiology 1:43–46
  90. Royall RM. 1986. The effect of sample size on the meaning of significance tests. Am. Stat. 40(4):313–15
  91. Rukhin AL. 1988. Loss functions for loss estimation. Ann. Stat. 16:1262–69
  92. Sarkar SK, Zhou T. 2008. Controlling Bayes directional false discovery rate in random effects model. J. Stat. Plan. Inference 138(3):682–93
  93. Schervish MJ. 1995. Theory of Statistics. New York: Springer
  94. Schervish MJ. 1996. P values: what they are and what they are not. Am. Stat. 50(3):203–6
  95. Sekhon H, Ennew C, Kharouf H, Devlin J. 2014. Trustworthiness and trust: influences and implications. J. Mark. Manag. 30(3–4):409–30
  96. Senn S. 2001. Two cheers for p-values? J. Epidemiol. Biostat. 6(2):193–204
  97. Shafer G. 2021. Testing by betting: a strategy for statistical and scientific communication. J. R. Stat. Soc. Ser. A 184(2):407–31
  98. Shaffer JP. 2002. Multiplicity, directional (type III) errors, and the null hypothesis. Psychol. Methods 7(3):356–69
  99. Spiegelhalter DJ. 2020. Andromeda and ‘appalling science’: a response to Hardwicke and Ioannidis. Winton Centre Blog, Jan. 25. https://medium.com/wintoncentre/andromeda-and-appalling-science-a-response-to-hardwicke-and-ioannidis-a79458efdba1
  100. Spiegelhalter DJ, Abrams KR, Myles JP. 2004. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. New York: Wiley
  101. Stephens M. 2017. False discovery rates: a new deal. Biostatistics 18(2):275–94
  102. Sun W, Cai TT. 2007. Oracle and adaptive compound decision rules for false discovery rate control. J. Am. Stat. Assoc. 102(479):901–12
  103. Thulin M. 2014. Decision-theoretic justifications for Bayesian hypothesis testing using credible sets. J. Stat. Plan. Inference 146:133–38
  104. Tsiatis AA. 1981. The asymptotic joint distribution of the efficient scores test for the proportional hazards model calculated over time. Biometrika 68(1):311–15
  105. Van der Vaart AW. 2000. Asymptotic Statistics, Vol. 3. Cambridge, UK: Cambridge Univ. Press
  106. Wakefield J. 2009. Bayes factors for genome-wide association studies: comparison with p-values. Genet. Epidemiol. 33(1):79–86
  107. Wasserstein RL, Lazar NA. 2016. The ASA statement on p-values: context, process, and purpose. Am. Stat. 70(2):129–33
  108. Williams VS, Jones LV, Tukey JW. 1999. Controlling error in multiple comparisons, with examples from state-to-state differences in educational achievement. J. Educ. Behav. Stat. 24(1):42–69
  109. Wood J, Freemantle N, King M, Nazareth I. 2014. Trap of trends to statistical significance: likelihood of near significant P value becoming more significant with extra data. BMJ 348:g2215
  110. Ye L, Rice K. 2021. Bayesian optimality and intervals for Stein-type estimates. Stat 11:e445
  111. Zampieri FG, Damiani LP, Bakker J, Ospina-Tascón GA, Castro R et al. 2020. Effects of a resuscitation strategy targeting peripheral perfusion status versus serum lactate levels among patients with septic shock. A Bayesian reanalysis of the ANDROMEDA-SHOCK trial. Am. J. Respir. Crit. Care Med. 201(4):423–29