Reliable clinical diagnosis is essential both to sound clinical decision making and to productive clinical research. This review emphasizes the distinctions between a disorder and a diagnosis, and between the validity and the reliability of a diagnosis, as well as the relationships among these concepts; crucially, reliable diagnoses are a prerequisite for establishing valid ones. The review then discusses the theoretical background for evaluating diagnoses, possible designs for reliability studies, estimation of the reliability coefficient, standards for assessing reliability, and strategies for improving reliability without compromising validity.
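For categorical diagnoses, the reliability coefficient most often estimated is Cohen's kappa, the chance-corrected agreement between two raters. The sketch below is illustrative only (the rater labels and data are hypothetical); it computes kappa as (observed agreement − chance agreement) / (1 − chance agreement), with chance agreement derived from each rater's marginal diagnosis rates:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' categorical labels."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of subjects on which the raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement under independence, from each rater's marginal rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical data: two raters classify 10 subjects as case (1) or non-case (0).
a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
b = [1, 0, 0, 0, 1, 0, 1, 0, 0, 1]
print(round(cohens_kappa(a, b), 3))  # → 0.583
```

Here the raters agree on 8 of 10 subjects (observed agreement 0.80), but with each diagnosing 4 of 10 as cases, agreement of 0.52 is expected by chance alone, so kappa is (0.80 − 0.52)/(1 − 0.52) ≈ 0.583, well below the raw agreement rate.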

Keywords: design, disorder, kappa, validity





  • Article Type: Review Article