
Abstract

This review introduces classical item response theory (IRT) models as well as more contemporary extensions to the case of multilevel, multidimensional, and mixtures of discrete and continuous latent variables through the lens of discrete multivariate analysis. A general modeling framework is discussed, and the applications of this framework in diverse contexts are presented, including large-scale educational surveys, randomized efficacy studies, and diagnostic measurement. Other topics covered include parameter estimation and model fit evaluation. Both classical (numerical integration based) and more modern (stochastic) parameter estimation approaches are discussed. Similarly, limited information goodness-of-fit testing and posterior predictive model checking are reviewed and contrasted. The review concludes with a discussion of some emerging strands in IRT research such as response time modeling, crossed random effects models, and non-standard models for response processes.

/content/journals/10.1146/annurev-statistics-041715-033702
2016-06-01
2024-06-25

  • Article Type: Review Article