Item Response Theory

Li Cai; Kilchan Choi; Mark Hansen; Lauren Harrell

doi:10.1146/annurev-statistics-041715-033702

Annual Review of Statistics and Its Application

Volume 3, 2016

Review Article

Affiliations: CRESST, University of California, Los Angeles, California 90095-1521
Vol. 3:297-321 (Volume publication date June 2016) https://doi.org/10.1146/annurev-statistics-041715-033702
First published as a Review in Advance on March 02, 2016
© Annual Reviews

Abstract

This review introduces classical item response theory (IRT) models as well as more contemporary extensions to the case of multilevel, multidimensional, and mixtures of discrete and continuous latent variables through the lens of discrete multivariate analysis. A general modeling framework is discussed, and the applications of this framework in diverse contexts are presented, including large-scale educational surveys, randomized efficacy studies, and diagnostic measurement. Other topics covered include parameter estimation and model fit evaluation. Both classical (numerical integration based) and more modern (stochastic) parameter estimation approaches are discussed. Similarly, limited information goodness-of-fit testing and posterior predictive model checking are reviewed and contrasted. The review concludes with a discussion of some emerging strands in IRT research such as response time modeling, crossed random effects models, and non-standard models for response processes.

Keywords: diagnostic classification models, discrete multivariate analysis, item factor analysis, model fit testing, multidimensional IRT, multilevel IRT
