
Abstract

Educational measurement assigns numbers to individuals based on observed data to represent their educational properties, such as abilities, aptitudes, achievements, progress, and performance. This review introduces a selection of statistical applications to educational measurement, ranging from classical statistical theory (e.g., the Pearson correlation and the Mantel–Haenszel test) to more sophisticated models (e.g., latent variable, survival, and mixture modeling) and to statistical and machine learning methods (e.g., high-dimensional modeling, deep learning, and reinforcement learning). Three main subjects are discussed: evaluation of test validity, computer-based assessments, and psychometrics informing learning. Specific topics include item bias detection, high-dimensional latent variable modeling, computerized adaptive testing, response time and log data analysis, cognitive diagnostic models, and individualized learning.
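As a concrete illustration of one of the classical tools named above, the Mantel–Haenszel procedure screens an item for differential item functioning (DIF) by stratifying examinees on a matching score and pooling the resulting 2×2 tables of group (reference/focal) by item response (correct/incorrect). The minimal Python sketch below computes the Mantel–Haenszel common odds ratio and the associated ETS delta-scale index; it is not drawn from the review itself, and the function name and the example counts are hypothetical.

```python
# Minimal sketch of the Mantel-Haenszel DIF statistic: examinees are matched
# on total score, and each stratum k contributes a 2x2 table of group
# (reference/focal) by item response (correct/incorrect).
import numpy as np

def mantel_haenszel_dif(ref_correct, ref_wrong, foc_correct, foc_wrong):
    """Return the MH common odds ratio and the ETS delta-scale DIF index.

    Each argument is an array of counts indexed by matching stratum
    (e.g., total test score).
    """
    A = np.asarray(ref_correct, dtype=float)  # reference group, correct
    B = np.asarray(ref_wrong, dtype=float)    # reference group, incorrect
    C = np.asarray(foc_correct, dtype=float)  # focal group, correct
    D = np.asarray(foc_wrong, dtype=float)    # focal group, incorrect
    T = A + B + C + D                         # stratum totals
    alpha_mh = np.sum(A * D / T) / np.sum(B * C / T)  # common odds ratio
    delta_mh = -2.35 * np.log(alpha_mh)               # ETS delta metric
    return alpha_mh, delta_mh

# Hypothetical counts over five score strata:
alpha, delta = mantel_haenszel_dif(
    ref_correct=[20, 35, 50, 60, 70], ref_wrong=[30, 25, 20, 15, 10],
    foc_correct=[15, 30, 45, 55, 68], foc_wrong=[35, 30, 25, 20, 12])
print(f"MH odds ratio = {alpha:.3f}, MH D-DIF = {delta:.3f}")
```

Values of the delta index far from zero (negative values indicating DIF against the focal group) flag the item for closer review.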


  • Article Type: Review Article