Abstract

Model diagnostics and forecast evaluation are closely related tasks, with the former concerning in-sample goodness (or lack) of fit and the latter addressing predictive performance out-of-sample. We review the ubiquitous setting in which forecasts are cast in the form of quantiles or quantile-bounded prediction intervals. We distinguish unconditional calibration, which corresponds to classical coverage criteria, from the stronger notion of conditional calibration, which can be visualized in quantile reliability diagrams. Consistent scoring functions, including but not limited to the widely used asymmetric piecewise linear score (pinball loss), provide for comparative assessment and ranking, and link to the coefficient of determination and skill scores. We illustrate the use of these tools on Engel's food expenditure data, the Global Energy Forecasting Competition 2014, and the US COVID-19 Forecast Hub.
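
As a rough illustration of the quantities named in the abstract, the following minimal sketch computes the pinball loss, an empirical coverage check for unconditional calibration, and a skill score relative to a reference forecast. It assumes the common convention S(x, y) = (1{y ≤ x} − α)(x − y) for a level-α quantile forecast x and outcome y; some authors scale this by a factor of 2. The function names are hypothetical and are not taken from the article or its replication package.

```python
from typing import Sequence


def pinball_loss(forecast: float, outcome: float, alpha: float) -> float:
    """Asymmetric piecewise linear score for a level-alpha quantile forecast.

    Assumed convention: S(x, y) = (1{y <= x} - alpha) * (x - y).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    indicator = 1.0 if outcome <= forecast else 0.0
    return (indicator - alpha) * (forecast - outcome)


def mean_pinball_loss(
    forecasts: Sequence[float], outcomes: Sequence[float], alpha: float
) -> float:
    """Average pinball loss over paired forecasts and outcomes."""
    if len(forecasts) != len(outcomes) or len(forecasts) == 0:
        raise ValueError("need non-empty sequences of equal length")
    total = sum(pinball_loss(x, y, alpha) for x, y in zip(forecasts, outcomes))
    return total / len(forecasts)


def empirical_coverage(
    forecasts: Sequence[float], outcomes: Sequence[float]
) -> float:
    """Share of outcomes at or below the quantile forecast.

    For continuous outcomes, unconditional calibration at level alpha
    suggests a value close to alpha.
    """
    if len(forecasts) != len(outcomes) or len(forecasts) == 0:
        raise ValueError("need non-empty sequences of equal length")
    hits = sum(1 for x, y in zip(forecasts, outcomes) if y <= x)
    return hits / len(forecasts)


def skill_score(model_score: float, reference_score: float) -> float:
    """Relative improvement over a reference forecast; 1 is a perfect score."""
    if reference_score <= 0.0:
        raise ValueError("reference score must be positive")
    return 1.0 - model_score / reference_score
```

A coverage value near α speaks only to unconditional calibration. Conditional calibration is the stronger requirement and needs a tool such as a quantile reliability diagram, which this sketch does not attempt.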

2023-03-09
2024-06-15
  • Article Type: Review Article