Model Diagnostics and Forecast Evaluation for Quantiles

Tilmann Gneiting; Daniel Wolffram; Johannes Resin; Kristof Kraus; Johannes Bracher; Timo Dimitriadis; Veit Hagenmeyer; Alexander I. Jordan; Sebastian Lerch; Kaleb Phipps; Melanie Schienle

doi:10.1146/annurev-statistics-032921-020240

Annual Review of Statistics and Its Application

Volume 10, 2023

Review Article

Open Access


Tilmann Gneiting¹,², Daniel Wolffram¹,³, Johannes Resin¹,², Kristof Kraus¹,², Johannes Bracher¹,³, Timo Dimitriadis¹,⁴, Veit Hagenmeyer⁵, Alexander I. Jordan¹, Sebastian Lerch¹,³, Kaleb Phipps⁵, and Melanie Schienle¹,³

Affiliations:
¹Computational Statistics Group, Heidelberg Institute for Theoretical Studies (HITS), Heidelberg, Germany
²Institute for Stochastics, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
³Department of Statistical Methods and Econometrics, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
⁴Alfred Weber Institute of Economics, Heidelberg University, Heidelberg, Germany
⁵Institute for Automation and Applied Informatics, Karlsruhe Institute of Technology (KIT), Eggenstein-Leopoldshafen, Germany
Vol. 10:597-621 (Volume publication date March 2023) https://doi.org/10.1146/annurev-statistics-032921-020240
First published as a Review in Advance on November 01, 2022
Copyright © 2023 by the author(s).

This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See credit lines of images or other third-party material in this article for license information.

Abstract

Model diagnostics and forecast evaluation are closely related tasks, with the former concerning in-sample goodness (or lack) of fit and the latter addressing predictive performance out-of-sample. We review the ubiquitous setting in which forecasts are cast in the form of quantiles or quantile-bounded prediction intervals. We distinguish unconditional calibration, which corresponds to classical coverage criteria, from the stronger notion of conditional calibration, as can be visualized in quantile reliability diagrams. Consistent scoring functions (including, but not limited to, the widely used asymmetric piecewise linear score or pinball loss) provide for comparative assessment and ranking, and link to the coefficient of determination and skill scores. We illustrate the use of these tools on Engel's food expenditure data, the Global Energy Forecasting Competition 2014, and the US COVID-19 Forecast Hub.

Keywords: calibration, coverage plot, Murphy diagram, quantile regression, reliability diagram, scoring function
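
To make two of the notions named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the standard form of the asymmetric piecewise linear score (pinball loss) for a forecast x of the α-quantile and an outcome y, namely S_α(x, y) = (1{y ≤ x} − α)(x − y), and it checks unconditional calibration in the coverage sense by counting how often the outcome falls at or below the forecast. The function names and the toy data are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def pinball_loss(forecast: float, outcome: float, alpha: float) -> float:
    """Asymmetric piecewise linear score for a forecast of the alpha-quantile.

    Assumed form: (1{outcome <= forecast} - alpha) * (forecast - outcome).
    Smaller values are better; the score is nonnegative.
    """
    indicator = 1.0 if outcome <= forecast else 0.0
    return (indicator - alpha) * (forecast - outcome)


def mean_pinball_loss(
    forecasts: Sequence[float], outcomes: Sequence[float], alpha: float
) -> float:
    """Average pinball loss over paired forecasts and outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and equally long")
    total = sum(pinball_loss(x, y, alpha) for x, y in zip(forecasts, outcomes))
    return total / len(forecasts)


def empirical_coverage(
    forecasts: Sequence[float], outcomes: Sequence[float]
) -> float:
    """Fraction of outcomes at or below the quantile forecast.

    For an unconditionally calibrated alpha-quantile forecast this
    fraction should be close to alpha.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and equally long")
    hits = sum(1 for x, y in zip(forecasts, outcomes) if y <= x)
    return hits / len(forecasts)


# Toy data (hypothetical): two constant forecasts of the 0.75-quantile.
outcomes = [1.0, 2.0, 3.0, 4.0]
forecast_a = [3.0] * 4
forecast_b = [2.0] * 4

print(mean_pinball_loss(forecast_a, outcomes, 0.75))  # 0.375
print(mean_pinball_loss(forecast_b, outcomes, 0.75))  # 0.625
print(empirical_coverage(forecast_a, outcomes))       # 0.75
print(empirical_coverage(forecast_b, outcomes))       # 0.5
```

On this toy sample the forecast with coverage equal to the nominal level 0.75 also attains the lower mean score. In general the two criteria need not agree, which is one reason to consider calibration checks and score-based comparison side by side.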


