Abstract

We discuss inference after data exploration, with a particular focus on inference after model or variable selection. We review three popular approaches to this problem: sample splitting, simultaneous inference, and conditional selective inference. We explain how each approach works and highlight its advantages and disadvantages. We also provide an illustration of these post-selection inference approaches.
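As a concrete illustration of the first of these approaches, the sketch below (our addition, not code from the article) applies sample splitting to synthetic data. The data-generating process, variable names, and the marginal-correlation selection rule are all illustrative assumptions; the point is only the template: select on one half of the data, then do standard inference on the other.

```python
# Sketch of sample splitting for post-selection inference.
# The data and the screening rule are assumptions for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = 0.5 * X[:, 3] + rng.standard_normal(n)  # only one variable truly matters

# Split: first half for selection, second half for inference.
X1, y1 = X[: n // 2], y[: n // 2]
X2, y2 = X[n // 2 :], y[n // 2 :]

# Selection step: pick the variable most correlated with y on half 1.
corr = np.abs(X1.T @ (y1 - y1.mean()))
j = int(np.argmax(corr))

# Inference step: simple regression of y on the selected variable,
# using only the independent half 2.
x = X2[:, j]
design = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(design, y2, rcond=None)
resid = y2 - design @ beta
df = len(y2) - 2
sigma2 = resid @ resid / df
se = np.sqrt(sigma2 * np.linalg.inv(design.T @ design)[1, 1])
tcrit = stats.t.ppf(0.975, df)
lo, hi = beta[1] - tcrit * se, beta[1] + tcrit * se
print(f"selected x{j}; 95% CI for its slope: ({lo:.3f}, {hi:.3f})")
```

Because the inference half is independent of the selection, the usual t-interval for the selected slope retains its nominal coverage no matter how aggressively the first half was explored. The cost is that only half the sample contributes to the interval, which is one reason simultaneous inference and conditional selective inference are attractive alternatives.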
