
Abstract

We discuss inference after data exploration, with a particular focus on inference after model or variable selection. We review three popular approaches to this problem: sample splitting, simultaneous inference, and conditional selective inference. We explain how each approach works and highlight its advantages and disadvantages. We also provide an illustration of these post-selection inference approaches.
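Of the three approaches the abstract lists, sample splitting is the simplest to illustrate: select a model on one half of the data, then run classical inference on the held-out half, so the selection event is independent of the inference sample. The sketch below is a minimal illustration under assumed simulated data and an assumed marginal-correlation selection rule (threshold 0.5 is arbitrary); it is not the authors' procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: only the first 2 of 10 predictors matter.
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)

# Split the sample: one half for selection, the other for inference.
half = n // 2
X_sel, y_sel = X[:half], y[:half]
X_inf, y_inf = X[half:], y[half:]

# Selection step: keep predictors with large marginal correlation.
# (Any selection rule may be used; the split is what restores validity.)
corr = np.abs(X_sel.T @ y_sel) / half
selected = np.flatnonzero(corr > 0.5)

# Inference step: ordinary least squares on the held-out half.
# Classical intervals are valid here because selection never saw
# these observations.
Xs = X_inf[:, selected]
beta, *_ = np.linalg.lstsq(Xs, y_inf, rcond=None)
resid = y_inf - Xs @ beta
df = Xs.shape[0] - Xs.shape[1]
sigma2 = resid @ resid / df
cov = sigma2 * np.linalg.inv(Xs.T @ Xs)
se = np.sqrt(np.diag(cov))

# 95% normal-approximation intervals (use t quantiles for small df).
ci = np.column_stack([beta - 1.96 * se, beta + 1.96 * se])
print("selected columns:", selected)
print("confidence intervals:\n", ci.round(2))
```

The price of this validity is efficiency: only half the observations inform the final intervals, which is one of the trade-offs against simultaneous and conditional selective inference that the review examines.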

DOI: 10.1146/annurev-statistics-100421-044639
Published: 2022-03-07

Article Type: Review Article