High-Dimensional Data Bootstrap

Victor Chernozhukov; Denis Chetverikov; Kengo Kato; Yuta Koike

doi:10.1146/annurev-statistics-040120-022239

Annual Review of Statistics and Its Application

Volume 10, 2023

Review Article

Open Access

Victor Chernozhukov¹, Denis Chetverikov², Kengo Kato³, and Yuta Koike⁴

Affiliations:
¹Department of Economics and Center for Statistics and Data Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; email: [email protected]
²Department of Economics, University of California, Los Angeles, California, USA; email: [email protected]
³Department of Statistics and Data Science, Cornell University, Ithaca, New York, USA; email: [email protected]
⁴Mathematics and Informatics Center and Graduate School of Mathematical Sciences, The University of Tokyo, Tokyo, Japan; email: [email protected]
Vol. 10:427-449 (Volume publication date March 2023) https://doi.org/10.1146/annurev-statistics-040120-022239
Copyright © 2023 by the author(s).

This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See credit lines of images or other third-party material in this article for license information.

Abstract

This article reviews recent progress in high-dimensional bootstrap. We first review high-dimensional central limit theorems for distributions of sample mean vectors over the rectangles, bootstrap consistency results in high dimensions, and key techniques used to establish those results. We then review selected applications of high-dimensional bootstrap: construction of simultaneous confidence sets for high-dimensional vector parameters, multiple hypothesis testing via step-down, postselection inference, intersection bounds for partially identified parameters, and inference on best policies in policy evaluation. Finally, we also comment on a couple of future research directions.

Keywords: empirical bootstrap, high-dimensional central limit theorem, multiple testing, multiplier bootstrap, simultaneous inference
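
To give a concrete picture of the multiplier bootstrap and simultaneous confidence sets named above, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' exact procedure. The function name `multiplier_bootstrap_band` and its parameters are hypothetical. It assumes i.i.d. rows, every coordinate with nonzero sample variance, and standard normal multipliers. It approximates the distribution of the studentized max statistic by reweighting the centered observations, then uses the bootstrap quantile as a common critical value for all coordinates.

```python
import numpy as np


def multiplier_bootstrap_band(X, alpha=0.05, n_boot=1000, seed=0):
    """Simultaneous confidence intervals for the mean vector of X (n x p).

    Hypothetical helper: Gaussian multiplier bootstrap for the studentized
    max statistic max_j |sqrt(n) * (mean_j - mu_j)| / sd_j.
    Assumes every column of X has nonzero sample standard deviation.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    Z = (X - mean) / sd  # centered, studentized observations, shape (n, p)

    # Each bootstrap draw uses i.i.d. N(0, 1) multipliers e_1, ..., e_n and
    # computes T* = max_j | n^{-1/2} * sum_i e_i * Z_ij |.
    E = rng.standard_normal((n_boot, n))
    T_star = np.abs(E @ Z / np.sqrt(n)).max(axis=1)

    # The (1 - alpha) bootstrap quantile is one critical value shared by all
    # p coordinates, which is what makes the intervals simultaneous.
    crit = np.quantile(T_star, 1 - alpha)
    half_width = crit * sd / np.sqrt(n)
    return mean - half_width, mean + half_width, crit
```

A call such as `lower, upper, crit = multiplier_bootstrap_band(X)` on an `n x p` array returns one interval per coordinate. The critical value grows slowly with `p`, so it exceeds the usual pointwise normal quantile once many coordinates are covered at once.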
