High-Dimensional Statistics with a View Toward Applications in Biology

Peter Bühlmann; Markus Kalisch; Lukas Meier

doi:10.1146/annurev-statistics-022513-115545

Annual Review of Statistics and Its Application

Volume 1, 2014

Review Article

Affiliations: Seminar for Statistics, ETH Zürich, CH-8092 Zürich, Switzerland
Vol. 1:255-278 (Volume publication date January 2014) https://doi.org/10.1146/annurev-statistics-022513-115545
© Annual Reviews

Abstract

We review statistical methods for high-dimensional data analysis and pay particular attention to recent developments for assessing uncertainties in terms of controlling false positive statements (type I error) and p-values. The main focus is on regression models, but we also discuss graphical modeling and causal inference based on observational data. We illustrate the concepts and methods with various packages from the statistical software R using a high-throughput genomic data set about riboflavin production with Bacillus subtilis, which we make publicly available for the first time.

Keywords: causal inference, graphical modeling, multiple testing, penalized estimation, regression

Supplemental Material

- Supplemental Text (PDF)
- Data sets (CSV): riboflavin, riboflavingrouped, riboflavingrouped_structure, riboflavinv100
