We review statistical methods for high-dimensional data analysis and pay particular attention to recent developments for assessing uncertainties in terms of controlling false positive statements (type I error) and p-values. The main focus is on regression models, but we also discuss graphical modeling and causal inference based on observational data. We illustrate the concepts and methods with various packages from the statistical software R, using a high-throughput genomic data set about riboflavin production with Bacillus subtilis, which we make publicly available for the first time.
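One of the multiple-testing ideas underlying the error control discussed above is the false discovery rate procedure of Benjamini & Hochberg (1995), cited below. As a minimal, generic sketch (in Python rather than the R packages the review itself uses), the step-up procedure can be written as follows; the function name and example p-values are illustrative, not from the article:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Rejects the hypotheses with the k smallest p-values, where k is the
    largest rank with p_(k) <= (k/m) * q; this controls the false
    discovery rate at level q for independent (or positively dependent)
    tests. Returns a boolean rejection mask in the original order.
    """
    m = len(pvals)
    # indices of p-values sorted in increasing order
    order = sorted(range(m), key=lambda i: pvals[i])
    # step-up: find the LARGEST rank whose p-value is under its threshold
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# hypothetical p-values from m = 10 tests
pvals = [0.001, 0.008, 0.039, 0.041, 0.042,
         0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, q=0.05))  # rejects the two smallest
```

At level q = 0.05 only the two smallest p-values fall below their rank-dependent thresholds (0.005 and 0.010), so two hypotheses are rejected; a plain Bonferroni correction at 0.05/10 = 0.005 would reject only one, which is the usual motivation for FDR-based procedures in large-scale testing.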




Literature Cited

  1. Banerjee O, El Ghaoui L, d'Aspremont A. 2008. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9:485–516
  2. Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57:289–300
  3. Benjamini Y, Yekutieli D. 2001. The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29:1165–88
  4. Bühlmann P. 2013. Statistical significance in high-dimensional linear models. Bernoulli 19:1212–42
  5. Bühlmann P, Kalisch M, Maathuis M. 2010. Variable selection in high-dimensional linear models: partially faithful distributions and the PC-simple algorithm. Biometrika 97:261–78
  6. Bühlmann P, Mandozzi J. 2013. High-dimensional variable screening and bias in subsequent inference, with an empirical comparison. Comput. Stat. In press. doi: 10.1007/s00180-013-0436-3
  7. Bühlmann P, van de Geer S. 2011. Statistics for High-Dimensional Data: Methods, Theory and Applications. Heidelberg, Ger.: Springer-Verlag
  8. Candès E, Tao T. 2007. The Dantzig selector: statistical estimation when p is much larger than n. Ann. Stat. 35:2313–51
  9. Chickering D. 2002. Optimal structure identification with greedy search. J. Mach. Learn. Res. 3:507–54
  10. Colombo D, Maathuis M, Kalisch M, Richardson T. 2012. Learning high-dimensional directed acyclic graphs with latent and selection variables. Ann. Stat. 40:294–321
  11. Fan J, Li R. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96:1348–60
  12. Fan J, Lv J. 2008. Sure independence screening for ultra-high dimensional feature space. J. R. Stat. Soc. Ser. B 70:849–911
  13. Fellinghauer B, Bühlmann P, Ryffel M, von Rhein M, Reinhardt J. 2013. Stable graphical model estimation with random forests for discrete, continuous, and mixed variables. Comput. Stat. Data Anal. 64:132–52
  14. Friedman J, Hastie T, Tibshirani R. 2007. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9:432–41
  15. Friedman J, Hastie T, Tibshirani R. 2010. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33:1–22
  16. Friedman J, Hastie T, Tibshirani R. 2011. Glasso: graphical lasso—estimation of Gaussian graphical models. R Package Version 1.7
  17. Gasser T, Kneip A, Köhler W. 1991. A flexible and fast method for automatic smoothing. J. Am. Stat. Assoc. 86:643–52
  18. Gautier L, Cope L, Bolstad B, Irizarry R. 2004. Affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20:307–15
  19. Genovese C, Jin J, Wasserman L, Yao Z. 2012. A comparison of the lasso and marginal regression. J. Mach. Learn. Res. 13:2107–43
  20. Hastie T, Tibshirani R, Friedman J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer. 2nd ed.
  21. He Q, Lin D-Y. 2011. A variable selection method for genome-wide association studies. Bioinformatics 27:1–8
  22. Holm S. 1979. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6:65–70
  23. Kalisch M, Bühlmann P. 2007. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J. Mach. Learn. Res. 8:613–36
  24. Kalisch M, Mächler M, Colombo D, Maathuis MH, Bühlmann P. 2012. Causal inference using graphical models with the R package pcalg. J. Stat. Softw. 47(11):1–26
  25. Lauritzen S. 1996. Graphical Models. Oxford: Oxford Univ. Press
  26. Lee J-M, Zhang S, Saha S, Anna SS, Jiang C, Perkins J. 2001. RNA expression analysis using an antisense Bacillus subtilis genome array. J. Bacteriol. 183:7371–80
  27. Liu H, Han F, Yuan M, Lafferty J, Wasserman L. 2012. High-dimensional semiparametric Gaussian copula graphical models. Ann. Stat. 40:2293–326
  28. Liu H, Roeder K, Wasserman L. 2010. Stability approach to regularization selection (StARS) for high dimensional graphical models. In Advances in Neural Information Processing Systems 23, ed. J Lafferty, CKI Williams, J Shawe-Taylor, RS Zemel, A Culotta, pp. 1432–40. Red Hook, NY: Curran Assoc.
  29. Maathuis M, Colombo D, Kalisch M, Bühlmann P. 2010. Predicting causal effects in large-scale systems from observational data. Nat. Methods 7:247–48
  30. Maathuis M, Kalisch M, Bühlmann P. 2009. Estimating high-dimensional intervention effects from observational data. Ann. Stat. 37:3133–64
  31. McCullagh P, Nelder J. 1989. Generalized Linear Models. London: Chapman & Hall. 2nd ed.
  32. Meier L. 2013. Hdi: high-dimensional inference. R Package Version 0.0-1/r2. http://hdi.r-forge.r-project.org
  33. Meinshausen N. 2007. Relaxed Lasso. Comput. Stat. Data Anal. 52:374–93
  34. Meinshausen N. 2008. Hierarchical testing of variable importance. Biometrika 95:265–78
  35. Meinshausen N, Bühlmann P. 2006. High-dimensional graphs and variable selection with the Lasso. Ann. Stat. 34:1436–62
  36. Meinshausen N, Bühlmann P. 2010. Stability selection. J. R. Stat. Soc. Ser. B 72:417–73
  37. Meinshausen N, Maathuis M, Bühlmann P. 2011. Asymptotic optimality of the Westfall-Young permutation procedure for multiple testing under dependence. Ann. Stat. 39:3369–91
  38. Meinshausen N, Meier L, Bühlmann P. 2009. P-values for high-dimensional regression. J. Am. Stat. Assoc. 104:1671–81
  39. Mooij J, Janzing D, Heskes T, Schölkopf B. 2011. On causal discovery with cyclic additive noise models. In Advances in Neural Information Processing Systems 24, ed. J Shawe-Taylor, RS Zemel, P Bartlett, F Pereira, KQ Weinberger, pp. 639–47. Red Hook, NY: Curran Assoc.
  40. Pearl J. 2000. Causality: Models, Reasoning, and Inference. Cambridge, UK: Cambridge Univ. Press
  41. Pinheiro J, Bates D. 2000. Mixed-Effects Models in S and S-PLUS. New York: Springer
  42. Pollard KS, Gilbert HN, Ge Y, Taylor S, Dudoit S. 2012. Multtest: resampling-based multiple hypothesis testing. R Package Version 2.14.0
  43. R Development Core Team. 2012. R: A Language and Environment for Statistical Computing. Vienna: R Found. Stat. Comput.
  44. Richardson T. 1996. A discovery algorithm for directed cyclic graphs. In Proc. 12th Conf. Uncertain. Artif. Intell., ed. E Horvitz, F Jensen, pp. 454–61. San Francisco: Morgan Kaufmann
  45. Roeder K, Wasserman L. 2009. Genome-wide significance levels and weighted hypothesis testing. Stat. Sci. 24:398–413
  46. Schelldorfer J. 2011. Lmmlasso: linear mixed-effects models with Lasso. R Package Version 0.1-2
  47. Schelldorfer J, Bühlmann P, van de Geer S. 2011. Estimation for high-dimensional linear mixed-effects models using ℓ1-penalization. Scand. J. Stat. 38:197–214
  48. Schelldorfer J, Meier L, Bühlmann P. 2013. GLMMLasso: an algorithm for high-dimensional generalized linear mixed models using ℓ1-penalization. J. Comput. Graph. Stat. In press. doi: 10.1080/10618600.2013.773239
  49. Shah R, Samworth R. 2013. Variable selection with error control: another look at stability selection. J. R. Stat. Soc. Ser. B 75:55–80
  50. Spirtes P. 1995. Directed cyclic graphical representations of feedback models. In Proc. 11th Conf. Uncertain. Artif. Intell., ed. P Besnard, S Hanks, pp. 491–99. San Francisco: Morgan Kaufmann
  51. Spirtes P, Glymour C, Scheines R. 2000. Causation, Prediction, and Search. Cambridge, MA: MIT Press. 2nd ed.
  52. Stekhoven D, Moraes I, Sveinbjörnsson G, Hennig L, Maathuis M, Bühlmann P. 2012. Causal stability ranking. Bioinformatics 28:2819–23
  53. Sun T, Zhang C-H. 2012. Scaled sparse linear regression. Biometrika 99:879–98
  54. Theußl S, Zeileis A. 2009. Collaborative software development using R-Forge. R J. 1:9–14
  55. Tibshirani R. 1996. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58:267–88
  56. van de Geer S. 2008. High-dimensional generalized linear models and the Lasso. Ann. Stat. 36:614–45
  57. van de Geer S, Bühlmann P. 2009. On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3:1360–92
  58. van de Geer S, Bühlmann P. 2013. ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. Ann. Stat. 41:536–67
  59. van de Geer S, Bühlmann P, Ritov Y. 2013. On asymptotically optimal confidence regions and tests for high-dimensional models. arXiv:1303.0518 [math.ST]
  60. van de Geer S, Bühlmann P, Zhou S. 2011. The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso). Electron. J. Stat. 5:688–749
  61. Wasserman L, Roeder K. 2009. High-dimensional variable selection. Ann. Stat. 37:2178–201
  62. Westfall P, Young S. 1989. P value adjustments for multiple tests in multivariate binomial models. J. Am. Stat. Assoc. 84:780–86
  63. Xue L, Zou H. 2012. Regularized rank-based estimation of high-dimensional nonparanormal graphical models. Ann. Stat. 40:2541–71
  64. Zamboni N, Fischer E, Muffler A, Wyss M, Hohmann H-P, Sauer U. 2005. Transient expression and flux changes during a shift from high to low riboflavin production in continuous cultures of Bacillus subtilis. Biotechnol. Bioeng. 89:219–32
  65. Zhang C-H. 2010. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38:894–942
  66. Zhang C-H, Zhang S. 2013. Confidence intervals for low dimensional parameters with high dimensional data. J. R. Stat. Soc. Ser. B. In press. doi: 10.1111/rssb.12026
  67. Zhao P, Yu B. 2006. On model selection consistency of Lasso. J. Mach. Learn. Res. 7:2541–63
  68. Zhao T, Liu H, Roeder K, Lafferty J, Wasserman L. 2012. The huge package for high-dimensional undirected graph estimation in R. J. Mach. Learn. Res. 13:1059–62
  69. Zou H. 2006. The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. 101:1418–29
  70. Zou H, Li R. 2008. One-step sparse estimates in nonconcave penalized likelihood models. Ann. Stat. 36:1509–33; discussion 1534–66


  • Article Type: Review Article