Microbiome, Metagenomics, and High-Dimensional Compositional Data Analysis

Hongzhe Li

doi:10.1146/annurev-statistics-010814-020351

Annual Review of Statistics and Its Application

Volume 2, 2015

Review Article

Free

Affiliations: Department of Biostatistics and Epidemiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania 19104; email: [email protected]
Vol. 2:73–94 (Volume publication date May 2015) https://doi.org/10.1146/annurev-statistics-010814-020351
© Annual Reviews

Abstract

The human microbiome is the totality of all microbes in and on the human body, and its importance in health and disease has been increasingly recognized. High-throughput sequencing technologies have recently enabled scientists to obtain an unbiased quantification of all microbes constituting the microbiome. Often, a single sample can produce hundreds of millions of short sequencing reads. However, unique characteristics of the data produced by the new technologies, as well as the sheer magnitude of these data, make drawing valid biological inferences from microbiome studies difficult. Analysis of these big data poses great statistical and computational challenges. Important issues include normalization and quantification of relative taxa, bacterial genes, and metabolic abundances; incorporation of phylogenetic information into analysis of metagenomics data; and multivariate analysis of high-dimensional compositional data. We review existing methods, point out their limitations, and outline future research directions.

Keyword(s): differential abundance analysis, network, next-generation sequencing, phylogenetics, simplex data
