Analysis of Microbiome Data

Christine B. Peterson; Satabdi Saha; Kim-Anh Do

doi:10.1146/annurev-statistics-040522-120734

Annual Review of Statistics and Its Application

Volume 11, 2024

Review Article

Open Access


Affiliations: Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
Vol. 11:483-504 (Volume publication date April 2024) https://doi.org/10.1146/annurev-statistics-040522-120734
First published as a Review in Advance on October 13, 2023
Copyright © 2024 by the author(s).

This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See credit lines of images or other third-party material in this article for license information.

Abstract

The microbiome represents a hidden world of tiny organisms populating not only our surroundings but also our own bodies. By enabling comprehensive profiling of these invisible creatures, modern genomic sequencing tools have given us an unprecedented ability to characterize these populations and uncover their outsize impact on our environment and health. Statistical analysis of microbiome data is critical to infer patterns from the observed abundances. The application and development of analytical methods in this area require careful consideration of the unique aspects of microbiome profiles. We begin this review with a brief overview of microbiome data collection and processing and describe the resulting data structure. We then provide an overview of statistical methods for key tasks in microbiome data analysis, including data visualization, comparison of microbial abundance across groups, regression modeling, and network inference. We conclude with a discussion and highlight interesting future directions.

Keywords: compositional data, differential abundance, network inference, ordination, regression modeling, zero inflation


