The human microbiome comprises all microbes in and on the human body, and its importance in health and disease has been increasingly recognized. High-throughput sequencing technologies have recently enabled scientists to obtain an unbiased quantification of all microbes constituting the microbiome. A single sample can often produce hundreds of millions of short sequencing reads. However, the unique characteristics of the data produced by these technologies, together with their sheer magnitude, make it difficult to draw valid biological inferences from microbiome studies, and their analysis poses great statistical and computational challenges. Important issues include normalization and quantification of the relative abundances of taxa, bacterial genes, and metabolic pathways; incorporation of phylogenetic information into the analysis of metagenomics data; and multivariate analysis of high-dimensional compositional data. We review existing methods, point out their limitations, and outline future research directions.





  • Article Type: Review Article