Estimating the Number of Species in Microbial Diversity Studies

John Bunge; Amy Willis; Fiona Walsh

doi:10.1146/annurev-statistics-022513-115654

Annual Review of Statistics and Its Application

Volume 1, 2014

Review Article

John Bunge¹, Amy Willis¹, and Fiona Walsh²

Affiliations:
¹Department of Statistical Science, Cornell University, Ithaca, New York 14853
²Federal Department of Economic Affairs, Education and Research EAER, Research Station Agroscope Changins-Wädenswil ACW, Bacteriology, 8820 Wädenswil, Switzerland
Vol. 1:427–445 (Volume publication date January 2014) https://doi.org/10.1146/annurev-statistics-022513-115654
First published as a Review in Advance on August 30, 2013
© Annual Reviews

Abstract

For decades, statisticians have studied the species problem: how to estimate the total number of species, observed plus unobserved, in a population. This problem dates at least as far back as 1943, to a paper by R.A. Fisher. Methods for solving it have found many applications in general ecology, and their importance has grown considerably in recent years with the introduction of high-throughput DNA sequencing into microbial ecology. We examine the state of the art in estimating the total number of taxa in a microbial population from a sample of sequences. We focus mainly on estimating the number of species within a single population (α-diversity), but we also briefly consider statistical inference for comparing the numbers of species across populations (β-diversity). We discuss the full range of statistical techniques, parametric and nonparametric as well as frequentist and Bayesian, and the specific implications of their use in microbial diversity studies. We conclude with some recommendations for theoretical investigation and computational tool development.

Keywords: mixed Poisson, number of classes, sample coverage, species richness, zero truncation, α-diversity
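
As a rough illustration of the species problem described in the abstract, the sketch below computes two standard nonparametric quantities from a vector of per-taxon counts: Good's sample-coverage estimate, 1 - f1/n, and the lower-bound richness estimate usually called Chao1. Here f1 and f2 denote the numbers of taxa seen exactly once and exactly twice, and n is the total number of sequences. The function names and the example counts are made up for illustration; this is not the authors' software or a method the review specifically recommends.

```python
from collections import Counter


def frequency_counts(abundances):
    """Map j -> f_j, the number of taxa observed exactly j times (j >= 1)."""
    return Counter(a for a in abundances if a > 0)


def good_coverage(abundances):
    """Good's sample-coverage estimate: 1 - f_1 / n."""
    f = frequency_counts(abundances)
    n = sum(abundances)
    return 1.0 - f[1] / n


def chao1(abundances):
    """Chao1 lower-bound estimate of total (observed + unobserved) richness."""
    f = frequency_counts(abundances)
    observed = sum(f.values())
    f1, f2 = f[1], f[2]
    if f2 > 0:
        return observed + f1 * f1 / (2.0 * f2)
    # Bias-corrected form, used here when there are no doubletons.
    return observed + f1 * (f1 - 1) / 2.0


# Hypothetical sample: 9 observed taxa, 27 sequences in total.
counts = [10, 6, 3, 2, 2, 1, 1, 1, 1]
print(chao1(counts))          # 9 + 4**2 / (2 * 2) = 13.0
print(good_coverage(counts))  # 1 - 4/27
```

Both quantities depend only on the low-frequency counts, so they are sensitive to sequencing errors that inflate the number of singletons.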

Literature Cited

Allen HK, Bunge J, Foster JA, Bayles DO, Stanton TB. 2013. Estimation of viral richness from shotgun metagenomes using a frequency count approach. Microbiome 1:5
Amann R, Fuchs BM, Behrens S. 2001. The identification of microorganisms by fluorescence in situ hybridisation. Curr. Opin. Biotechnol. 12:231–36
Barger K, Bunge J. 2011. Objective Bayesian estimation for the number of species. J. Bayesian Anal. 5:765–86
Bäuerle N, Grübel R. 2005. Multivariate counting processes: copulas and beyond. ASTIN Bull. 35:379–408
Berger JO, Bernardo JM, Sun D. 2012. Objective priors for discrete parameter spaces. J. Am. Stat. Assoc. 107:636–48
Bhat S, Sproat R. 2009. Knowing the unseen: estimating vocabulary size over unseen samples. Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP: Proceedings of the Conference 1109–117 Stroudsburg, PA: World Sci. Publ.
Böhning D, Kuhnert R. 2006. Equivalence of truncated count mixture distributions and mixtures of truncated count distributions. Biometrics 62:1207–15
Böhning D, Kuhnert R. 2009. CAMCR: Computer-Assisted Mixture model analysis for Capture-Recapture count data. AStA Adv. Stat. Anal. 93:61–71
Böhning D, Schön D. 2005. Nonparametric maximum likelihood estimation of population size based on the counting distribution. J. R. Stat. Soc. C 54:721–37
Bunge J. 2013. A survey of software for fitting capture-recapture models. WIREs Comput. Stat. 5:114–20
Bunge J, Barger K. 2008. Parametric models for estimating the number of classes. Biom. J. 50:971–82
Bunge J, Böhning D, Allen H, Foster JA. 2012a. Estimating population diversity with unreliable low frequency counts. Biocomputing 2012: Proceedings of the Pacific Symposium 203–12 Hackensack, NJ: World Sci. Publ.
Bunge J, Fitzpatrick M. 1993. Estimating the number of species—a review. J. Am. Stat. Assoc. 88:364–73
Bunge J, Woodard L, Böhning D, Foster JA, Connolly S, Allen HK. 2012b. Estimating population diversity with CatchAll. Bioinformatics 28:1045–47
Chao A. 1987. Estimating the population size for capture-recapture data with unequal catchability. Biometrics 43:783–91
Chao A. 2005. Species richness estimation. Encyclopedia of Statistical Sciences S Kotz, N Balakrishnan, CB Read, B Vidakovic 7907–16 New York: Wiley, 2nd ed.
Chao A, Chazdon RL, Colwell RK, Shen T-J. 2006. Abundance-based similarity indices and their estimation when there are unseen species in samples. Biometrics 62:361–71
Chao A, Shen T-J. 2003–2005. Program SPADE (Species Prediction and Diversity Estimation). Stat. Softw. Program and user's guide at http://chao.stat.nthu.edu.tw
Dray S, Chessel D, Thioulouse J. 2003. Co-inertia analysis and the linking of ecological data tables. Ecology 84:3078–89
Favaro S, Lijoi A, Pruenster I. 2012. A new estimator of the discovery probability. Biometrics 68:1188–96
Fisher RA, Corbet S, Williams CB. 1943. The relation between the number of species and the number of individuals in a random sample of an animal population. J. Anim. Ecol. 12:42–58
Foster JA, Bunge J, Gilbert JA, Moore JH. 2012. Measuring the microbiome: perspectives on advances in DNA-based techniques for exploring microbial life. Brief. Bioinforma. 13:420–29
Gao F. 2013. Moderate deviations for a nonparametric estimator of sample coverage. Ann. Stat. 41:641–69
Gilbert JA, O'Dor R, King N, Vogel TM. 2011. The importance of metagenomic surveys to microbial ecology: or why Darwin would have been a metagenomic scientist. Microb. Inform. Exp. 1:5
Good IJ. 1953. The population frequencies of species and the estimation of population parameters. Biometrika 40:237–64
Greenacre M. 2007. Correspondence Analysis in Practice London: Chapman & Hall/CRC, 2nd ed.
Hampton J, Lladser ME. 2012. Estimation of distribution overlap of urn models. PLoS ONE 7:e42368
Holmes I, Harris K, Quince C. 2012. Dirichlet multinomial mixtures: generative models for microbial metagenomics. PLoS ONE 7:e31026
Johnson NL, Kemp AW, Kotz S. 2005. Univariate Discrete Distributions Hoboken, NJ: Wiley
Karlis D, Meligkotsidou L. 2007. Finite mixtures of multivariate Poisson distributions with application. J. Stat. Plan. Inference 137:1942–60
Koenker R, Mizera I. 2013. Convex optimization in R. J. Stat. Softw. In press
Lewis K. 2009. Persisters, biofilms, and the problem of cultivability. Uncultivated Microorganisms S Epstein 181–94 Berlin: Springer-Verlag
Lladser ME, Gouet R, Reeder J. 2011. Extrapolation of urn models via Poissonization: accurate measurements of the microbial unknown. PLoS ONE 6:e21105
Logares R, Haverkamp TH, Kumar S, Lanzén A, Nederbragt AJ, et al. 2012. Environmental microbiology through the lens of high-throughput DNA sequencing: synopsis of current platforms and bioinformatics approaches. J. Microbiol. Methods 91:106–13
Lozupone C, Lladser ME, Knights D, Stombaugh J, Knight R. 2011. UniFrac: an effective distance metric for microbial community comparison. ISME J. 5:169–72
Mao CX, Colwell RK. 2005. Estimation of species richness: mixture models, the role of rare species, and inferential challenges. Ecology 86:1143–53
Mao CX, Lindsay BG. 2007. Estimating the number of classes. Ann. Stat. 35:917–30
Miravete E. 2009. Multivariate Sarmanov count data models. CEPR Discuss. Pap. 7463, Cent. Econ. Policy Res., London
Ohannessian MI, Dahleh MA. 2012. Large alphabets: finite, infinite, and scaling models. Presented at the 46th Annu. Conf. Inf. Sci. Syst. (CISS), Inst. Electr. Electron. Eng., Princeton, NJ
Pan HY, Chao A, Foissner W. 2009. A nonparametric lower bound for the number of species shared by multiple communities. J. Agric. Biol. Environ. Stat. 14:452–68
Puri PS, Goldie CM. 1979. Poisson mixtures and quasi-infinite divisibility of distributions. J. Appl. Probab. 16:138–53
Quince C, Curtis TP, Sloan WT. 2008. The rational exploration of microbial diversity. ISME J. 2:997–1006
Quince C, Lanzen A, Davenport RJ, Turnbaugh PJ. 2011. Removing noise from pyrosequenced amplicons. BMC Bioinforma. 12:38–55
Rocchetti I, Bunge J, Böhning D. 2011. Population size estimation based upon ratios of recapture probabilities. Ann. Appl. Stat. 5:1512–33
Sears CL. 2005. A dynamic partnership: celebrating our gut flora. Anaerobe 11:247–51
Staley JT, Konopka A. 1985. Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annu. Rev. Microbiol. 39:321–46
Steinebach J, Eastwood VR. 1997. Detecting changes in a multivariate renewal process. Metrika 46:1–19
Tardella L. 2002. A new Bayesian method for nonparametric capture-recapture models in presence of heterogeneity. Biometrika 89:807–17
Tripathi RC, Gurland J. 1977. A general family of discrete distributions with hypergeometric probabilities. J. R. Stat. Soc. B 39:349–56
Tuomisto H. 2011. Commentary: Do we have a consistent terminology for species diversity? Yes, if we choose to use it. Oecologia 167:903–11
Valero J, Pérez-Casany M, Ginebra J. 2010. On zero-truncating and mixing Poisson distributions. Adv. Appl. Probab. 42:1013–27
Valiant G, Valiant P. 2011. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. Proceedings of the 43rd Annual ACM Symposium on Theory of Computing L Fortnow, SP Vadhan 685–94 New York: ACM
Wang J-P. 2010. Estimating species richness by a Poisson-compound gamma model. Biometrika 97:727–40
Whitman WB, Coleman DC, Wiebe WJ. 1998. Prokaryotes: the unseen majority. Proc. Natl. Acad. Sci. USA 95:6578–83
Williamson M, Gaston KJ. 2005. The lognormal distribution is not an appropriate null hypothesis for the species-abundance distribution. J. Anim. Ecol. 74:409–22
Zhang Z. 2012. Entropy estimation in Turing's perspective. Neural Comput. 24:1368–89
Zhang Z, Zhou J. 2010. Re-parameterization of multinomial distributions and diversity indices. J. Stat. Plan. Inference 140:1731–38
