
Abstract

For decades, statisticians have studied the species problem: how to estimate the total number of species, observed plus unobserved, in a population. This problem dates at least as far back as 1943, to a paper by R.A. Fisher. These methods have found many applications in general ecology, but their importance has grown considerably in recent years, driven by the introduction of high-throughput DNA sequencing into microbial ecology. We examine the state of the art in terms of estimating the total number of taxa in a microbial population from a sample of sequences. We focus mainly on estimating the number of species within a single population (α-diversity), but we also briefly consider statistical inference for comparing the numbers of species across populations (β-diversity). We discuss the full range of statistical techniques, parametric and nonparametric as well as frequentist and Bayesian, and specific implications of their use in microbial diversity studies. We conclude with some recommendations for theoretical investigation and computational tool development.

/content/journals/10.1146/annurev-statistics-022513-115654
2014-01-03
2024-04-24

Literature Cited

  1. Allen HK, Bunge J, Foster JA, Bayles DO, Stanton TB. 2013. Estimation of viral richness from shotgun metagenomes using a frequency count approach. Microbiome 1:5
  2. Amann R, Fuchs BM, Behrens S. 2001. The identification of microorganisms by fluorescence in situ hybridisation. Curr. Opin. Biotechnol. 12:231–36
  3. Barger K, Bunge J. 2011. Objective Bayesian estimation for the number of species. J. Bayesian Anal. 5:765–86
  4. Bäuerle N, Grübel R. 2005. Multivariate counting processes: copulas and beyond. ASTIN Bull. 35:379–408
  5. Berger JO, Bernardo JM, Sun D. 2012. Objective priors for discrete parameter spaces. J. Am. Stat. Assoc. 107:636–48
  6. Bhat S, Sproat R. 2009. Knowing the unseen: estimating vocabulary size over unseen samples. Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP: Proceedings of the Conference 1109–117 Stroudsburg, PA: World Sci. Publ.
  7. Böhning D, Kuhnert R. 2006. Equivalence of truncated count mixture distributions and mixtures of truncated count distributions. Biometrics 62:1207–15
  8. Böhning D, Kuhnert R. 2009. CAMCR: Computer-Assisted Mixture model analysis for Capture-Recapture count data. AStA Adv. Stat. Anal. 93:61–71
  9. Böhning D, Schön D. 2005. Nonparametric maximum likelihood estimation of population size based on the counting distribution. J. R. Stat. Soc. C 54:721–37
  10. Bunge J. 2013. A survey of software for fitting capture-recapture models. WIREs Comput. Stat. 5:114–20
  11. Bunge J, Barger K. 2008. Parametric models for estimating the number of classes. Biom. J. 50:971–82
  12. Bunge J, Böhning D, Allen H, Foster JA. 2012a. Estimating population diversity with unreliable low frequency counts. In Biocomputing 2012: Proceedings of the Pacific Symposium, pp. 203–12. Hackensack, NJ: World Sci. Publ.
  13. Bunge J, Fitzpatrick M. 1993. Estimating the number of species: a review. J. Am. Stat. Assoc. 88:364–73
  14. Bunge J, Woodard L, Böhning D, Foster JA, Connolly S, Allen HK. 2012b. Estimating population diversity with CatchAll. Bioinformatics 28:1045–47
  15. Chao A. 1987. Estimating the population size for capture-recapture data with unequal catchability. Biometrics 43:783–91
  16. Chao A. 2005. Species richness estimation. In Encyclopedia of Statistical Sciences, ed. S Kotz, N Balakrishnan, CB Read, B Vidakovic, pp. 7907–16. New York: Wiley. 2nd ed.
  17. Chao A, Chazdon RL, Colwell RK, Shen T-J. 2006. Abundance-based similarity indices and their estimation when there are unseen species in samples. Biometrics 62:361–71
  18. Chao A, Shen T-J. 2003–2005. Program SPADE (Species Prediction and Diversity Estimation). Stat. Softw. Program and user's guide at http://chao.stat.nthu.edu.tw
  19. Dray S, Chessel D, Thioulouse J. 2003. Co-inertia analysis and the linking of ecological data tables. Ecology 84:3078–89
  20. Favaro S, Lijoi A, Pruenster I. 2012. A new estimator of the discovery probability. Biometrics 68:1188–96
  21. Fisher RA, Corbet S, Williams CB. 1943. The relation between the number of species and the number of individuals in a random sample of an animal population. J. Anim. Ecol. 12:42–58
  22. Foster JA, Bunge J, Gilbert JA, Moore JH. 2012. Measuring the microbiome: perspectives on advances in DNA-based techniques for exploring microbial life. Brief. Bioinforma. 13:420–29
  23. Gao F. 2013. Moderate deviations for a nonparametric estimator of sample coverage. Ann. Stat. 41:641–69
  24. Gilbert JA, O'Dor R, King N, Vogel TM. 2011. The importance of metagenomic surveys to microbial ecology: or why Darwin would have been a metagenomic scientist. Microb. Inform. Exp. 1:5
  25. Good IJ. 1953. The population frequencies of species and the estimation of population parameters. Biometrika 40:237–64
  26. Greenacre M. 2007. Correspondence Analysis in Practice. London: Chapman & Hall/CRC. 2nd ed.
  27. Hampton J, Lladser ME. 2012. Estimation of distribution overlap of urn models. PLoS ONE 7:e42368
  28. Holmes I, Harris K, Quince C. 2012. Dirichlet multinomial mixtures: generative models for microbial metagenomics. PLoS ONE 7:e31026
  29. Johnson NL, Kemp AW, Kotz S. 2005. Univariate Discrete Distributions. Hoboken, NJ: Wiley
  30. Karlis D, Meligkotsidou L. 2007. Finite mixtures of multivariate Poisson distributions with application. J. Stat. Plan. Inference 137:1942–60
  31. Koenker R, Mizera I. 2013. Convex optimization in R. J. Stat. Softw. In press
  32. Lewis K. 2009. Persisters, biofilms, and the problem of cultivability. In Uncultivated Microorganisms, ed. S Epstein, pp. 181–94. Berlin: Springer-Verlag
  33. Lladser ME, Gouet R, Reeder J. 2011. Extrapolation of urn models via Poissonization: accurate measurements of the microbial unknown. PLoS ONE 6:e21105
  34. Logares R, Haverkamp TH, Kumar S, Lanzén A, Nederbragt AJ, et al. 2012. Environmental microbiology through the lens of high-throughput DNA sequencing: synopsis of current platforms and bioinformatics approaches. J. Microbiol. Methods 91:106–13
  35. Lozupone C, Lladser ME, Knights D, Stombaugh J, Knight R. 2011. UniFrac: an effective distance metric for microbial community comparison. ISME J. 5:169–72
  36. Mao CX, Colwell RK. 2005. Estimation of species richness: mixture models, the role of rare species, and inferential challenges. Ecology 86:1143–53
  37. Mao CX, Lindsay BG. 2007. Estimating the number of classes. Ann. Stat. 35:917–30
  38. Miravete E. 2009. Multivariate Sarmanov count data models. CEPR Discuss. Pap. 7463, Cent. Econ. Policy Res., London
  39. Ohannessian MI, Dahleh MA. 2012. Large alphabets: finite, infinite, and scaling models. Presented at the 46th Annu. Conf. Inf. Sci. Syst. (CISS), Inst. Electr. Electron. Eng., Princeton, NJ
  40. Pan HY, Chao A, Foissner W. 2009. A nonparametric lower bound for the number of species shared by multiple communities. J. Agric. Biol. Environ. Stat. 14:452–68
  41. Puri PS, Goldie CM. 1979. Poisson mixtures and quasi-infinite divisibility of distributions. J. Appl. Probab. 16:138–53
  42. Quince C, Curtis TP, Sloan WT. 2008. The rational exploration of microbial diversity. ISME J. 2:997–1006
  43. Quince C, Lanzen A, Davenport RJ, Turnbaugh PJ. 2011. Removing noise from pyrosequenced amplicons. BMC Bioinforma. 12:38–55
  44. Rocchetti I, Bunge J, Böhning D. 2011. Population size estimation based upon ratios of recapture probabilities. Ann. Appl. Stat. 5:1512–33
  45. Sears CL. 2005. A dynamic partnership: celebrating our gut flora. Anaerobe 11:247–51
  46. Staley JT, Konopka A. 1985. Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annu. Rev. Microbiol. 39:321–46
  47. Steinebach J, Eastwood VR. 1997. Detecting changes in a multivariate renewal process. Metrika 46:1–19
  48. Tardella L. 2002. A new Bayesian method for nonparametric capture-recapture models in presence of heterogeneity. Biometrika 89:807–17
  49. Tripathi RC, Gurland J. 1977. A general family of discrete distributions with hypergeometric probabilities. J. R. Stat. Soc. B 39:349–56
  50. Tuomisto H. 2011. Commentary: Do we have a consistent terminology for species diversity? Yes, if we choose to use it. Oecologia 167:903–11
  51. Valero J, Pérez-Casany M, Ginebra J. 2010. On zero-truncating and mixing Poisson distributions. Adv. Appl. Probab. 42:1013–27
  52. Valiant G, Valiant P. 2011. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, ed. L Fortnow, SP Vadhan, pp. 685–94. New York: ACM
  53. Wang J-P. 2010. Estimating species richness by a Poisson-compound gamma model. Biometrika 97:727–40
  54. Whitman WB, Coleman DC, Wiebe WJ. 1998. Prokaryotes: the unseen majority. Proc. Natl. Acad. Sci. USA 95:6578–83
  55. Williamson M, Gaston KJ. 2005. The lognormal distribution is not an appropriate null hypothesis for the species-abundance distribution. J. Anim. Ecol. 74:409–22
  56. Zhang Z. 2012. Entropy estimation in Turing's perspective. Neural Comput. 24:1368–89
  57. Zhang Z, Zhou J. 2010. Re-parameterization of multinomial distributions and diversity indices. J. Stat. Plan. Inference 140:1731–38
  • Article Type: Review Article