
Abstract

High-throughput sequencing technologies have evolved at a stellar pace for almost a decade and have greatly advanced our understanding of genome biology. In these sampling-based technologies, an important detail is often overlooked in both the analysis of the data and the design of the experiments: the sampled observations often do not give a representative picture of the underlying population. This has long been recognized as a problem in statistical ecology and in the broader statistics literature. In this review, we discuss the connections between these fields, methodological advances that parallel both the needs and opportunities of large-scale data analysis, and specific applications in modern biology. In the process, we describe unique aspects of applying these approaches to sequencing technologies, including sequencing error, population and individual heterogeneity, and the design of experiments.


/content/journals/10.1146/annurev-biodatasci-072018-021339
2019-07-20
2024-06-15

  • Article Type: Review Article