Abstract

Genome and metagenome comparisons based on large amounts of next-generation sequencing (NGS) data pose significant challenges for alignment-based approaches due to the huge data size and the relatively short length of the reads. Alignment-free approaches based on the counts of word patterns in NGS data do not depend on the complete genome and are generally computationally efficient. Thus, they contribute significantly to genome and metagenome comparison. Recently, novel statistical approaches have been developed for the comparison of both long and shotgun sequences. These approaches have been applied to many problems, including the comparison of gene regulatory regions, genome sequences, metagenomes, binning contigs in metagenomic data, identification of virus–host interactions, and detection of horizontal gene transfers. We provide an updated review of these applications and other related developments of word count–based approaches for alignment-free sequence analysis.
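As a rough illustration of what a word count–based comparison can look like, the sketch below counts overlapping words (k-mers) directly from short reads, without assembly or alignment, and compares two samples with the simple uncentered D2 statistic (the sum, over all words, of the product of the two counts). The function names `kmer_counts` and `d2`, the toy reads, and the choice k = 2 are assumptions made for illustration only; this is not the specific set of statistics developed in the reviewed work, which includes more refined, background-adjusted variants.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count overlapping words of length k, skipping words with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts

def pooled_counts(reads, k):
    """Pool word counts over a collection of reads (hypothetical helper)."""
    total = Counter()
    for read in reads:
        total.update(kmer_counts(read, k))
    return total

def d2(counts_x, counts_y):
    """Uncentered D2 statistic: sum over words of count_x(w) * count_y(w)."""
    return sum(c * counts_y[w] for w, c in counts_x.items() if w in counts_y)

# Toy data: two tiny "samples" of short reads.
reads_a = ["ACGTAC", "GTACGT"]
reads_b = ["ACGTTT", "TTACGA"]
k = 2

counts_a = pooled_counts(reads_a, k)
counts_b = pooled_counts(reads_b, k)
score = d2(counts_a, counts_b)
print(score)
```

Because the counts are pooled over reads, the comparison needs neither a complete genome nor an alignment step, which is the property the abstract highlights. In practice the raw statistic is usually normalized or centered against a background model before it is used as a dissimilarity.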

/content/journals/10.1146/annurev-biodatasci-080917-013431
2018-07-20
2024-12-08
  • Article Type: Review Article