Alignment of Next-Generation Sequencing Reads

Knut Reinert; Ben Langmead; David Weese; Dirk J. Evers

doi:10.1146/annurev-genom-090413-025358

Annual Review of Genomics and Human Genetics

Volume 16, 2015

Review Article

Free

Alignment of Next-Generation Sequencing Reads

Knut Reinert¹, Ben Langmead², David Weese¹, and Dirk J. Evers³
View Affiliations Hide Affiliations

Affiliations: ¹Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany; email: [email protected], [email protected] ²Department of Computer Science and Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland 21218; email: [email protected] ³Molecular Health GmbH, 69115 Heidelberg, Germany; email: [email protected]
Vol. 16:133-151 (Volume publication date August 2015) https://doi.org/10.1146/annurev-genom-090413-025358
First published as a Review in Advance on May 04, 2015
© Annual Reviews

Abstract

High-throughput DNA sequencing has considerably changed the possibilities for conducting biomedical research by measuring billions of short DNA or RNA fragments. A central computational problem, and for many applications a first step, consists of determining where the fragments came from in the original genome. In this article, we review the main techniques for generating the fragments, the main applications, and the main algorithmic ideas for computing a solution to the read alignment problem. In addition, we describe pitfalls and difficulties connected to determining the correct positions of reads.

Keyword(s): high-throughput sequencing, read mapping, string indices

Article metrics loading...

/content/journals/10.1146/annurev-genom-090413-025358

2015-08-24

2024-04-18

Full text loading...

/deliver/fulltext/genom/16/1/annurev-genom-090413-025358.html?itemId=/content/journals/10.1146/annurev-genom-090413-025358&mimeType=html&fmt=ahah

Literature Cited

Abouelhoda MI, Kurtz S, Ohlebusch E. 1. 2004. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2:53–86 [Google Scholar]
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 2. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403–10 [Google Scholar]
Ames SK, Hysom DA, Gardner SN, Lloyd GS, Gokhale MB, Allen JE. 3. 2013. Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics 29:2253–60 [Google Scholar]
Baeza-Yates RA, Navarro G. 4. 1999. Faster approximate string matching. Algorithmica 23:127–58 [Google Scholar]
Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M. 5. et al. 2013. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2:10 [Google Scholar]
Brady A, Salzberg SL. 6. 2009. Phymm and Phymmbl: metagenomic phylogenetic classification with interpolated Markov models. Nat. Methods 6:673–76 [Google Scholar]
Brandon R, Myers GW, Sutton GG, Delcher AL, Dew IM. 7. et al. 2000. A whole-genome assembly of Drosophila. Science 287:2196–204 [Google Scholar]
Burkhardt S, Crauser A, Ferragina P, Lenhof H-P, Rivals E, Vingron M. 8. 1999. q-gram based database searching using a suffix array (Quasar). RECOMB '99: Proceedings of the Third Annual International Conference on Computational Molecular Biology77–83 New York: Am. Chem. Soc.
Burrows M, Wheeler DJ. 9. 1994. A block-sorting lossless data compression algorithm Res. Rep. 124, Syst. Res. Cent., Palo Alto, CA
Daniel R. 10. 2005. The metagenomics of soil. Nat. Rev. Microbiol. 3:470–78 [Google Scholar]
David M, Dzamba M, Lister D, Ilie L, Brudno M. 11. 2011. SHRiMP2: sensitive yet practical short read mapping. Bioinformatics 27:1011–12 [Google Scholar]
Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E. 12. et al. 2009. Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics 25:3207–12 [Google Scholar]
Dewey FE, Chen R, Cordero SP, Ormond KE, Caleshu C. 13. et al. 2011. Phased whole-genome genetic risk in a family quartet using a major allele reference sequence. PLOS Genet. 7:e1002280 [Google Scholar]
Dohm JC, Lottaz C, Borodina T, Himmelbauer H. 14. 2008. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 36:e105 [Google Scholar]
Farrar M. 15. 2007. Striped Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics 23:156–61 [Google Scholar]
Fedoroff NV. 16. 2012. Transposable elements, epigenetics, and genome evolution. Science 338:758–67 [Google Scholar]
Ferragina P, Manzini G. 17. 2000. Opportunistic data structures with applications. Proceedings of the 41st Annual Symposium on Foundations of Computer Science, 12–14 November 2000, Redondo Beach, California390–98 Los Alamitos, CA: IEEE Comput. Soc.
Ferragina P, Mishra B. 18. 2014. Algorithms in stringomics (I): pattern-matching against “stringomes.” bioRxiv. doi: 10.1101/001669
Gan X, Stegle O, Behr J, Steffen JG, Drewe P. 19. et al. 2011. Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature 477:419–23 [Google Scholar]
Ghanayim A, Geiger D. 20. 2013. Iterative referencing for improving the interpretation of DNA sequence data Tech. Rep., Technion, Haifa, Isr.
Gilad Y, Pritchard JK, Thornton K. 21. 2009. Characterizing natural variation using next-generation sequencing technologies. Trends Genet. 25:463–71 [Google Scholar]
Glenn TC. 22. 2011. Field guide to next-generation DNA sequencers. Mol. Ecol. Resour. 11:759–69 [Google Scholar]
Hach F, Hormozdiari F, Alkan C, Hormozdiari F, Birol I. 23. et al. 2010. mrsFAST: a cache-oblivious algorithm for short-read mapping. Nat. Methods 7:576–77 [Google Scholar]
Hach F, Sarrafi I, Hormozdiari F, Alkan C, Eichler EE, Sahinalp SC. 24. 2014. mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications. Nucleic Acids Res. 42:W494–500 [Google Scholar]
Hoffmann S, Otto C, Kurtz S, Sharma CM, Khaitovich P. 25. et al. 2009. Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLOS Comput. Biol. 5:e1000502 [Google Scholar]
26. Hum. Microbiome Proj. Consort 2012. Structure, function and diversity of the healthy human microbiome. Nature 486:207–14 [Google Scholar]
Huson DH, Auch AF, Qi J, Schuster SC. 27. 2007. MEGAN analysis of metagenomic data. Genome Res. 17:377–86 [Google Scholar]
Jiang H, Wong WH. 28. 2008. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics 24:2395–96 [Google Scholar]
Karlin S, Altschul SF. 29. 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. PNAS 87:2264–68 [Google Scholar]
Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. 30. 2011. Adaptive seeds tame genomic sequence comparison. Genome Res. 21:487–93 [Google Scholar]
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC. 31. et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921 [Google Scholar]
Langmead B, Salzberg S. 32. 2012. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9:357–59 [Google Scholar]
Langmead B, Trapnell C, Pop M, Salzberg SL. 33. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10:R25 [Google Scholar]
Li H, Durbin R. 34. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–60 [Google Scholar]
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J. 35. et al. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–79 [Google Scholar]
Li H, Ruan J, Durbin R. 36. 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18:1851–58 [Google Scholar]
Li R, Li Y, Kristiansen K, Wang J. 37. 2008. SOAP: short oligonucleotide alignment program. Bioinformatics 24:713–14 [Google Scholar]
Li R, Yu C, Li Y, Lam T-W, Yiu S-M. 38. et al. 2009. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25:1966–67 [Google Scholar]
Liu B, Gibbons T, Ghodsi M, Treangen T, Pop M. 39. 2011. Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics 12:Suppl. 2S4 [Google Scholar]
Lunter G, Goodson M. 40. 2011. Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 21:936–39 [Google Scholar]
Mäkinen V, Navarro G, Sirén J, Välimäki N. 41. 2009. Storage and retrieval of individual genomes. Research in Computational Molecular Biology: 13th Annual International Conference, RECOMB 2009, Tucson, AZ, USA, May 18–21, 2009, Proceedings S Batzoglou 121–37 Lect. Notes Bioinform. 5541 Berlin: Springer [Google Scholar]
Manber U, Myers EW. 42. 1990. Suffix arrays: a new method for on-line string searches. Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms319–27 Philadelphia: Soc. Ind. Appl. Math.
Marcus S, Lee H, Schatz MC. 43. 2014. SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics 30:3476–83 [Google Scholar]
Margulies M, Egholm M, Altman WE, Attiya S, Bader JS. 44. et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437:376–80 [Google Scholar]
Montgomery SB, Goode DL, Kvikstad E, Albers CA, Zhang ZD. 45. et al. 2013. The origin, evolution, and functional impact of short insertion–deletion variants identified in 179 human genomes. Genome Res. 23:749–61 [Google Scholar]
Myers EW. 46. 1986. An O(ND) difference algorithm and its variations. Algorithmica 1:251–66 [Google Scholar]
Myers EW. 47. 1999. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46:395–415 [Google Scholar]
Nagarajan N, Pop M. 48. 2013. Sequence assembly demystified. Nat. Rev. Genet. 14:157–67 [Google Scholar]
Pagh R, Rodler FF. 49. 2001. Cuckoo hashing. Algorithms: ESA 2001 FM auf der Heide 121–33 Lect. Notes Comput. Sci. 2161 Berlin: Springer [Google Scholar]
Rasmussen KR, Stoye J, Myers EW. 50. 2006. Efficient q-gram filters for finding all ε-matches over a given length. J. Comput. Biol 13:296–308 [Google Scholar]
Rosenfeld JA, Mason CE, Smith TM. 51. 2012. Limitations of the human reference genome for personalized genomics. PLOS ONE 7:e40294 [Google Scholar]
Rozowsky J, Abyzov A, Wang J, Alves P, Raha D. 52. et al. 2011. AlleleSeq: analysis of allele-specific expression and binding in a network framework. Mol. Syst. Biol. 7:522 [Google Scholar]
Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, Brudno M. 53. 2009. SHRiMP: accurate mapping of short color-space reads. PLOS Comput. Biol. 5:e1000386 [Google Scholar]
Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S. 54. et al. 2007. The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLOS Biol. 5:e77 [Google Scholar]
Satya RV, Zavaljevski N, Reifman J. 55. 2012. A new strategy to reduce allelic bias in RNA-Seq readmapping. Nucleic Acids Res. 40:e127 [Google Scholar]
Schatz MC, Delcher AL, Salzberg SL. 56. 2010. Assembly of large genomes using second-generation sequencing. Genome Res. 20:1165–73 [Google Scholar]
Schnable PS, Ware D, Fulton RS, Stein JC, Wei F. 57. et al. 2009. The B73 maize genome: complexity, diversity, and dynamics. Science 326:1112–15 [Google Scholar]
Schneeberger K, Hagmann J, Ossowski S, Warthmann N, Gesing S. 58. et al. 2009. Simultaneous alignment of short reads against multiple genomes. Genome Biol. 10:R98 [Google Scholar]
Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. 59. 2012. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat. Methods 9:811–14 [Google Scholar]
Siragusa E, Weese D, Reinert K. 60. 2013. Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Res. 41:e78 [Google Scholar]
Smith TF, Waterman MS. 61. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195–97 [Google Scholar]
Stein L. 62. 2010. The case for cloud computing in genome informatics. Genome Biol. 11:207 [Google Scholar]
Treangen TJ, Salzberg SL. 63. 2011. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 13:36–46 [Google Scholar]
Tyekucheva S, Yolken RH, McCombie WR, Parla J, Kramer M. 64. et al. 2011. Establishing the baseline level of repetitive element expression in the human cortex. BMC Genomics 12:495 [Google Scholar]
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ. 65. et al. 2001. The sequence of the human genome. Science 291:1304–51 [Google Scholar]
Weese D, Emde A, Rausch T, Döring A, Reinert K. 66. 2009. RazerS—fast read mapping with sensitivity control. Genome Res. 19:1646–54 [Google Scholar]
Weese D, Holtgrewe M, Reinert K. 67. 2012. RazerS 3: faster, fully sensitive read mapping. Bioinformatics 28:2592–99 [Google Scholar]
Weiner P. 68. 1973. Linear pattern matching algorithms. Proceedings of the 14th Annual Symposium on Switching and Automata Theory1–11 Northridge, CA: IEEE Comput. Soc.
Witherspoon D, Xing J, Zhang Y, Watkins WS, Batzer M, Jorde L. 69. 2010. Mobile element scanning (ME-Scan) by targeted high-throughput sequencing. BMC Genomics 11:410 [Google Scholar]
Wood D, Salzberg S. 70. 2014. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15:R46 [Google Scholar]
Wu TD, Nacu S. 71. 2010. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26:873–81 [Google Scholar]

/content/journals/10.1146/annurev-genom-090413-025358

Alignment of Next-Generation Sequencing Reads

Annual Review of Genomics and Human Genetics 16, 133 (2015); https://doi.org/10.1146/annurev-genom-090413-025358

/content/journals/10.1146/annurev-genom-090413-025358

Data & Media loading...

Article Type: Review Article

Most Cited Most Cited RSS feed

- Next-Generation DNA Sequencing Methods
  
  Elaine R. Mardis
  
  Vol. 9 (2008), pp. 387–402
- Apolipoprotein E: Far More Than a Lipid Transport Protein
  
  Robert W. Mahley, and Stanley C. Rall Jr.
  
  Vol. 1 (2000), pp. 507–537
- A NEW APPROACH TO DECODING LIFE: Systems Biology
  
  Trey Ideker, Timothy Galitski, and Leroy Hood
  
  Vol. 2 (2001), pp. 343–372
- The Ciliopathies: An Emerging Class of Human Genetic Disorders
  
  Jose L. Badano, Norimasa Mitsuma, Phil L. Beales, and Nicholas Katsanis
  
  Vol. 7 (2006), pp. 125–148
- Copy Number Variation in Human Health, Disease, and Evolution
  
  Feng Zhang, Wenli Gu, Matthew E. Hurles, and James R. Lupski
  
  Vol. 10 (2009), pp. 451–481
- Genotype Imputation
  
  Yun Li, Cristen Willer, Serena Sanna, and Gonçalo Abecasis
  
  Vol. 10 (2009), pp. 387–406
- MAMMALIAN CIRCADIAN BIOLOGY: Elucidating Genome-Wide Levels of Temporal Organization
  
  Phillip L. Lowrey, and Joseph S. Takahashi
  
  Vol. 5 (2004), pp. 407–441
- Predicting the Effects of Amino Acid Substitutions on Protein Function
  
  Pauline C. Ng, and Steven Henikoff
  
  Vol. 7 (2006), pp. 61–80
- The RASopathies
  
  Katherine A. Rauen
  
  Vol. 14 (2013), pp. 355–369
- The Toxicogenomic Multiverse: Convergent Recruitment of Proteins Into Animal Venoms
  
  Bryan G. Fry, Kim Roelants, Donald E. Champagne, Holger Scheib, Joel D.A. Tyndall, Glenn F. King, Timo J. Nevalainen, Janette A. Norman, Richard J. Lewis, Raymond S. Norton, Camila Renjifo, and Ricardo C. Rodríguez de la Vega
  
  Vol. 10 (2009), pp. 483–511
More Less

Annual Review of Genomics and Human Genetics

Volume 16, 2015

Review Article

Free

Alignment of Next-Generation Sequencing Reads

Abstract

Most Read This Month

Most Cited Most Cited RSS feed