Abstract

Deciphering the regulatory code of gene expression and interpreting the transcriptional effects of genome variation are critical challenges in human genetics. Modern experimental technologies have resulted in an abundance of data, enabling the development of sequence-based deep learning models that link patterns embedded in DNA to the biochemical and regulatory properties contributing to transcriptional regulation, including modeling epigenetic marks, 3D genome organization, and gene expression, with tissue and cell-type specificity. Such methods can predict the functional consequences of any noncoding variant in the human genome, even rare or never-before-observed variants, and systematically characterize their consequences beyond what is tractable from experiments or quantitative genetics studies alone. Recently, the development and application of interpretability approaches have led to the identification of key sequence patterns contributing to the predicted tasks, providing insights into the underlying biological mechanisms learned and revealing opportunities for improvement in future models.

/content/journals/10.1146/annurev-genom-021623-024727
2024-08-27
2025-04-18

  • Article Type: Review Article