Toward Identification of Functional Sequences and Variants in Noncoding DNA

Remo Monti; Uwe Ohler

doi:10.1146/annurev-biodatasci-122120-110102

Annual Review of Biomedical Data Science

Volume 6, 2023

Review Article

Open Access

Remo Monti¹,² and Uwe Ohler¹

Affiliations:
¹Max Delbrück Center for Molecular Medicine (MDC), Helmholtz Association of German Research Centers, Berlin Institute for Medical Systems Biology (BIMSB), Berlin, Germany; email: uwe.ohler@mdc-berlin.de
²Digital Health–Machine Learning, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
Vol. 6:191-210 (Volume publication date August 2023) https://doi.org/10.1146/annurev-biodatasci-122120-110102
First published as a Review in Advance on June 01, 2023
Copyright © 2023 by the author(s).

This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See credit lines of images or other third-party material in this article for license information.

Abstract

Understanding the noncoding part of the genome, which encodes gene regulation, is necessary to identify genetic mechanisms of disease and translate findings from genome-wide association studies into actionable results for treatments and personalized care. Here we provide an overview of the computational analysis of noncoding regions, starting from gene-regulatory mechanisms and their representation in data. Deep learning methods, when applied to these data, highlight important regulatory sequence elements and predict the functional effects of genetic variants. These and other algorithms are used to predict damaging sequence variants. Finally, we introduce rare-variant association tests that incorporate functional annotations and predictions in order to increase interpretability and statistical power.

Keywords: deep learning, enhancer, gene regulation, genome-wide association studies, machine learning, rare variants, sequence analysis, transcription, variant effect prediction, whole-genome sequencing

