Toward Identification of Functional Sequences and Variants in Noncoding DNA

Understanding the noncoding part of the genome, which encodes gene regulation, is necessary to identify genetic mechanisms of disease and translate findings from genome-wide association studies into actionable results for treatments and personalized care. Here we provide an overview of the computational analysis of noncoding regions, starting from gene-regulatory mechanisms and their representation in data. Deep learning methods, when applied to these data, highlight important regulatory sequence elements and predict the functional effects of genetic variants. These and other algorithms are used to predict damaging sequence variants. Finally, we introduce rare-variant association tests that incorporate functional annotations and predictions in order to increase interpretability and statistical power.


INTRODUCTION
Less than 2% of the human genome is coding-i.e., translated into mRNA and transcribed into protein or short peptides.The remaining 98% is non-(protein-)coding and contains many regions that fulfill structural and regulatory functions.Within two decades after the release of the human genome sequence, next-generation sequencing (NGS) has fundamentally changed the way we study both the coding and noncoding parts of genomes.Experiments that capture and determine the sequence of functional nucleotide elements have generated rich annotations of both the human genome and those of model organisms (1).
While evolutionary sequence conservation across species and the categorization of regions based on experimental readouts emerged early as important directions of genome research (2), few could have foreseen the transformative role of machine learning, and specifically deep learning, in the field.Driven by advances in computer vision and natural language processing (3), deep learning models have become ubiquitous in the study of the molecular functions encoded in the genome (4,5).
NGS is also driving progress in the study of human genetic variation.The falling costs of whole-genome sequencing (WGS) have made it possible to collect these data for hundreds of thousands of individuals (6,7).Recently, the UK Biobank has provided one of the largest and most widely accessible WGS datasets to date, genotyping about 150,000 individuals (6).These data contain hundreds of millions of predominantly rare genetic variants, many of which have never been observed before.
Understanding the role of these rare variants in health and disease relies on our ability to distinguish neutral variants from those that disrupt the function of the genomic sequences that contain them.While variant effect prediction (VEP) for coding regions is relatively well developed (8) and recent algorithms are good at identifying damaging nonsynonymous variants (9), the prediction and evaluation of variant effects for noncoding regions are lagging behind.However, understanding noncoding variation is critical for variant prioritization in sequencing-based genome-wide association studies (seqGWAS) (10,11) or clinical variant interpretation (12).
Here we present concepts and recent advances in the computational analysis of noncoding regions and short genetic variants therein.Most examples provided concern the regulation of transcription, although the concepts generalize to other mechanisms of gene regulation as well.Deep learning is covered as a method to extract important sequence features and predict the functional effects of genetic variants, but we defer to other articles for detailed descriptions of architectures and algorithms (4,5).We cover ways to evaluate functional effect predictions and use them to predict damaging genetic variants.Finally, we describe the incorporation of functional annotations into rare-variant association tests in seqGWAS.

General Properties and Classes of Regulatory Elements
While the noncoding genome contains many classes of sequences, such as repetitive, structural regions and transposable elements, the analysis of noncoding regions is largely concerned with generegulatory regions.This entails the localization of regulatory DNA or RNA sequences, identifying their function, and understanding how their function is determined by the cell machinery.Established (albeit simplistic) models place short regions of DNA/RNA into one or more functional classes, where single class instances are typically hundreds to a few thousands of nucleotides long.
Promoters, enhancers, and insulators [most prominently CCCTC-binding factor (CTCF) binding sites] are classes that regulate transcription (13,14).3 and 5 untranslated regions (UTRs) and splice-regulatory regions influence posttranscriptional processes, such as RNA processing, localization, stability, and translation, which are regulated by RNA-binding proteins and RNAs (e.g., microRNAs) (15).All of these classes constitute cis-regulatory elements (CREs), where "cis" indicates interaction with nearby genes (i.e., discrete, functional transcribed regions) encoded on the same chromosome.The functions of CREs are established by the sequence-specific binding of regulators, which recruit or direct catalytic macromolecular complexes such as RNA polymerase II (Pol-II) for transcription or the RISC (RNA-induced silencing complex) for RNA degradation.

Enhancers, Promoters, and the Cis-Regulatory Code
A promoter is the DNA sequence surrounding and immediately upstream of transcription start sites (TSSs).The core promoter is the region directly at the TSS that serves as a docking platform for Pol-II recruitment; the proximal promoter is the region directly upstream of the TSS.Promoters interact with enhancers in order to activate transcription (13) (Figure 1).The specificity of enhancers and promoters depends on the binding of selectively expressed transcription factors (TFs).A single TF can only bind compatible short sequence elements [∼8-20 base pairs (bp)] and might interact with other TFs, for example, through cooperative or competitive binding.Typical transcriptional regulatory regions contain several binding sites for the same or a small subset of TFs (16).Through these mechanisms, the underlying DNA sequence determines the conditions under which a regulatory region becomes active.
The genome-wide patterns of binding sites constitute the so-called cis-regulatory code.The logic behind this code imposes evolutionary constraints, leading to conservation of regulatory elements and TF binding preferences across millions of years of evolution (17).Noncoding patterns of sequence conservation differ from protein-coding conservation: Constraints may be placed not on the precise arrangement of binding sites, but rather on their overall presence and encoded binding affinity within a certain neighborhood.This often leads to binding site turnover, meaning that the function of nucleotides that trace back to the same ancestral nucleotide may differ, resulting in low colinear sequence similarity and the frequent emergence of new regulatory regions (18).

Loops and Topologically Associating Domains
On the linear genome sequence, enhancers can be found a few thousand (proximal) to hundreds of thousands (distal) base pairs away from the promoters with which they interact.Inside the nucleus, active enhancers are brought into physical proximity of promoters, and the frequency and duration of these enhancer-promoter contacts are correlated with the strength of transcription (activity by contact) (19,20).When enhancer-promoter contacts are established, the nuclear DNA forms loops, with loop formation in mammals regulated by CTCF (21).A gene may be regulated solely by its promoter (e.g., in the case of some housekeeping genes), or by dozens of enhancers that sometimes cluster together as so-called superenhancers [e.g., in the case of developmental master regulators (22)].In a specific condition, only a subset of enhancers will contact the promoter, and each enhancer encodes only part of the global expression pattern of a gene (e.g., its activity in one specific cell type).Large regions of the genome in which active sequences frequently contact each other are called topologically associating domains (23).These domains are separated by domain boundaries, across which contact frequencies are low, but which are dynamically shaped over the course of development (24,25).

The Histone Code and DNA Methylation
TFs compete with histones for DNA binding.Histones are multimeric proteins involved in the packing of DNA as part of DNA-protein complexes called nucleosomes (26).The complex of DNA together with nucleosomes is referred to as chromatin.DNA that is occupied by nucleosomes is not directly accessible to other DNA-binding proteins, which limits its regulatory activity.The functions of histones are tuned by posttranslational modifications.Histones flanking promoters and enhancers are often acetylated at lysine 27 (H3K27ac) (27), histones flanking promoters are methylated at lysine 4 (H3K4me3) (28,29), and H3K4me1 is often found at enhancers (30) (Figure 1).Other histone modifications indicate active repression (H3K27me3) or tight packing in inactive regions called heterochromatin (H3K9me3) (31,32).
The DNA itself can also be chemically modified.The methylation of GpG-dinucleotides has been linked to repression, while demethylation has been linked to activation (33).As methylated residues are prone to mutational changes, promoters at active regions contain a disproportionally large number of these dinucleotides (34).Both the chemical modification of histones and the DNA affect the recruitment of factors that activate or repress transcription.In recent years, a plethora of chemical modifications of RNA have been identified as well, reportedly influencing diverse processes ranging from processing to stability and translation (35).

Functional Genomics Assays and Data
NGS-based assays provide noisy readouts of the gene-regulatory processes and interactions mentioned above, typically genome-or transcriptome-wide.These assays biochemically enrich for functional nucleotide fragments, which are then sequenced (36)(37)(38)(39)(40)(41)(42).The raw sequencing data constitute millions of so-called reads, i.e., sequences of typically a few hundred nucleotides (composed of letters A, C, G, and T) corresponding to the enriched fragments.The reads are mapped back to the reference genome in order to determine their source of origin.Specialized experimental techniques further allow assigning reads to single cells (43,44).For simplicity, this review focuses on bulk experiments that lack this level of resolution.
After mapping to the reference genome, the number of reads originating from all positions is quantified.These per-position counts, also called coverage, are visualized as histogram-like tracks along the genome (Figure 1c).Statistical methods are used to determine regions with significantly elevated read counts (peaks) (48), which can indicate, for example, accessible regions (in the case of DNase-seq or ATAC-seq) or regions bound by specific proteins (ChIP-seq).Other postprocessing steps might include converting discrete counts into relative enrichment over background, smoothing, and corrections for sequence biases (49).
Specialized methods such as chromatin conformation capture (3C), 4C (circular 3C), and Hi-C (high-throughput 3C) measure contact frequencies between DNA sequences in the nucleus (40,42).The readouts from these experiments are 2D and symmetric, with both axes corresponding to locations on the reference.These contact-count matrices are visualized as heatmaps.Again, postprocessing steps are typically applied for visualization and statistical testing (e.g., down-weighting expected short-range contacts).
Large consortia have made efforts to systematically annotate the genome of model organisms and humans by collecting functional genomics data across many tissues and time-points in development.By combining the signals from many experiments, these efforts have identified millions of candidate CREs (1).Numerous statistical methods have been developed to integrate experiments and define genome-wide chromatin states (i.e., locations with stereotypical combinations of histone modifications, accessibility, or methylation) corresponding to the classes of regulatory elements introduced above (e.g., enhancer-like states) (50,51).
Reporter assays are used to validate functions of regulatory regions in vivo (in native context within a living organisms) or in vitro (in isolation from regular biological context, e.g., in a test tube).Typically, this requires cloning or barcoding fragments of interest next to a reporter gene whose activity can then be measured (e.g., by sequencing or fluorescence imaging) and compared across the fragments assayed.For example, reporter assays have been used to identify single developmental enhancers in mouse embryos (17).Massively parallel reporter assays measure the activities of many elements at once (52,53).STARR-seq (self-transcribing active regulatory region sequencing) is an in vivo assay that inserts enhancer fragments in gene bodies and hence uses the RNA-seq readout directly as evidence for functionality (54).Reporter assays that include mutagenesis measure the effects of genetic perturbations on regulatory elements (55).However, typically only a limited set of regulatory elements and cell types can be investigated by a single experiment.Readouts from reporter assays have been used to train models for tissue-specific enhancer prediction (56) or to train and validate sequence models (57), as introduced in the next sections.

HUMAN GENETIC VARIATION IN HEALTH AND DISEASE
Genetic variants naturally occur throughout the genome.They are defined by their location relative to the reference genome and the reference and alternative sequence (i.e., the reference and alternative alleles).The more common sequence is defined as the major allele, and the less common sequence as the minor allele.The average human genome contains over four million genetic variants, primarily consisting of single-nucleotide polymorphisms (i.e., variants that change a single base) and short insertions or deletions.The vast majority of these variants (>99%) lie in noncoding regions, including roughly 300,000 variants in possible enhancers (58).
On the population level, the majority (>90%) of variants are rare, with minor allele frequencies below 0.1%, and over 40% are only observed in single individuals at current sample sizes (singletons) (6, 7).However, on the level of the individual, common variants outnumber rare variants by a large margin (>95% have minor allele frequencies above 0.5%) (58).Most are inconsequential, but some variants alter biomolecular function (Figure 2a).While these molecular changes often go unobserved, they manifest themselves as measurable differences in traits at the population level (Figure 2b).Here we briefly introduce population genetics, reviewed by Karczewski & Martin (59), and genome-wide association studies (60) through the lens of the functional analysis of noncoding regions.We defer to other reviews for a discussion of large structural variants (61).

Databases of Human Genetic Variation
Increasing access to genetic sequencing has inspired efforts to catalog the observed genetic variation in humans and its distribution across populations.In 2010, the 1000 Genomes Project set out to generate the first freely available reference of human genetic variation (58).While initial research focused on whole-exome sequencing (i.e., strongly enriching for coding regions), access to WGS data is increasing, driven by biobanks and other large-scale collaborative efforts (6,7,62,63).The online database gnomAD provides statistics on observed variants across populations (64).
Noncoding variants are placed into potential gene-regulatory contexts through overlap with functional annotations [e.g., UTRs or regions of accessible chromatin (Figure 1c)] or by functional variant effect prediction, introduced below.

Genome-Wide Association Studies
Genome-wide association studies (GWAS) quantify the correlation of genetic sequence variants with heritable traits and disease (i.e., they measure the association of genotype and phenotype).
Biobanks that systematically collect genotypes and other data of hundreds of thousands of individuals have become major resources for GWAS (62,65).Due to cost and ease of processing, genotypes have mainly been measured using microarrays, which allow for predefined sets of typically hundreds of thousands of common variants.These are used to statistically impute millions of variants using variant reference panels (66).Imputation is possible because neighboring variants are inherited together, and therefore are highly correlated, a phenomenon termed as linkage disequilibrium (LD).
Phenotypes are derived from physical measurements, blood tests, imaging, questionnaires, or health records.Quantitative trait locus (QTL) studies are GWAS that use readouts from molecular assays as their phenotypes.For example, expression QTL (eQTL) studies, reviewed by Flynn & Lappalainen (67), correlate gene expression with genetic variants.However, even large eQTL studies like the GTEx (Genotype-Tissue Expression) project (68) have been limited in their sample sizes (<1,000 individuals).
The most common approach in GWAS independently tests each variant that reaches certain inclusion criteria (e.g., frequency and imputation quality) for its association with the phenotype.Specialized software allows these association tests to be rapidly performed, while correcting for covariates like age or sex and confounding by population stratification (e.g., ancestry) (69)(70)(71).GWAS summary statistics list the strength, direction, and p-values of associations between variants and the phenotype and can be visualized along the genome, similar to functional genomics tracks (72).However, LD prevents resolving association signals to the scale of regulatory elements (Figure 2b).Statistical fine-mapping, reviewed by Schaid et al. and Cano-Gamez & Trynka (73,74), is necessary to narrow down groups of correlated variants to one or a few candidate causal variants responsible for the associations at a genetic region.Many of these fine-mapping methods use functional genomics data because causal variants are enriched in regulatory regions like enhancers or promoters.
Rare-variant association studies that use sequencing-based genotyping suffer less from issues related to LD (e.g., signal localization).However, they face challenges with statistical power due to the large number of variants with very low frequencies and singletons.These issues are alleviated by variant aggregation and prioritization using variant effect predictions (see below).
The growing number of GWAS has inspired efforts to collect summary statistics and make them accessible through portals such as the NHGRI-EBI (National Human Genome Research Institute-European Bioinformatics Institute) GWAS Catalog (75).GWAS have revealed that common traits and diseases depend on many variants with small effects scattered throughout the genome (polygenicity) (59), which complicates downstream applications like the identification of relevant genes, cell types, or pathways through integration with functional genomics data (74).Besides providing biological insights, GWAS results are used for the construction of polygenic risk scores, as reviewed by Torkamani et al. and Lambert et al. (76,77).

LEARNING THE REGULATORY CODE
Before quantifying the impact of regulatory variants, one needs to understand where the functional sequence elements are and what function they impart.Going back to the days of the Human Genome Project, models for gene regulation have been built at levels that range from interactions of binding sites with the DNA to the function encoded in a single enhancer, to the expression of the affected gene, and ultimately to the organismal phenotype.

Biophysical Models of Binding and Position Weight Matrices
Since the early days of computational biology, researchers have been interested in models that adequately describe the target sequences of proteins binding to nucleic acids.Few features are identical across functional instances (restriction enzymes are an exception), and therefore flexible representations of short sequences of the same functionality (sequence elements), so-called motifs, have been needed.Looking at a typical TF binding site bound by the same TF, some positions will be flexible as to their nucleotide preferences, and others will be strict.Early attempts to capture this took the form of consensus sequences (regular expressions that include characters in addition to A, C, G, and T, such as W) to describe the presence of a nucleotide involved in a weak hydrogen bond (A or T).Such representations have the advantage of allowing for very fast string matching algorithms, but they are obviously limited and do not allow for more nuanced representations (e.g., that the presence of A or T may not be equally probable).
The most widespread representations of DNA/RNA sequence motifs have long been as matrices of size 4 × N. Given a set of aligned short sequences of length N, in our case target sequences of a specific TF, a position frequency matrix (PFM) first tabulates how many instances of each nucleotide are observed at each position and divides the entries by N. Each of the N columns then represents a multinomial distribution over the alphabet (here, of the 4 nucleotides), meaning that all N positions are considered independently.To avoid zero probabilities, one typically modifies the initial counts by including prior information (e.g., in the form of pseudocounts).
Given a longer sequence, the probability of each subsequence of length N being a target for the TF can then be computed by looking up the probabilities of each nucleotide of the subsequence at the corresponding position in the matrix and multiplying them.This operation of moving a short pattern (motif ) along a larger signal (a DNA sequence) and recording the resulting scores is called a convolution.The matrix representation gained traction not least because of its clear connection to statistical mechanics and binding affinity (78).In practice, the PFM is converted into a position weight matrix (PWM) (or position-specific scoring matrix), which normalizes the observed frequencies by a background of, for example, the nucleotide composition of the genome (79).
Since their introduction, several extensions of PWMs have been proposed, particularly to address issues of fixed length (targets of some TFs contain a spacer region of variable length) and of the independence between positions of a binding site.More intricate probabilistic models such as hidden Markov models or Bayesian networks have been developed and have shown clear improvements for at least some TFs (80).For some time, small amounts of available data limited the practical applicability of models with a larger number of parameters, a situation that changed with the availability of ChIP-seq experiments and high-throughput assays for determining in vitro binding affinities with typically hundreds to thousands of putative TF binding sites (81,82).Resources such as JASPAR provide precomputed models for large numbers of TFs for several eukaryotic species (83).

Deep Learning-Based Sequence Models
Deep neural networks are scalable, flexible, and automatically learn predictive tasks without the need for feature engineering (3) (i.e., the manual design or selection of informative input variables).These properties, and state-of-the-art performance, have led to their widespread adaptation in the field of genomics and elsewhere.The word "deep" in deep learning stems from the many layers present in these models, which consist of interconnected computational units called neurons.By stacking layers whose neurons perform linear operations followed by nonlinear activation functions, these models learn highly nonlinear functions of the input data (3).
Soon after deep neural networks began outperforming other methods on image classification tasks (84), they started being applied to predict gene-regulatory functions from sequence (85)(86)(87).The majority of sequence models are based on convolutional neural networks (CNNs).Sequence models of noncoding regulatory processes.Deep neural networks predict genomics data from the underlying nucleotide sequences at varying scales and resolutions.Long-range processes are modeled with larger input sequences.Because some models contain pooling operations, the output length can be shorter than the input length.Output data vary in their level of resolution, measured in base pairs per position.Splicing and highly resolved data (e.g., ChIP-nexus) can be modelled at base pair resolution.The model in the bottom row predicts coverage of the center 128-bp sequence from the 1,000-bp surrounding sequence.Abbreviations: bp, base pairs; ChIP-nexus, chromatin immunoprecipitation with nucleotide resolution through exonuclease, unique barcode, and single ligation.
Convolutional layers (i.e., the building blocks of CNNs) contain many fixed-size filters, which are scanned across sequences (4).In the first layer of a network, filters act as short motif finders (typically, ∼3-25 bp) and can learn sequence patterns capturing known binding preferences of regulatory proteins (87,88).However, the deep model structure allows for the representation of motif variants and combinations, positional dependencies, and multiple pattern occurrences within a larger sequence.Recent reviews have covered model architectures and applications in genomics (4,89), as well as the interpretation of neural networks for genomics using so-called explainable AI methods (90).Here, we focus on core concepts for the analysis of noncoding regions using sequence models and highlight recent advances.
Sequence models predict functional annotations like those presented in Figure 1 from the underlying nucleotide sequence and, optionally, inputs that capture sequence context (e.g., gene annotations) (91) (Figure 3).Sequences are typically represented by one-hot encoding on the nucleotide level (4), and models are trained to predict the presence of either binary features like peaks (classification) (85,86) or continuous readouts such as signal strength (regression) (49).Software packages facilitate transforming genomics file formats into the data types required for deep learning (92,93).
Models that predict features from DNA sequence can learn a variety of tasks at different levels of gene regulation, including binding of individual TFs (85,86) or RNA-binding proteins (85,91), accessibility of DNA (87), histone modifications (86,94), transcription (95,96), splicing (97), and 3D genome folding (98, 99) (Figure 4).Models are often trained to predict many outputs for the same input sequence (called multitarget or multilabel)-for example, readouts from many functional genomics experiments across time points and tissues.Because the regulatory code is largely conserved, models can be trained on data from multiple species at once (100).Specialized

Figure 4
Sequence interpretation with neural networks, and prediction of damaging variants.(a) The variant C → T disrupts binding of a transcription factor.A deep neural network is presented with both the reference sequence (ref. ) and alternative sequence (alt.)containing G (two forward passes).The model predicts lower probabilities of observing DNase accessibility peaks for the alternative sequence.The difference between reference and alternative shows the direction of effects.(b) A gradient-based interpretation method is applied to the neural network using the reference sequence as input (forward and backward pass).The method highlights transcription factor binding sites.(c) The prediction of noncoding damaging variants typically relies on functional genomics data.
architectures confer biophysical interpretations to model parameters and components (101), and prior knowledge (e.g., TF binding preferences) can be used to initialize model parameters, which can increase performance (94).
The predictive tasks are reflected in the key properties of sequence models, including the length of the input sequence in nucleotides, the choice of output coding and resolution, and the receptive field (102) of neurons in the output layers (Figure 3).The receptive field measures the number of nucleotides in the input sequence that can influence the predictions at every position in the output layer; that is, it captures the ability of the model to integrate long-range interactions.CNNs use spatial pooling and fully connected layers before the output layer to grow the receptive field.Fully convolutional CNNs rely on pooling and exponentially dilated convolutions in order to grow the receptive field to tens of thousands of base pairs (49,97).Models for processes involving the folding of DNA or RNA (e.g., transcription or splicing) benefit from larger receptive fields (96,97).
Recently, transformers (103) have been used to increase the theoretical receptive field to hundreds of thousands of base pairs in order to improve the performance of transcription prediction (96).Models with larger receptive fields more accurately predict accessibility, ChIP-seq, and transcription (measured by CAGE), showing that the underlying data reflect the results of both local (∼100 bp) and long-range (>10,000 bp) interactions (49).However, most signals tend to be captured at short scales (49,96,97), and predictive performance may not be the best measure to determine the utility of a model for downstream tasks (104).

In Silico Analysis of Regulatory Elements with Sequence Models
Given a trained sequence model, in silico mutagenesis predicts the functional effects of genetic variants (86, 87) (Figure 4a).Because most models are trained only on the reference sequence, ignoring genetic variants, this task can be seen as a form of zero-shot transfer learning (i.e., repurposing for a different task without adjusting the model's parameters).First, a model predicts functional annotations for the reference sequence and the alternative sequence containing the variant.The difference between reference and alternative prediction is referred to as the allelic functional variant effect prediction.Saturated in silico mutagenesis investigates all possible single-nucleotide substitutions of a sequence and can highlight important nucleotides.However, this approach is computationally expensive, as it relies on many predictions (forward passes through the network), and it might not capture all important parts of the sequence if there is redundancy (105,106).
Methods for sequence instance interpretation rely on the forward propagation of activation differences (e.g., DeepLIFT or DeepSHAP) (105,107) or backward propagation of gradients (108) in order to highlight important nucleotides (Figure 4b).While computationally more efficient than mutagenesis, one drawback of these methods is that they have to compare against a background, the choice of which can considerably affect the results.Occlusion-based methods such as the recently proposed Scrambler (106) learn to predict occlusion masks that either completely destroy or preserve the prediction of a trained model by reshuffling.The occlusion masks (or their inverses) are used to highlight important subsequences.
Once important short subsequences are identified, downstream algorithms identify common patterns in these subsequences across instances in order to distill the global knowledge encoded in the model (91).This approach resembles the search for PWMs, but it is highly nonlinear.For example, TF-MoDISco (transcription factor motif discovery from importance scores) uses a series of clustering and other steps to generate consolidated motifs (109).
Learned attention mechanisms allow neural networks to dynamically weigh (attend to) and combine different parts of a sequence for prediction (110).This allows models to highlight important parts of sequences from a single forward pass.Co-attention to subsequences has been used to identify cooperative binding between TFs (111).Self-attention constructs 2D attention matrices that capture pairwise interactions between even very distant parts of sequences (103), and these 2D attention matrices can be useful for enhancer-promoter contact prediction (96).
Finally, sequence models allow complex in silico experiments to be performed that are difficult to perform in vivo.For example, by reversing all short sequences predicted to attract the TF CTCF, Fudenberg et al. (98) confirmed that their model learned the previously described influence of CTCF-binding motif directionality on loop formation (113).Avsec et al. (114) utilized a basepair-resolution model to perform computational experiments to answer questions about spacing and cooperativity between TF binding sites.

Evaluation of Sequence Models
The evaluation of sequence models has focused on both the predictive performance of data held out for the target task (the task optimized during training) and models' ability to correctly predict the effects of genetic variants (transfer task).Holdout data comprise data on genomic regions not used during model training.Because functional genomics data are noisy, perfect predictions are never expected.The concordance between experimental replicates serves as a baseline for achievable model performance (49).Evaluating a model's performance on the same sequences that were used for training (data leakage) should be avoided, even for holdout experimental replicates, as it inflates performance estimates (115).This has been seen as less of a problem, and, in fact, almost unavoidable, for the transfer task of predicting genetic variant effects.Predicted effects on gene expression can be evaluated directly using data from reporter assays or fine-mapped eQTL data from matched cell types (96).
Functional predictions can be linked to association signals from GWAS or QTL studies through methods such as stratified LD score regression (116) or overlap/enrichment-based approaches, reviewed by Cano-Gamez & Trynka (74).The results of these analyses should always be contrasted against direct use of the training data or other strong baselines (104).Finally, the utility of models can be estimated on transfer tasks such as the prediction of damaging variants (86), as outlined in the next section.

PREDICTION OF DAMAGING VARIANTS
Damaging variants are functional variants that negatively impact biomolecular function (117).The majority of algorithms for the prediction of damaging variants have focused on coding variation.Protein-truncating variants and splice donor/acceptor variants are generally considered damaging (118), while variants that change the amino acid sequence are further contextualized using sequence conservation and molecular modeling (9,119,120).Sequence conservation can be considered either across species (121) or within the human population (64).The absence of genetic variation between species or within populations is seen as evidence for variation intolerance, and variants in variant-depleted regions are considered potentially damaging (64).
In addition to conservation, many algorithms for the prediction of damaging variants in noncoding regions take into account functional sequence annotations (8) (Figure 4c).These annotations include the location relative to transcripts (e.g., promoter, UTR, intron, splice region) or transcript-agnostic annotations (e.g., CTCF binding site, accessible region, chromatin states).Annotations can also include functional variant effect predictions derived from sequence models.For example, the popular variant effect prediction tool CADD (combined annotation-dependent depletion) has been updated to include predictions from SpliceAI, a deep learning model that predicts splicing (97,122).However, most tools for noncoding variants do not yet incorporate allelic functional effect predictions and therefore lack the ability to distinguish variants with opposing predicted functional effects at the same location.
Methods that predict damaging variants are optimized or evaluated using verified variants from databases [e.g., ClinVar (123) or HMGD (Human Gene Mutation Database) (124)] or allele frequency data.The latter approach distinguishes between common variants, which are depleted of damaging variants by natural (purifying) selection (59,117), and rare variants like singletons, which have not been depleted.Generally, constructing appropriate training sets for these methods is challenging, and independent evaluations have exposed a lack of generalizability (125,126).Therefore, algorithms that do not rely on predefined sets of variants have been proposed (9).

FUNCTIONAL PREDICTIONS IN RARE-VARIANT ASSOCIATION STUDIES
We have previously introduced GWAS that test variants individually.For the many rare variants observed by sequencing-based genotyping in seqGWAS, this approach lacks statistical power: First, the effect size for any single rare variant would need to be very large in order to reach statistical significance.Second, LD is low for rare variants, and therefore noncausal rare variants cannot effectively tag nearby causal variants.Third, the many very rare variants and singletons would drastically increase the burden of multiple rounds of testing (besides being statistically inappropriate).
To address these problems, rare-variant association tests increase the weights of potentially causal variants and aggregate variants in groups (127) (Figure 5).In addition, variant inclusion criteria based on functional annotations or effect predictions are used to increase the fraction of potentially causal variants.Filtering and weighting increase power if the causal mechanisms at a locus are correctly identified and noncausal variants are excluded from the test.As the true biology of each locus and its influence on the trait are unknown, it is common to vary the inclusion criteria for variants and perform multiple tests per locus (10,128,129).The p-values arising from different variant groups tested at the same locus can be aggregated, for example, using omnibus tests like the Cauchy combination test (130).
The two main types of association tests that aggregate variants are variant collapsing tests (also called burden tests) and kernel-based tests (also called variance-component tests).Variant collapsing tests aggregate all variants within a group into a single variable before testing.They have high statistical power if the variants affect the phenotype in the same direction (e.g., increasing the probability of disease) and most included variants are causal.Kernel-based tests are advantageous if variants have opposing directions of effect or if there are fewer causal variants (131).Tests that combine collapsing and kernel-based tests in order to universally increase statistical power have been proposed (132,133).
In exome-sequencing studies, which currently are still the majority of seqGWAS, variants are often grouped by gene, and potentially damaging coding variants are readily identified.Collapsing tests are effective in coding regions because the majority of damaging coding variants lead to a loss of function, and therefore have aligned effects on the phenotype within the same gene.We have recently shown that kernel-based tests have advantages if gain-of-function variants are present, which are typically limited to only few amino acid positions in a gene (129).
For noncoding variants, the grouping of variants is less straightforward, and sliding windows and groups defined by specific chromatin states have been used (e.g., enhancer-like regions, CTCF binding sites) (10).Methods that aggregate signal at the gene level by incorporating genome folding data have also been proposed (134,135).
Rare-variant tests typically allow for the incorporation of variant weights, which can be derived using variant annotations such as allele frequencies or functional variant effect predictions (129,132,(136)(137)(138).The directionality of functional variant effect predictions can also be taken into account, which could enable, for example, the separate collapsing of variants predicted to increase or decrease expression at a specific locus.The different components (i.e., matrices) used to construct a functionally informed rare-variant test are depicted in Figure 5.Both variant annotations (e.g., functional effect predictions) and variant similarities (e.g., physical proximity between variants) can be taken into account.By incorporating functional variant effect predictions, functionally informed tests directly provide hypotheses on the implicated cis-regulatory mechanisms.
Reporting standards for aggregated tests are still being established (139).When reporting the results of grouped tests, it has been recommended to report all the variants that contributed to the combined test and, ideally, the type of variant effect and algorithm that was used to identify these variants (if functional predictions were performed).This allows novel variants that appear in the same sequence context to be contextualized.

DISCUSSION
As functional genomics and genotype data keep expanding, driven by advances in experimental and sequencing technologies, the analysis of noncoding regions is becoming increasingly complex.In this review, we have introduced topics relevant to the analysis of noncoding regions, with a focus on functional genomics data for gene expression, variant effect prediction with sequence models, and the integration of these predictions with (seq)GWAS.We focused on single short genetic variants; however, phased genotype data could allow variant interactions to be investigated.
We showed how deep learning has become a valuable tool to perform in silico experiments and predict the functional effects of variants through transfer learning.We see great promise in improvements of model interpretability (90), as well as in architectures that allow for the integration of large sequence contexts (103).
In order to make use of these advances in the context of rare variants, software for rare-variant association tests needs to be flexibly designed to accommodate the many types of variant annotations and effect predictions available.Reporting standards for associations found by such methods need to be established (139).This is critical for their application in personalized healthcare or for the inclusion of variants in polygenic scores.
While genomics deep learning has greatly profited from advances in natural language processing and image analysis, algorithms tailored to the specific properties of genomics data could further increase performance and interpretability.There is a need for models that more accurately predict cell type-specific effects by incorporating new types of data or combining existing datasets.Single-cell data analysis, reporter assays using CRISPR (19), and the analysis of complex structural variants provide promising avenues for future research.
In general, the linkage of functional variant effect predictions for noncoding variants to association signals from GWAS, their incorporation into rare-variant association tests, and their application in clinical settings require further research.We observe a lack of consensus in evaluation strategies and independent benchmarks, which makes it difficult to assess the utility of predictions for downstream applications (125,126).
When designing algorithms for tasks like variant effect prediction, ethical considerations increasingly need to be taken into account.For example, if data used to train models come only from individuals of a specific ancestry, models may not generalize well to other ancestries, which could increase disparities in care.The reporting of results from such algorithms on the personal level also needs to be scrutinized (140).
Finally, self-supervised learning could help create powerful zero-shot models for noncoding variants, as has been shown for large corpora of text in natural language processing (141), and recently applied to protein variation (9).

Errata
An online log of corrections to Annual Review of Biomedical Data Science articles may be found at http://www.annualreviews.org/errata/biodatasci

Figure 1
Figure 1 Model of transcriptional regulation and its observation through functional genomics data.(a) Two enhancers are brought close to a promoter, where they activate transcription of a gene.The DNA bends and forms loops.(b) Close-up of an enhancer bound by transcription factors.The flanking histones bare typical modifications (H3K27ac, H3K4me1).The DNA is nucleosome-free and, therefore, accessible.(c) Functional genomics data measure processes depicted in panels a and b.The reference sequence provides the coordinate system (x-axis).A 2D interaction map shows interaction frequencies between regions of the DNA.Contacts between enhancers and promoters appear as dark regions on the off-diagonal elements.Sequence conservation and the location of genetic variants are shown together with functional genomics data tracks.Peak calls appear as bars underneath continuous signal tracks.The enhancer highlighted in gray is conserved, is accessible (measured by DNase-seq), and has elevated H3K27ac (measured by ChIP-seq).Chromatin states are inferred from functional genomics data.Abbreviations: ChIP-seq, chromatin immunoprecipitation and sequencing; DNase-seq, DNase I hypersensitive sites sequencing.

Figure 2
Figure 2Functional enhancer variants and their observation through genome-wide association studies (GWAS).(a) An enhancer variant disrupts the binding of a transcription factor.This leads to fewer contacts with the promoter and, therefore, lower expression of a gene involved in cholesterol metabolism.The causal enhancer variant is inherited together with nearby noncausal tagging variants.(b) A GWAS measures the correlation of variants' alternative allele counts with cholesterol in a large population, shown by allele dosage plots with regression lines (effect sizes are exaggerated for clarity).All variants are significantly correlated with the trait ( * * * p-value < 5 × 10 −8 ), which hinders identification of the causal variant.

a
In silico mutagenesis b Instance interpretation c Prediction of damaging variants Δ = alt.-ref.
Sequence annotations in rare-variant association studies.Models predict effects for all observed genetic variants, and variants are grouped by the type of variant effect prediction, their locations (sliding windows), or their overlaps with predicted functional elements.Each group of variants is tested separately.Functionally informed rare-variant association tests integrate groups' genotypes with variant annotations and variant-variant similarities.Noncollapsing tests construct kernel matrices that capture genetic similarities between individuals, whereas burden tests (collapsing tests) test single variables.Abbreviation: CCTF, CCCTC-binding factor.
Human Genomics of COVID-19 Pneumonia: Contributions of Rare and Common Variants Aurélie Cobat, Qian Zhang, COVID Human Genetic Effort, Laurent Abel, Jean-Laurent Casanova, and Jacques Fellay p p