Illuminating the Virosphere Through Global Metagenomics.

Viruses are the most abundant biological entity on Earth, infect cellular organisms from all domains of life, and are central players in the global biosphere. Over the last century, the discovery and characterization of viruses have progressed steadily alongside much of modern biology. In terms of outright numbers of novel viruses discovered, however, the last few years have been by far the most transformative for the field. Advances in methods for identifying viral sequences in genomic and metagenomic datasets, coupled to the exponential growth of environmental sequencing, have greatly expanded the catalog of known viruses and fueled the tremendous growth of viral sequence databases. Development and implementation of new standards, along with careful study of the newly discovered viruses, have transformed and will continue to transform our understanding of microbial evolution, ecology, and biogeochemical cycles, leading to new biotechnological innovations across many diverse fields, including environmental, agricultural, and biomedical sciences.

Figure: Major developments in viral discovery. A timeline of some of the most significant discoveries and events in virology leading to the current era of metagenomics methods. Timeline labels include: "Contagium vivum fluidum" or "filterable infectious agents"; bacteriophages discovered; EM allows first visualization of viruses, morphometry is used for classification; viral discovery in (meta)genomes, exponential growth of databases. "Contagium vivum fluidum" (contagious living fluid) was the term Beijerinck used to describe tobacco mosaic virus. The scale of the timeline has been expanded following 2010. Items in red text highlight prominent examples of recent publications using metagenomic approaches (114-119). Abbreviations: cryo-EM, cryogenic EM; EM, electron microscopy; ICTV, International Committee on Taxonomy of Viruses; NCLDVs, nucleocytoplasmic large DNA viruses; NGS, next-generation sequencing; PCR, polymerase chain reaction.
based on the length of their genomes, allowing researchers to compare the composition of many viral communities across multiple samples (21-24). Another common technique made use of the growing number of viral gene and genome sequences available in public data repositories. Using this information, researchers developed PCR (polymerase chain reaction)-based assays targeting genes unique to specific groups of viruses, which could then be used to detect those viruses in a sample. This technique enabled some of the first quantitative comparisons of the distribution of specific viruses across space and time (25). It was also possible to compare the sequences of PCR-amplified genes, further increasing resolution and demonstrating that even among morphologically similar viruses, diversity at the sequence level is quite high (26,27). As these techniques were applied to a wide variety of viruses, one of the striking findings was that in many cases highly similar viral sequences could be detected across very broad geographic ranges (28-30). The major limitation of these approaches, however, is that each assay must target an individual viral group, as there is no universal marker gene for all viruses, unlike the 16S and 18S ribosomal genes widely used for PCR-based surveys of prokaryotes and eukaryotes.

Assembly: the bioinformatic method for combining the relatively short reads produced by the sequencer into longer sequences called contigs and scaffolds
IMG/VR: a comprehensive viral database, including over two million high-quality genomes and genome fragments derived from metagenomes
Virome: the entire collection of viruses in a given environment or sample or a sample enriched for viral particles
Metagenome: a collection of sequences representing the genomes from multiple organisms found together in a single sample

WHAT IS A VIRUS?
Viral particles are not only much smaller than most cellular life but also are much simpler, typically composed of a short piece of DNA or RNA surrounded by a layer of protein called a capsid and, in some viruses, of a lipid membrane called the viral envelope (120). Viruses have no metabolism of their own, so in order to replicate they must infect a host cell and take advantage of its metabolic machinery, followed by the release of new copies of the virus into the environment. Viruses lead one of three different lifestyles: (a) lytic, where host infection is followed directly by replication and release of new viral particles, killing the host in the process; (b) chronic, where viral replication takes place over long periods of time and release of new virions is less than lethal to the host; or (c) lysogenic, where the viral genome is integrated into the host's and remains dormant for a period of time before being activated and lysing the cell (121).

METAGENOMICS AND THE MODERN APPROACH TO VIRAL DETECTION
Starting in the mid-2000s and continuing until today, sequencing technologies have advanced and costs have come down by several orders of magnitude (31). These advancements have fueled the rise of viral metagenomic sequencing, which involves sequencing the total viral DNA or RNA from individual samples (e.g., soil, seawater, or host-associated), bypassing any culturing in the laboratory. Typically, DNA or RNA is extracted from an environmental sample, fragmented, and then sequenced, generating millions of short reads (e.g., 100-200 bp) that are assembled into contigs. Metagenomic viral contigs are then identified using computational tools and algorithms that use a variety of viral-specific sequence features and signatures, providing unprecedented resolution on viral genomic diversity (32-36). However, metagenomic assembly is challenging, particularly for viruses with repetitive genomic elements, viruses from diverse subpopulations, or viruses present at low abundance (37). To address these challenges, researchers have found long-read sequencing (e.g., Oxford Nanopore and PacBio technologies) to be useful for sequencing viral genomes and transcriptomes from environmental samples without the need for assembly (38-40). These technological advances have fueled an unprecedented explosion in the amount of viral sequence data generated by various labs around the world. The largest database of viral genomes is IMG/VR (Integrated Microbial Genomes/Virus) at the Department of Energy's Joint Genome Institute (41), which houses over two million sequences derived from mostly uncultivated viruses.
There are three main strategies for sequencing viruses from the environment: virome sequencing, bulk metagenome sequencing, and single-cell sequencing. The majority of viral studies to date perform virome sequencing, which involves size filtration to enrich for the viral fraction before sequencing, similar to what is typically done before a microscopy-based analysis. Enriching for the viral fraction (the so-called virome) results in improved sequence coverage of viruses by minimizing the number of reads wasted on the sample's cellular fraction (see Figure 2). For researchers primarily interested in the virome, this approach increases the viral signal-to-noise ratio in the resulting sequence data, although some contaminating cellular sequences and plasmids often remain and must be removed for most analyses (42,43). However, due to low sample biomass, genome amplification techniques (e.g., multiple displacement amplification) are often employed, which can distort abundance and result in marked biases, including overamplification of small circular single-stranded DNA viruses (44-46). This approach may also exclude viruses that lie outside the standard fractionation size and weight cutoffs used (47), viruses that are attached to cells, viruses replicating inside of cells, and temperate viruses that have integrated into the host's genome.
An alternative to virome sequencing is to skip the viral enrichment step altogether and sequence the bulk DNA or RNA found in a sample (see Figure 2), followed by computational separation of viral and cell-derived sequences. This approach greatly expands the number of samples that can be mined for viruses, as it includes metagenomics datasets collected to address other scientific objectives and is primarily responsible for the exponential growth in viral sequence databases over the last few years. Additionally, this approach allows for identifying proviruses that have integrated into a host genome, although it remains challenging to identify the sequence boundary between virus and host and to distinguish between viable proviruses and ancient or degraded remnants of proviruses (48). Lastly, since sequences from both viral and cellular origins are produced from the same sample, this approach allows for additional analysis based on associating the viruses identified in the sample together with their presumptive hosts. The major downside of this approach is that since the majority of reads derive from cellular organisms, it is considerably more challenging to assemble low-abundance viruses or viruses with large genomes.
A third approach for obtaining viral sequences from environmental samples involves using a flow cytometer to isolate viruses associated with single cells or even individual viral particles. This approach usually requires an amplification step due to the small amounts of nucleic acid found in the individually isolated cells or virions (49, 50). Single-cell approaches can provide very high resolution of virus-host interactions and can quantify and characterize the dynamics of the interaction, e.g., the number of lytic versus lysogenic infections in a community (51, 52). Single-virus approaches are especially useful in addressing some of the challenges of reconstructing complete viral genomes from metagenomes (53). A variation on single-cell sequencing involves combining fluorescently stained viral particles with either cultured bacterial hosts or even uncultured cells from the environment and then sorting and collecting the cells that the viruses target (54, 55). This increases the power of single-cell sequencing by enriching the sequenced portion for cells with attached viral particles, although perhaps at the expense of missing many of the sample's proviruses (which are contained in cells not tagged by a virus). However, with all single-cell approaches, DNA amplification can result in highly fragmented assemblies, and it can also be challenging to discriminate between viruses that were attached to the cell and viruses integrated into the host's genome.

Provirus: a viral sequence that has been integrated into the host's genome and is not actively being replicated to produce new viral particles

COMPUTATIONAL METHODS HAVE FUELED AN ACCELERATED PACE OF DISCOVERY
Regardless of the environmental sequencing approach (e.g., virome, bulk metagenome, or single cell), it is essential to apply a computational method for separating viral and nonviral (e.g., cellular, plasmid) sequences in silico. Existing computational methods follow one of two broadly defined approaches: (a) matching the sample sequences directly to a set of reference sequences, or (b) using a classification algorithm to label all of the sample's sequences as either viral or nonviral. In general terms, the trade-off between the approaches is that the former is typically applied on the unassembled reads, and therefore can be computationally fast, but has a higher risk of false negatives. The latter approach requires assembly and therefore is more expensive, but it allows for far greater flexibility in identifying novel, uncharacterized viruses not found in the reference set but that exhibit key viral features (see the sidebar titled Characteristics of Viral Genomes), although at a higher risk of false positives (56).
These two approaches need not be used in isolation from each other. In fact, some of the most impactful efforts, in terms of expanding the databases of known viral sequences, have utilized something resembling an iterative strategy, whereby viral sequences are identified in new datasets by matching to known viral genes, followed by the use of the novel genes discovered on those sequences as baits to identify more novel viruses and augment the reference set. Reference matching can then be applied yet again to the sample data using the expanded reference set. Care needs to be taken, of course, to prevent adding noise to the reference set by ensuring that the novel sequences are bona fide viral, which typically means that they must meet an expertly defined set of characteristics common to viral genes and genomes. This iteration expands the reference sets with the addition of each newly identified viral sequence, increasing the power of the reference-based approach while simultaneously improving the accuracy of the classifiers due to the availability of more reference data for training. Thus, the two methods are mutually reinforcing (57, 58).

CHARACTERISTICS OF VIRAL GENOMES
Viral genomes exhibit a wide range of sizes. The smallest circoviruses, which infect mostly birds and pigs, have single-stranded DNA genomes less than 2,000 nucleotides long and only encode three or four proteins (122). Pandoraviruses, double-stranded DNA viruses that infect amoebas, have the largest known viral genomes, at over two million nucleotides encoding more than 2,000 proteins (123). These two genera are the extremes, of course, and most viruses fall somewhere in between. The majority of bacteriophages, for example, have genome lengths of around 5-10 kb up to 30-50 kb (124).
Despite lacking a universal viral marker gene, viruses have many hallmark genes that are unique to viruses and not found in any cellular genomes. These include genes such as those encoding capsid, portal, and terminase proteins, all of which are involved in the formation and packaging of viral particles (125). In addition, other features common to many viral genomes include a lower rate of strand switching (long stretches of genes encoded on the same strand of a double-stranded genome), smaller average gene size, and an enrichment in genes of unknown function (49). These characteristics of viral genomes can be useful in classifying sequences obtained from environmental samples as either viral or cellular (56).
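The genome-level features named in the sidebar titled Characteristics of Viral Genomes (strand-switching rate, average gene size, and the fraction of genes of unknown function) are simple to compute once genes have been predicted on a contig. The following is a minimal sketch in Python, not the method of any published tool; the `Gene` record and all function names are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gene:
    """Hypothetical record for one predicted gene on a contig."""
    start: int       # 1-based start coordinate
    end: int         # inclusive end coordinate
    strand: str      # "+" or "-"
    annotated: bool  # True if a function could be assigned


def strand_switch_rate(genes: List[Gene]) -> float:
    """Fraction of adjacent gene pairs that lie on opposite strands."""
    ordered = sorted(genes, key=lambda g: g.start)
    if len(ordered) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(ordered, ordered[1:]) if a.strand != b.strand)
    return switches / (len(ordered) - 1)


def mean_gene_length(genes: List[Gene]) -> float:
    """Average gene length in nucleotides."""
    if not genes:
        return 0.0
    return sum(g.end - g.start + 1 for g in genes) / len(genes)


def unknown_function_fraction(genes: List[Gene]) -> float:
    """Fraction of genes with no assigned function."""
    if not genes:
        return 0.0
    return sum(1 for g in genes if not g.annotated) / len(genes)


# Toy contig: four genes, one strand switch, three genes without annotation.
genes = [
    Gene(1, 300, "+", False),
    Gene(350, 650, "+", False),
    Gene(700, 1000, "-", True),
    Gene(1100, 1400, "-", False),
]
features = {
    "strand_switch_rate": strand_switch_rate(genes),
    "mean_gene_length": mean_gene_length(genes),
    "unknown_function_fraction": unknown_function_fraction(genes),
}
```

In practice such values would be one input among several (alongside hallmark-gene hits, for example) to a classifier; any decision thresholds would need to be calibrated on reference data and are deliberately not suggested here.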
With the diversity of experimental and computational methods for identifying viruses (see Table 1), it is important to follow established reporting guidelines to enable the validation and replication of results. To facilitate this, a broad coalition of experts in virology and genomics established the Minimum Information About an Uncultivated Virus Genome (MIUViG) standards (59). The standards mandate reporting of various metadata categories, including the type of dataset generated (virome, bulk metagenome, single-cell, etc.), the sequence assembly method, the software used to identify viral sequences, the predicted genome characteristics (i.e., single- or double-stranded DNA or RNA, sense or antisense, segmented or nonsegmented), whether the sequence is an integrated provirus, the genome quality (finished, high-quality, or fragment), and the number of contigs that comprise the genome. Additional optional metadata that should be reported with new viral sequences include predicted taxonomic classification, predicted host, feature annotations (identifying gene-coding regions, provirus integration sites, etc.), and other experimental details such as the sorting method (for single cells) and the enrichment method (for viromes).
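As an illustration of how the mandatory and optional categories listed above might be captured in a submission pipeline, the sketch below defines a small record type with a basic validity check. It is only a sketch: the class name and field names are hypothetical and are not the official MIUViG field identifiers, and the only controlled vocabulary enforced is the three genome-quality tiers named in the text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Quality tiers as named in the text above.
ALLOWED_QUALITY = {"finished", "high-quality", "fragment"}


@dataclass
class UncultivatedVirusRecord:
    """Hypothetical container for the metadata categories listed in the text."""
    dataset_type: str        # e.g. "virome", "bulk metagenome", "single-cell"
    assembly_method: str     # assembler name and version
    detection_software: str  # tool used to identify the viral sequence
    genome_type: str         # e.g. "dsDNA", "ssRNA(+)"
    is_provirus: bool
    genome_quality: str      # one of ALLOWED_QUALITY
    num_contigs: int
    # Optional categories
    predicted_taxonomy: Optional[str] = None
    predicted_host: Optional[str] = None

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means the record passes."""
        found = []
        for name in ("dataset_type", "assembly_method", "detection_software", "genome_type"):
            if not getattr(self, name).strip():
                found.append(f"{name} is empty")
        if self.genome_quality not in ALLOWED_QUALITY:
            found.append(f"genome_quality must be one of {sorted(ALLOWED_QUALITY)}")
        if self.num_contigs < 1:
            found.append("num_contigs must be at least 1")
        return found


# Placeholder values for illustration only.
record = UncultivatedVirusRecord(
    dataset_type="bulk metagenome",
    assembly_method="example-assembler 1.0",
    detection_software="example-detector 2.0",
    genome_type="dsDNA",
    is_provirus=False,
    genome_quality="high-quality",
    num_contigs=1,
)
```

Calling `record.problems()` before submission gives a quick check that no mandatory category was left blank.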
One particularly challenging task is assessing the quality of new viral sequences, which can range from small genome fragments to complete and near-complete genomes. For bacteria and archaea, genome quality is often estimated based on the presence and copy number of widely

VIRAL EVOLUTION AND SYSTEMATICS
Inferring the phylogenetic relationships among viruses remains a major challenge due to the complexities inherent to viral systematics. First, horizontal gene transfer and recombination are widespread among viruses, resulting in genomic mosaics (1,137-142). Second, many viruses evolve at such a rapid rate that there is little or no remaining information in the currently observable genomes to deduce higher-level phylogenetic relationships (143). Furthermore, it is thought that viruses may have arisen multiple times over evolutionary history, making it impossible to trace their origin to a single common ancestor (144,145). The International Committee on Taxonomy of Viruses (ICTV) is the authority on viral taxonomy and nomenclature. Recently, the ICTV has revised both the taxonomic rank structure used for virus classification, by increasing the number of accepted ranks to 15, and the data requirements for classifying novel viruses, by allowing the use of sequence data in the absence of EM images (146,147). The former allows for the possibility of hierarchically connecting all viruses (although phylogenetic relationships are not necessarily implied) and the latter indicates an adaptation to the enormous amount of sequence data being generated in recent years, representing many novel viruses known only through their sequences. Finally, although the recent increase in viral sequence data generation means that it is possible, and often necessary, to describe viral taxonomies using only sequence data, it is still recommended to consider phenotypic information such as morphology whenever feasible so that taxonomic classifications reflect biologically meaningful divisions (148).

[...] of terminal repeats or provirus integration sites. When applied to 735,106 viral sequences from the IMG/VR version 2 (v2.0) database, this approach was able to accurately estimate completeness for the majority of sequences from host-associated, marine, freshwater, and soil environments. In the case of proviruses, CheckV also predicts the host-virus sequence boundary, which allows for removal of the host region and improves the identification of bona fide auxiliary metabolic genes in the virus. For example, this approach removed numerous antibiotic resistance genes found on IMG/VR sequences, which are likely to be cellular-encoded given previous work showing that phages rarely encode resistance genes (64).
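The arithmetic behind a length-based completeness estimate, and the trimming of a host region once a host-virus boundary has been predicted, can be sketched as follows. This is not CheckV's algorithm: the sketch assumes that an expected genome length and the boundary coordinates are already available from some upstream step, and all function names are hypothetical. The 90% cutoff follows the definition of high-quality genomes used elsewhere in this review (more than 90% of the expected genome length).

```python
def estimate_completeness(contig_length: int, expected_length: int) -> float:
    """Completeness as a percentage of the expected genome length, capped at 100."""
    if expected_length <= 0:
        raise ValueError("expected_length must be positive")
    return min(100.0, 100.0 * contig_length / expected_length)


def is_high_quality(completeness: float, cutoff: float = 90.0) -> bool:
    """High quality means strictly more than `cutoff` percent complete."""
    return completeness > cutoff


def trim_host_region(sequence: str, viral_start: int, viral_end: int) -> str:
    """Keep only the predicted viral region (1-based, inclusive coordinates)."""
    if not (1 <= viral_start <= viral_end <= len(sequence)):
        raise ValueError("viral region must lie within the sequence")
    return sequence[viral_start - 1:viral_end]


# A 47 kb contig whose expected genome length is assumed to be 50 kb.
completeness = estimate_completeness(47_000, 50_000)
high_quality = is_high_quality(completeness)

# A toy provirus: host flanks in lowercase, predicted viral region in uppercase.
provirus = "acgtacgt" + "ATGGCGTTAGC" + "ttgacc"
viral_only = trim_host_region(provirus, 9, 19)
```

Genes falling outside the retained region, such as the host-encoded resistance genes mentioned above, are thereby excluded from the viral annotation.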

INFERRING THE EVOLUTIONARY ORIGINS AND RELATIONSHIPS AMONG VIRUSES
Assigning a taxonomic classification to newly discovered viruses is not trivial, and viral taxonomies often undergo frequent revision to reflect the latest understanding of viral evolution (see the sidebar titled Viral Evolution and Systematics). Traditional phylogenetic methods used for prokaryotes and eukaryotes do not work for viruses, which lack universally distributed marker genes (65). To address this limitation, researchers have developed several alternative methods that utilize a variety of strategies. One strategy involves clustering genomes into viral operational taxonomic units (vOTUs) based on average nucleotide identity (ANI) or shared gene content. For example, the MIUViG standards proposed a threshold of 95% ANI over 85% genome length for delineating species-level vOTUs (59), while another approach utilized gene sharing networks to delineate vOTUs at approximately the genus or subfamily ranks (66). In either case, reference genomes can be included in the clustering in order to transfer taxonomic annotations at the appropriate rank within vOTUs. For example, using this approach, Roux [...] Caudovirales order utilizing a phylogeny derived from 77 common, yet nonuniversal, proteins (67). Another method, which has been adapted from a framework widely used for taxonomic classification of bacteria and archaea, incorporates nucleotide and amino acid alignments, clustering, phylogenetic inference, and a flexible set of distance metrics (68). Yet another method recently described combines sequence alignments for closely related viruses with a novel mutual information metric for more distantly related viruses (69).
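To make the 95% ANI over 85% genome length criterion concrete, the sketch below groups genomes into species-level vOTUs from precomputed pairwise alignments. Several choices here are assumptions made for illustration and are not prescribed by the text: the greedy, longest-genome-first centroid strategy; the interpretation of the 85% coverage as relative to the shorter genome of each pair; and the input format and names (`lengths`, `hits`, `cluster_votus`).

```python
from typing import Dict, List, Tuple


def cluster_votus(
    lengths: Dict[str, int],
    hits: Dict[Tuple[str, str], Tuple[float, int]],
    min_ani: float = 95.0,
    min_af: float = 0.85,
) -> Dict[str, List[str]]:
    """Greedy centroid clustering of genomes into vOTUs.

    lengths: genome name -> genome length in bp
    hits:    (genome_a, genome_b) -> (ANI in percent, aligned length in bp)
    Returns a mapping of centroid name -> member genomes (centroid included).
    """
    # Longest genomes are considered first so that they become centroids.
    order = sorted(lengths, key=lambda g: (-lengths[g], g))
    clusters: Dict[str, List[str]] = {}
    for genome in order:
        for centroid in clusters:
            hit = hits.get((genome, centroid)) or hits.get((centroid, genome))
            if hit is None:
                continue
            ani, aligned_bp = hit
            # Coverage is measured against the shorter genome (an assumption).
            aligned_fraction = aligned_bp / min(lengths[genome], lengths[centroid])
            if ani >= min_ani and aligned_fraction >= min_af:
                clusters[centroid].append(genome)
                break
        else:
            # No centroid was close enough: this genome founds a new vOTU.
            clusters[genome] = [genome]
    return clusters


lengths = {"A": 40_000, "B": 38_000, "C": 35_000, "D": 5_000}
hits = {
    ("A", "B"): (97.0, 36_000),  # high ANI, about 95% of B aligned
    ("A", "C"): (96.0, 20_000),  # high ANI, but only about 57% of C aligned
    ("B", "C"): (99.0, 34_000),  # B is not a centroid, so this pair is never consulted
    ("C", "D"): (80.0, 5_000),   # fully aligned, but ANI too low
}
votus = cluster_votus(lengths, hits)
```

Here genome B joins the vOTU founded by A, while C and D each found their own. Note that C is highly similar to B but not to the centroid A, which illustrates how the result of a greedy scheme depends on the order in which genomes are considered.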
As the number of novel viruses discovered through metagenomic sequence data continues to increase at an accelerating rate, there is clearly a need for further development and evaluation of the various approaches for providing taxonomic annotations to uncultivated viruses. One possibility would be to establish genome-based phylogenies of all sequenced viral groups, encompassing both cultivated and uncultivated viruses, analogous to the Genome Taxonomy Database developed for bacteria and archaea (70). This approach could shed new light on the evolution of the virosphere and facilitate new methods for the automated taxonomic annotation of viral genomes. Another possibility would be to incorporate ecological properties of viruses, like host range and habitat distribution, to improve our understanding of the evolutionary relationships among viruses.

EXPERIMENTAL AND COMPUTATIONAL APPROACHES FOR HOST PREDICTION
Identifying the cellular host of a virus is essential for understanding a virus's ecosystem impact and for leveraging viruses in biotechnological applications such as phage therapy. A variety of experimental and computational approaches exist for uncovering host-virus interactions (reviewed in 71). Broadly, these methods differ in terms of whether they depend on cultivation, the type of sequencing data they require (e.g., single-cell, metagenome, whole-genome, or proximity ligation, such as with Hi-C), their dependence on available reference data, the taxonomic resolution of the host prediction, and their sensitivity and specificity (see Table 2). The gold standard of evidence for a virus-host relationship is culturing the two together and observing lytic or lysogenic infection via a spot assay, plaque assay, or liquid assay. However, these methods require the availability of the host or virus in pure culture and are therefore not applicable for a large fraction of the virosphere. Another experimental approach involves using flow cytometry to isolate and sequence viruses attached to or replicating inside of individual cells. Single-cell techniques have been applied to samples from human gut (55) and marine (51) environments and have revealed high-resolution phage-host interactions. Another interesting approach involves sequencing crosslinked DNA from a microbial community (e.g., meta3C), which can enable associations between the host genome and proximal or integrated mobile elements, such as viruses and plasmids (72,73).
Several computational methods utilize genomic information to predict connections between a set of viral genomes and a corresponding set of host genomes (see Table 2). The two most commonly used approaches are CRISPR (clustered regularly interspaced short palindromic repeats) matching and genomic similarity, which both depend on the availability of high-quality reference data. CRISPR matching is performed by identifying near-perfect alignments between CRISPR spacers and viral genomes, indicating a history of past infection. While this approach is accurate for assigning the host at low ranks (e.g., species), CRISPR-Cas systems are only found in ∼40% of bacteria and ∼70% of archaea (74) and can be entirely absent from certain prokaryotic lineages (75), and CRISPR arrays can be challenging to assemble from short-read data (76). Additionally, CRISPR spacers rapidly turn over in the host genome, meaning that the information encoding the virus-host linkage will be quickly lost if the relationship is not actively maintained (77-79). Host-virus genomic similarity is another commonly used approach (41,80), which is often a signature of either a recent or an ancient integration event by a temperate virus. The sensitivity and specificity of this approach (as well as the resolution of the taxonomic assignment) depend on several factors, including the alignment similarity, the length of the aligned region, and whether DNA or proteins are being compared (71). The main drawback of this approach is that it is not effective for obligate lytic viruses, which never integrate into the host genome, which is why it is recommended to combine this approach with CRISPR targeting. For example, using a combined approach, Roux et al. (41) resolved viral connections for the vast majority of prokaryotic phyla (see Figure 3). Other computational approaches include identifying integrated proviruses in microbial (meta)genomes (56, 62, 81), identifying (lagged) co-abundance patterns between viruses and hosts (82), identifying similar oligonucleotide profiles (83-85), and using computational methods that utilize viral signature genes that correspond with specific hosts (86).

Table 2 entry. CRISPR spacer match (computational): Identifies (near-)exact matches between viral genomes and CRISPR spacers found in the host genome. The CRISPR-Cas system is found in only ∼40% of bacteria and ∼70% of archaea. CRISPR arrays rapidly turn over in the environment and many spacers fail to match any mobile element. Viruses may contain anti-CRISPR proteins that inhibit the acquisition of protospacers by the host.
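The CRISPR spacer matching described above can be sketched as a search for each host's spacers, on either strand, in a viral genome. This is a deliberately naive illustration rather than a production method: real analyses typically rely on a sequence aligner, and the one-mismatch default below is an assumed stand-in for "near-perfect" rather than a threshold taken from the text. All names are hypothetical.

```python
from typing import Dict, List, Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def best_mismatches(spacer: str, genome: str) -> Optional[int]:
    """Fewest mismatches between the spacer and any same-length genome window.

    Both strands are searched. Returns None if the genome is shorter than the spacer.
    """
    best = None
    for query in (spacer, reverse_complement(spacer)):
        for i in range(len(genome) - len(query) + 1):
            window = genome[i:i + len(query)]
            mismatches = sum(1 for a, b in zip(query, window) if a != b)
            if best is None or mismatches < best:
                best = mismatches
    return best


def predict_hosts(
    spacers_by_host: Dict[str, List[str]],
    viral_genome: str,
    max_mismatches: int = 1,  # assumed tolerance for a "near-perfect" match
) -> List[str]:
    """Hosts with at least one spacer matching the viral genome."""
    genome = viral_genome.upper()
    hosts = []
    for host, spacers in spacers_by_host.items():
        for spacer in spacers:
            found = best_mismatches(spacer.upper(), genome)
            if found is not None and found <= max_mismatches:
                hosts.append(host)
                break
    return hosts


viral_genome = "ATGCGTACGTTAGCCGATAGGCTTACGATCGA"
spacers_by_host = {
    "host1": [reverse_complement("ACGTTAGCCGATAGG")],  # exact match, opposite strand
    "host2": ["TTTTTTTTTTTTTTT"],                      # unrelated spacer
    "host3": ["ACGTTAGCCGATAGC"],                      # one mismatch
}
predicted = predict_hosts(spacers_by_host, viral_genome)
```

The sliding-window comparison scales poorly to large spacer and genome collections, where an indexed aligner would be the usual choice.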

REVEALING THE ECOLOGICAL IMPACT OF THE GLOBAL VIROME
Viruses exert significant influence on the ecology of the communities in which they are found, which include most of Earth's known biomes (see Figure 4). While it is clear that viruses can directly affect the population of their microbial hosts, the various mechanisms involved are still being clarified. Evidence supporting the so-called kill-the-winner hypothesis, whereby the organisms most successful at growing within a given biological niche become the target of a larger number of viruses, has been observed in multiple studies (87,88). Our understanding of this dynamic has increased greatly as the genomic features of the viruses involved (along with their targeted hosts) are elucidated. The forces viruses exert on their hosts, through both lytic and lysogenic virus lifestyles, contribute greatly to the richness and diversity of the microbial communities of which they are a part (89). The fact that viral genes can be transferred to their hosts in a variety of ways has led some researchers to view viruses as an extended gene pool that contributes to the genetic diversity of their community (90). Proviruses in particular are known for providing genes that can eventually become part of the host's genetic repertoire. In addition, cases of coinfection, when a host is simultaneously infected by more than one virus, can lead to viral genome recombination and expanded genomic diversity.
Beyond direct effects on their hosts, and thus an influence on the overall microbial ecology of their communities, viruses play a key role in global nutrient cycles, especially in the surface of the world's lakes and oceans. When an aquatic virus kills its host, the cellular debris releases organic carbon, nitrogen, and phosphorus, which are then available for heterotrophic bacteria. This has been called the viral shunt, as it prevents the normal accumulation of organic carbon in larger organisms that graze on the microbes (91). Simultaneously, some of the host debris can form sticky aggregates that sink from the ocean surface (92). This so-called viral shuttle can lead to an increase in the long-term storage of carbon in the subsurface ocean layers and the seafloor (93,94). It has also been suggested that viral lytic activity in the aquatic surface ecosystem may have a role in stimulating carbon fixation and primary production by certain photosynthetic plankton, either via reducing the phytoplankton's predators or competitors or through the lytic release of nutrients (95). These processes merit further attention given the current worldwide interest in accurately measuring and modeling carbon and other biogeochemical cycles.

Figure: [...] with and without host assignments. (b) Phylogenetic distribution of bacterial and archaeal hosts for viruses in IMG/VR v3. For each phylum, a pie chart indicates the fraction of genomes assigned to this phylum from bulk metagenomes (red), viromes (i.e., samples that were enriched for viruses; orange), or isolate viruses (gray). The numbers next to the pie charts indicate the number of genomes from isolate viruses assigned to each phylum, if any. For viruses from viromes and metagenomes, only high-quality genomes are shown (>90% completeness). The set of isolate viruses in IMG/VR was originally obtained from NCBI (National Center for Biotechnology Information) RefSeq and GenBank. While most of these are isolates, some may not be, but that information is not recorded in IMG/VR and all are reported as isolate viral genomes here. The number of viruses with hosts for the isolate viruses in panel a includes eukaryotic viruses, which are not shown in the tree of panel b. Asterisks denote the ten phyla for which no viruses, even of medium- or low-quality genomes, have been identified. Phyla with no asterisks or pie charts next to them do have viruses with medium- or low-quality genomes that have been identified and are available through IMG/VR, but they are not shown here. The scale bar for branch lengths indicates amino acid substitutions per site. Panel b adapted from Reference 57.

Figure: Distribution of metagenome-assembled viral genomes across Earth's biomes. The maps show the geographic distribution of high-quality viral genomes from IMG/VR (Integrated Microbial Genomes/Virus) across major biomes, as defined by the GOLD (Genomes OnLine Database) ontology. High-quality viral genomes were identified using CheckV and contain more than 90% of the expected genome length.
Phages have been used in antibacterial therapy for nearly 100 years, although their utility in this role has been mostly surpassed by the much more common small-molecule antibiotics (96). However, this trend may begin to change as the rising incidence of multi-drug-resistant infections has increased the demand for novel therapies and metagenomic sequencing efforts have discovered many novel phages that could be harnessed in future therapeutics (97-99). The development of tools for predicting the therapeutic value of specific phages, as well as growing databases of known phage-host relationships, should prove useful in advancing phages as therapeutics. In addition to phage therapy, the genetic content of phage communities has been mined for novel antimicrobial protein-based therapeutics such as lysins with broad- and narrow-spectrum effects (100).
Alongside therapeutics, the specificity of phages' host targeting can also be exploited for diagnostic applications. A variety of methods have been devised to harness phage-host interactions for the detection of pathogens, from simple growth inhibition assays to engineered phages carrying reporter genes (101). Such technologies have the potential to be valuable clinical tools in the diagnosis and management of infection, and efforts to realize that promise are ongoing (102-104).
While advances in phage-based therapies may steal the headlines, it can be argued that phage applications in food and agriculture represent a potentially much broader impact on human life. Phages have been used commercially as a replacement for or adjunct to conventional pesticides on tomato and pepper crops for over a decade (105), and they are even certified for organic production (106). There is increasing interest in applying them to other crops and aquacultures (107), and the growing catalog of known phages will be a significant resource in these efforts (108,109). Beyond representing a replacement for chemical pesticides, phages have found additional uses as approved methods for reducing microbial contamination in food processing and supply chains (110,111). As the development of phage-based products continues and their adoption widens, careful thought must accompany their application, including a thorough understanding of their benefits and limitations (112).

WHERE DO WE GO FROM HERE?
It is difficult to predict where research will take us, but past developments suggest that continued efforts to unearth novel viruses and improve our understanding of viral biology will yield many impactful discoveries. Large-scale coordinated efforts to characterize novel viruses from many environments are ongoing in labs around the world, and the number of viruses remaining to be discovered is predicted to be vast, given that the rate of discovery of new vOTUs has not yet begun to plateau (41). Apace with the discovery efforts, multiple classification efforts are moving ahead with new tools and techniques propelled by the increase in sequence data. One area expected to promote further rapid growth in the field is the development and wider adoption of standards across all aspects of viral genomics research (59). The scientific community is strongly encouraged to adhere to and promote the refinement of these standards, as well as to contribute to the development of new ones. Building on the discovery, classification, and functional characterization efforts, our collective understanding of viruses' effects on global ecology will continue to be refined and used for predictive modeling (113).
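The plateau argument above can be made concrete with a collector's (accumulation) curve: the cumulative number of distinct vOTUs is tracked as samples are added, and a curve that keeps climbing indicates that much diversity remains unsampled. The Python sketch below is illustrative only. The function names, the toy vOTU identifiers, and the plateau threshold are assumptions made for this example, and it does not reproduce the analysis reported in (41).

```python
from typing import Iterable, List, Set


def accumulation_curve(samples: Iterable[Iterable[str]]) -> List[int]:
    """Return the cumulative number of distinct vOTUs after each sample."""
    seen: Set[str] = set()
    curve: List[int] = []
    for sample in samples:
        seen.update(sample)      # add this sample's vOTU identifiers
        curve.append(len(seen))  # record the running total of distinct vOTUs
    return curve


def new_votus_per_sample(curve: List[int]) -> List[int]:
    """Return how many previously unseen vOTUs each sample contributed."""
    return [after - before for before, after in zip([0] + curve, curve)]


def is_plateauing(curve: List[int], window: int = 3,
                  tolerance: float = 0.01) -> bool:
    """Heuristic check (an assumption of this sketch, not a published test).

    The curve is called plateauing when the last `window` samples together
    added less than `tolerance` (as a fraction) of the total vOTU count.
    """
    if len(curve) <= window or curve[-1] == 0:
        return False  # too little data to judge
    gain = curve[-1] - curve[-1 - window]
    return gain / curve[-1] < tolerance


if __name__ == "__main__":
    # Hypothetical vOTU identifiers recovered from four metagenomes.
    toy_samples = [
        ["v1", "v2", "v3"],
        ["v2", "v4", "v5"],
        ["v1", "v6", "v7", "v8"],
        ["v9", "v10"],
    ]
    curve = accumulation_curve(toy_samples)
    print("cumulative vOTUs:", curve)
    print("new per sample:  ", new_votus_per_sample(curve))
    print("plateauing:      ", is_plateauing(curve, window=2, tolerance=0.05))
```

In this sketch the result depends on the order in which samples are added. Published accumulation analyses typically average the curve over many random permutations of sample order, a step omitted here for brevity.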
Efforts to improve software tools for the discovery of viral sequence data in metagenomic datasets constitute an area of very active research. These developments are fueled by the application of modern data science techniques, as well as by the continued improvement of sequencing technologies. Different types of sequencing, along with creative analytic approaches, should be exploited to fill in the gaps in our coverage of viral and virus-host space. The tools and resources for viral sequence discovery will continue to adapt and promote further participation by newcomers to the field as we increase our understanding of viruses' roles in the continuing evolution of life on Earth.

SUMMARY POINTS
1. Viruses are found everywhere on Earth, and through interactions with their hosts they are major players in all ecosystems in which they are found.
2. Nonculture, sequence-based methods have exponentially increased the discovery and study of new viral lineages and functions.
3. Despite recent technology advances, vast amounts of viral diversity remain undiscovered.
4. Full utilization of the massive amounts of sequence data being generated for research presents unique challenges and requires following principles of good data management.
5. Viral genome databases that collect, curate, and support the comparative analysis of viruses are critical for advancing our understanding of the viral world.
6. Development and implementation of standards for viral genomics are of fundamental importance for data comparability and reusability from different research groups.
7. Employment of multiple complementary approaches for the identification of viruses (including virome and metagenome studies) can lead to the most comprehensive reconstruction of viral diversity in any ecosystem.

FUTURE ISSUES
1. Researchers in the field should adhere to standards for data generation, reporting, and sharing in order to maximize scientific collaboration, rigor, and reproducibility.
2. Methods for assembly of genomes from complex metagenomes should be optimized in order to achieve high-quality viral genomes.
3. New methods for the identification of novel viral genomes should be developed, with emphasis on those with limited similarity to known viruses.
4. A genome-based taxonomy of cultivated and uncultivated viruses should be constructed and curated to enable rapid and accurate taxonomic classification of newly discovered viruses.
5. Computational methods for connecting viruses to their hosts should be improved, which is especially important given that the host is not known for the vast majority of uncultivated viruses.
6. Computational methods should be used to illuminate the evolutionary arms race between viruses and their hosts, including mechanisms for host or viral defense and mechanisms of recognition (e.g., virus-host protein-protein interactions).
7. Researchers should develop a deeper understanding of the mechanistic actions of phages and viruses in Earth's ecosystems and uncover their underlying roles in controlling health and disease.