Annual Review of Biomedical Data Science - Volume 6, 2023
Single-Cell RNA Sequencing for Studying Human Cancers
Vol. 6 (2023), pp. 1–22
Since the first publication a decade ago describing the use of single-cell RNA sequencing (scRNA-seq) in the context of cancer, over 200 datasets and thousands of scRNA-seq studies have been published in cancer biology. scRNA-seq technologies have been applied across dozens of cancer types and a diverse array of study designs to improve our understanding of tumor biology, the tumor microenvironment, and therapeutic responses, and scRNA-seq is on the verge of being used to improve decision-making in the clinic. Computational methodologies and analytical pipelines are key in facilitating scRNA-seq research. Numerous computational methods utilizing the most advanced tools in data science have been developed to extract meaningful insights. Here, we review the advancements in cancer biology gained by scRNA-seq and discuss the computational challenges of the technology that are specific to cancer research.
Challenges and Opportunities for Data Science in Women's Health
Vol. 6 (2023), pp. 23–45
The intersection of women's health and data science is a field of research that has historically trailed other fields, but more recently it has gained momentum. This growth is being driven not only by new investigators who are moving into this area but also by the significant opportunities that have emerged in new methodologies, resources, and technologies in data science. Here, we describe some of the resources and methods being used by women's health researchers today to meet challenges in biomedical data science. We also describe the opportunities and limitations of applying these approaches to advance women's health outcomes and the future of the field, with emphasis on repurposing existing methodologies for women's health.
Computational Methods for Single-Cell Proteomics
Vol. 6 (2023), pp. 47–71
Advances in single-cell proteomics technologies have resulted in high-dimensional datasets comprising millions of cells that are capable of answering key questions about biology and disease. The advent of these technologies has prompted the development of computational tools to process and visualize the complex data. In this review, we outline the steps of single-cell and spatial proteomics analysis pipelines. In addition to describing available methods, we highlight benchmarking studies that have identified advantages and pitfalls of the currently available computational toolkits. As these technologies continue to advance, robust analysis tools should be developed in tandem to take full advantage of the potential biological insights provided by these data.
Statistical Learning Methods for Neuroimaging Data Analysis with Applications
Hongtu Zhu, Tengfei Li, and Bingxin Zhao
Vol. 6 (2023), pp. 73–104
The aim of this review is to provide a comprehensive survey of statistical challenges in neuroimaging data analysis, from neuroimaging techniques to large-scale neuroimaging studies and statistical learning methods. We briefly review eight popular neuroimaging techniques and their potential applications in neuroscience research and clinical translation. We delineate four themes of neuroimaging data and review major image processing analysis methods for processing neuroimaging data at the individual level. We briefly review four large-scale neuroimaging-related studies and a consortium on imaging genomics and discuss four themes of neuroimaging data analysis at the population level. We review nine major population-based statistical analysis methods and their associated statistical challenges and present recent progress in statistical methodology to address these challenges.
Strategies for the Genomic Analysis of Admixed Populations
Vol. 6 (2023), pp. 105–127
Admixed populations constitute a large portion of global human genetic diversity, yet they are often left out of genomics analyses. This exclusion is problematic, as it leads to disparities in the understanding of the genetic structure and history of diverse cohorts and the performance of genomic medicine across populations. Admixed populations pose particular statistical challenges, as they inherit genomic segments from multiple source populations, which is the primary reason they have historically been excluded from genetic studies. In recent years, however, an increasing number of statistical methods and software tools have been developed to account for and leverage admixture in the context of genomics analyses. Here, we provide a survey of such computational strategies for the informed consideration of admixture to allow for the well-calibrated inclusion of mixed ancestry populations in large-scale genomics studies, and we detail persisting gaps in existing tools.
Decoding Aging Hallmarks at the Single-Cell Level
Shuai Ma, Xu Chi, Yusheng Cai, Zhejun Ji, Si Wang, Jie Ren, and Guang-Hui Liu
Vol. 6 (2023), pp. 129–152
Organismal aging exhibits wide-ranging hallmarks in divergent cell types across tissues, organs, and systems. The advancement of single-cell technologies and generation of rich datasets have afforded the scientific community the opportunity to decode these hallmarks of aging at an unprecedented scope and resolution. In this review, we describe the technological advancements and bioinformatic methodologies enabling data interpretation at the cellular level. Then, we outline the application of such technologies for decoding aging hallmarks and potential intervention targets and summarize common themes and context-specific molecular features in representative organ systems across the body. Finally, we provide a brief summary of available databases relevant for aging research and present an outlook on the opportunities in this emerging field.
Addressing the Challenge of Biomedical Data Inequality: An Artificial Intelligence Perspective
Yan Gao, Teena Sharma, and Yan Cui
Vol. 6 (2023), pp. 153–171
Artificial intelligence (AI) and other data-driven technologies hold great promise to transform healthcare and confer the predictive power essential to precision medicine. However, the existing biomedical data, which are a vital resource and foundation for developing medical AI models, do not reflect the diversity of the human population. The low representation in biomedical data has become a significant health risk for non-European populations, and the growing application of AI opens a new pathway for this health risk to manifest and amplify. Here we review the current status of biomedical data inequality and present a conceptual framework for understanding its impacts on machine learning. We also discuss the recent advances in algorithmic interventions for mitigating health disparities arising from biomedical data inequality. Finally, we briefly discuss the newly identified disparity in data quality among ethnic groups and its potential impacts on machine learning.
An Overview of Deep Generative Models in Functional and Evolutionary Genomics
Burak Yelmen and Flora Jay
Vol. 6 (2023), pp. 173–189
Following the widespread use of deep learning for genomics, deep generative modeling is also becoming a viable methodology for the broad field. Deep generative models (DGMs) can learn the complex structure of genomic data and allow researchers to generate novel genomic instances that retain the real characteristics of the original dataset. Aside from data generation, DGMs can also be used for dimensionality reduction by mapping the data space to a latent space, as well as for prediction tasks via exploitation of this learned mapping or supervised/semi-supervised DGM designs. In this review, we briefly introduce generative modeling and two currently prevailing architectures, we present conceptual applications along with notable examples in functional and evolutionary genomics, and we provide our perspective on potential challenges and future directions.
Toward Identification of Functional Sequences and Variants in Noncoding DNA
Remo Monti and Uwe Ohler
Vol. 6 (2023), pp. 191–210
Understanding the noncoding part of the genome, which encodes gene regulation, is necessary to identify genetic mechanisms of disease and translate findings from genome-wide association studies into actionable results for treatments and personalized care. Here we provide an overview of the computational analysis of noncoding regions, starting from gene-regulatory mechanisms and their representation in data. Deep learning methods, when applied to these data, highlight important regulatory sequence elements and predict the functional effects of genetic variants. These and other algorithms are used to predict damaging sequence variants. Finally, we introduce rare-variant association tests that incorporate functional annotations and predictions in order to increase interpretability and statistical power.
A Review of and Roadmap for Data Science and Machine Learning for the Neuropsychiatric Phenotype of Autism
Vol. 6 (2023), pp. 211–228
Autism spectrum disorder (autism) is a neurodevelopmental delay that affects at least 1 in 44 children. Like many neurological disorder phenotypes, the diagnostic features are observable, can be tracked over time, and can be managed or even eliminated through proper therapy and treatments. However, there are major bottlenecks in the diagnostic, therapeutic, and longitudinal tracking pipelines for autism and related neurodevelopmental delays, creating an opportunity for novel data science solutions to augment and transform existing workflows and provide increased access to services for affected families. Several efforts previously conducted by a multitude of research labs have spawned great progress toward improved digital diagnostics and digital therapies for children with autism. We review the literature on digital health methods for autism behavior quantification and beneficial therapies using data science. We describe both case–control studies and classification systems for digital phenotyping. We then discuss digital diagnostics and therapeutics that integrate machine learning models of autism-related behaviors, including the factors that must be addressed for translational use. Finally, we describe ongoing challenges and potential opportunities for the field of autism data science. Given the heterogeneous nature of autism and the complexities of the relevant behaviors, this review contains insights that are relevant to neurological behavior analysis and digital psychiatry more broadly.
Recent Developments in Ultralarge and Structure-Based Virtual Screening Approaches
Vol. 6 (2023), pp. 229–258
Drug development is a wide scientific field that faces many challenges, among them extremely high development costs, long development times, and a small number of new drugs approved each year. New and innovative technologies are needed to solve these problems, to make the discovery of small-molecule drugs more time- and cost-efficient, and to allow previously undruggable receptor classes, such as protein–protein interactions, to be targeted. Structure-based virtual screenings (SBVSs) have become a leading contender in this context. In this review, we give an introduction to the foundations of SBVSs and survey their progress in the past few years with a focus on ultralarge virtual screenings (ULVSs). We outline key principles of SBVSs, recent success stories, new screening techniques, available deep learning–based docking methods, and promising future research directions. ULVSs have an enormous potential for the development of new small-molecule drugs and are already starting to transform early-stage drug discovery.
Human Microbiomes and Disease for the Biomedical Data Scientist
Vol. 6 (2023), pp. 259–273
The human microbiome is complex, variable from person to person, essential for health, and related to both the risk for disease and the efficacy of our treatments. There are robust techniques to describe microbiota with high-throughput sequencing, and there are hundreds of thousands of already-sequenced specimens in public archives. The promise remains to use the microbiome both as a prognostic factor and as a target for precision medicine. However, when used as an input in biomedical data science modeling, the microbiome presents unique challenges. Here, we review the most common techniques used to describe microbial communities, explore these unique challenges, and discuss the more successful approaches for biomedical data scientists seeking to use the microbiome as an input in their studies.
Virus-Derived Small RNAs and microRNAs in Health and Disease
Vol. 6 (2023), pp. 275–298
MicroRNAs (miRNAs) are short noncoding RNAs that can regulate all steps of gene expression (induction, transcription, and translation). Several virus families, primarily double-stranded DNA viruses, encode small RNAs (sRNAs), including miRNAs. These virus-derived miRNAs (v-miRNAs) help the virus evade the host's innate and adaptive immune system and maintain an environment of chronic latent infection. In this review, the functions of the sRNA-mediated virus–host interactions are highlighted, delineating their implication in chronic stress, inflammation, immunopathology, and disease. We provide insights into the latest viral RNA–based research, in particular in silico approaches for functional characterization of v-miRNAs and other RNA types. The latest research can assist in the identification of therapeutic targets to combat viral infections.
Combining Molecular and Radiomic Features for Risk Assessment in Breast Cancer
Vol. 6 (2023), pp. 299–311
Breast cancer risk is highly variable within the population and current research is leading the shift toward personalized medicine. By accurately assessing an individual woman's risk, we can reduce the risk of over/undertreatment by preventing unnecessary procedures or by elevating screening procedures. Breast density measured from conventional mammography has been established as one of the most dominant risk factors for breast cancer; however, it is currently limited by its ability to characterize more complex breast parenchymal patterns that have been shown to provide additional information to strengthen cancer risk models. Molecular factors ranging from high penetrance, or high likelihood that a mutation will show signs and symptoms of the disease, to combinations of gene mutations with low penetrance have shown promise for augmenting risk assessment. Although imaging biomarkers and molecular biomarkers have both individually demonstrated improved performance in risk assessment, few studies have evaluated them together. This review aims to highlight the current state of the art in breast cancer risk assessment using imaging and genetic biomarkers.
Single-Cell Multiomics
Vol. 6 (2023), pp. 313–337
Single-cell RNA sequencing methods have led to improved understanding of the heterogeneity and transcriptomic states present in complex biological systems. Recently, the development of novel single-cell technologies for assaying additional modalities, specifically genomic, epigenomic, proteomic, and spatial data, allows for unprecedented insight into cellular biology. While certain technologies collect multiple measurements from the same cells simultaneously, even when modalities are separately assayed in different cells, we can apply novel computational methods to integrate these data. The application of computational integration methods to multimodal paired and unpaired data results in rich information about the identities of the cells present and the interactions between different levels of biology, such as between genetic variation and transcription. In this review, we both discuss the single-cell technologies for measuring these modalities and describe and characterize a variety of computational integration methods for combining the resulting data to leverage multimodal information toward greater biological insight.
Importance of Diversity in Precision Medicine: Generalizability of Genetic Associations Across Ancestry Groups Toward Better Identification of Disease Susceptibility Variants
Vol. 6 (2023), pp. 339–356
Genome-wide association studies (GWAS) revolutionized our understanding of common genetic variation and its impact on common human disease and traits. Developed and adopted in the mid-2000s, GWAS led to searchable genotype–phenotype catalogs and genome-wide datasets available for further data mining and analysis for the eventual development of translational applications. The GWAS revolution was swift and specific, including almost exclusively populations of European descent, to the neglect of the majority of the world's genetic diversity. In this narrative review, we recount the GWAS landscape of the early years that established a genotype–phenotype catalog that is now universally understood to be inadequate for a complete understanding of complex human genetics. We then describe approaches taken to augment the genotype–phenotype catalog, including the study populations, collaborative consortia, and study design approaches aimed to generalize and then ultimately discover genome-wide associations in non-European descent populations. The collaborations and data resources established in the efforts to diversify genomic findings undoubtedly provide the foundations of the next chapters of genetic association studies with the advent of budget-friendly whole-genome sequencing.
Identification of Splice Variants and Isoforms in Transcriptomics and Proteomics
Vol. 6 (2023), pp. 357–376
Alternative splicing is pivotal to the regulation of gene expression and protein diversity in eukaryotic cells. The detection of alternative splicing events requires specific omics technologies. Although short-read RNA sequencing has successfully supported a plethora of investigations on alternative splicing, the emerging technologies of long-read RNA sequencing and top-down mass spectrometry open new opportunities to identify alternative splicing and protein isoforms with less ambiguity. Here, we summarize improvements in short-read RNA sequencing for alternative splicing analysis, including percent splicing index estimation and differential analysis. We also review the computational methods used in top-down proteomics analysis regarding proteoform identification, including the construction of databases of protein isoforms and statistical analyses of search results. While many improvements in sequencing and computational methods will result from emerging technologies, there should be future endeavors to increase the effectiveness, integration, and proteome coverage of alternative splicing events.
Gene Interactions in Human Disease Studies—Evidence Is Mounting
Vol. 6 (2023), pp. 377–395
Despite monumental advances in molecular technology to generate genome sequence data at scale, there is still a considerable proportion of heritability in most complex diseases that remains unexplained. Because many of the discoveries have been single-nucleotide variants with small to moderate effects on disease, the functional implication of many of the variants is still unknown and, thus, we have limited new drug targets and therapeutics. We, and many others, posit that one primary factor that has limited our ability to identify novel drug targets from genome-wide association studies may be due to gene interactions (epistasis), gene–environment interactions, network/pathway effects, or multiomic relationships. We propose that many of these complex models explain much of the underlying genetic architecture of complex disease. In this review, we discuss the evidence from multiple research avenues, ranging from pairs of alleles to multiomic integration studies and pharmacogenomics, that supports the need for further investigation of gene interactions (or epistasis) in genetic and genomic studies of human disease. Our goal is to catalog the mounting evidence for epistasis in genetic studies and the connections between genetic interactions and human health and disease that could enable precision medicine of the future.
Noninvasive Prenatal Testing Using Circulating DNA and RNA: Advances, Challenges, and Possibilities
Vol. 6 (2023), pp. 397–418
Prenatal screening using sequencing of circulating cell-free DNA has transformed obstetric care over the past decade and significantly reduced the number of invasive diagnostic procedures like amniocentesis for genetic disorders. Nonetheless, emergency care remains the only option for complications like preeclampsia and preterm birth, two of the most prevalent obstetrical syndromes. Advances in noninvasive prenatal testing expand the scope of precision medicine in obstetric care. In this review, we discuss advances, challenges, and possibilities toward the goal of providing proactive, personalized prenatal care. The highlighted advances focus mainly on cell-free nucleic acids; however, we also review research that uses signals from metabolomics, proteomics, intact cells, and the microbiome. We discuss ethical challenges in providing care. Finally, we look to future possibilities, including redefining disease taxonomy and moving from biomarker correlation to biological causation.
Challenges and Progress in Designing Broad-Spectrum Vaccines Against Rapidly Mutating Viruses
Vol. 6 (2023), pp. 419–441
Viruses evolve to evade prior immunity, causing significant disease burden. Vaccine effectiveness deteriorates as pathogens mutate, requiring redesign. This is a problem that has grown worse due to population increase, global travel, and farming practices. Thus, there is significant interest in developing broad-spectrum vaccines that mitigate disease severity and ideally inhibit disease transmission without requiring frequent updates. Even in cases where vaccines against rapidly mutating pathogens have been somewhat effective, such as seasonal influenza and SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), designing vaccines that provide broad-spectrum immunity against routinely observed viral variation remains a desirable but not yet achieved goal. This review highlights the key theoretical advances in understanding the interplay between polymorphism and vaccine efficacy, challenges in designing broad-spectrum vaccines, and technology advances and possible avenues forward. We also discuss data-driven approaches for monitoring vaccine efficacy and predicting viral escape from vaccine-induced protection. In each case, we consider illustrative examples in vaccine development from influenza, SARS-CoV-2, and HIV (human immunodeficiency virus), three highly prevalent, rapidly mutating viruses with distinct phylogenetics and unique histories of vaccine technology development.