Annual Review of Biomedical Data Science - Current Issue
Volume 7, 2024
Biomedical Data Science, Artificial Intelligence, and Ethics: Navigating Challenges in the Face of Explosive Growth
Vol. 7 (2024), pp. 1–14
Advances in biomedical data science and artificial intelligence (AI) are profoundly changing the landscape of healthcare. This article reviews the ethical issues that arise with the development of AI technologies, including threats to privacy, data security, consent, and justice, as they relate to donors of tissue and data. It also considers broader societal obligations, including the importance of assessing the unintended consequences of AI research in biomedicine. In addition, this article highlights the challenge of rapid AI development against the backdrop of disparate regulatory frameworks, calling for a global approach to address concerns around data misuse, unintended surveillance, and the equitable distribution of AI's benefits and burdens. Finally, a number of potential solutions to these ethical quandaries are offered. Namely, the merits of advocating for a collaborative, informed, and flexible regulatory approach that balances innovation with individual rights and public welfare, fostering a trustworthy AI-driven healthcare ecosystem, are discussed.
Computational Approaches to Drug Repurposing: Methods, Challenges, and Opportunities
Vol. 7 (2024), pp. 15–29
Drug repurposing refers to the inference of therapeutic relationships between a clinical indication and existing compounds. As an emerging paradigm in drug development, drug repurposing enables more efficient treatment of rare diseases, stratified patient populations, and urgent threats to public health. However, prioritizing well-suited drug candidates from among a nearly infinite number of repurposing options continues to represent a significant challenge in drug development. Over the past decade, advances in genomic profiling, database curation, and machine learning techniques have enabled more accurate identification of drug repurposing candidates for subsequent clinical evaluation. This review outlines the major methodologic classes that these approaches comprise, which rely on (a) protein structure, (b) genomic signatures, (c) biological networks, and (d) real-world clinical data. We propose that realizing the full impact of drug repurposing methodologies requires a multidisciplinary understanding of each method's advantages and limitations with respect to clinical practice.
Generating Clinical-Grade Gene–Disease Validity Classifications Through the ClinGen Data Platforms
Matt W. Wright, Courtney L. Thaxton, Tristan Nelson, Marina T. DiStefano, Juliann M. Savatt, Matthew H. Brush, Gloria Cheung, Mark E. Mandell, Bryan Wulf, TJ Ward, Scott Goehringer, Terry O'Neill, Phil Weller, Christine G. Preston, Ingrid M. Keseler, Jennifer L. Goldstein, Natasha T. Strande, Jennifer McGlaughon, Danielle R. Azzariti, Ineke Cordova, Hannah Dziadzio, Lawrence Babb, Kevin Riehle, Aleksandar Milosavljevic, Christa Lese Martin, Heidi L. Rehm, Sharon E. Plon, Jonathan S. Berg, Erin R. Riggs, and Teri E. Klein
Vol. 7 (2024), pp. 31–50
Clinical genetic laboratories must have access to clinically validated biomedical data for precision medicine. A lack of accessibility, normalized structure, and consistency in evaluation complicates interpretation of disease causality, resulting in confusion in assessing the clinical validity of genes and genetic variants for diagnosis. A key goal of the Clinical Genome Resource (ClinGen) is to fill the knowledge gap concerning the strength of evidence supporting the role of a gene in a monogenic disease, which is achieved through a process known as Gene–Disease Validity curation. Here we review the work of ClinGen in developing a curation infrastructure that supports the standardization, harmonization, and dissemination of Gene–Disease Validity data through the creation of frameworks and the utilization of common data standards. This infrastructure is based on several applications, including the ClinGen GeneTracker, Gene Curation Interface, Data Exchange, GeneGraph, and website.
AlphaFold and Protein Folding: Not Dead Yet! The Frontier Is Conformational Ensembles
Vol. 7 (2024), pp. 51–57
Like the black knight in the classic Monty Python movie, grand scientific challenges such as protein folding are hard to finish off. Notably, AlphaFold is revolutionizing structural biology by bringing highly accurate structure prediction to the masses and opening up innumerable new avenues of research. Despite this enormous success, calling structure prediction, much less protein folding and related problems, “solved” is dangerous, as doing so could stymie further progress. Imagine what the world would be like if we had declared flight solved after the first commercial airlines opened and stopped investing in further research and development. Likewise, there are still important limitations to structure prediction that we would benefit from addressing. Moreover, we are limited in our understanding of the enormous diversity of different structures a single protein can adopt (called a conformational ensemble) and the dynamics by which a protein explores this space. What is clear is that conformational ensembles are critical to protein function, and understanding this aspect of protein dynamics will advance our ability to design new proteins and drugs.
Human Genetics and Genomics for Drug Target Identification and Prioritization: Open Targets’ Perspective
Vol. 7 (2024), pp. 59–81
Open Targets, a consortium among academic and industry partners, focuses on using human genetics and genomics to provide insights into key questions that build therapeutic hypotheses. Large-scale experiments generate foundational data, and open-source informatic platforms systematically integrate evidence for target–disease relationships and provide dynamic tooling for target prioritization. A locus-to-gene machine learning model uses evidence from genome-wide association studies (GWAS Catalog, UK Biobank, and FinnGen), functional genomic studies, epigenetic studies, and variant effect prediction to predict potential drug targets for complex diseases. These predictions are combined with genetic evidence from gene burden analyses, rare disease genetics, somatic mutations, perturbation assays, pathway analyses, scientific literature, differential expression, and mouse models to systematically build target–disease associations (https://platform.opentargets.org). Scored target attributes such as clinical precedence, tractability, and safety guide target prioritization. Here we provide our perspective on the value and impact of human genetics and genomics for generating therapeutic hypotheses.
The Evolutionary Interplay of Somatic and Germline Mutation Rates
Vol. 7 (2024), pp. 83–105
Novel sequencing technologies are making it increasingly possible to measure the mutation rates of somatic cell lineages. Accurate germline mutation rate measurement technologies have also been available for a decade, making it possible to assess how this fundamental evolutionary parameter varies across the tree of life. Here, we review some classical theories about germline and somatic mutation rate evolution that were formulated using principles of population genetics and the biology of aging and cancer. We find that somatic mutation rate measurements, while still limited in phylogenetic diversity, seem consistent with the theory that selection to preserve the soma is proportional to life span. However, germline and somatic theories make conflicting predictions regarding which species should have the most accurate DNA repair. Resolving this conflict will require carefully measuring how mutation rates scale with time and cell division and achieving a better understanding of mutation rate pleiotropy among cell types.
Bringing the Genomic Revolution to Comparative Oncology: Human and Dog Cancers
Vol. 7 (2024), pp. 107–129
Dogs are humanity's oldest friend, the first species we domesticated 20,000–40,000 years ago. In this unequaled collaboration, dogs have inadvertently but serendipitously been molded into a potent human cancer model. Unlike many common model species, dogs are raised in the same environment as humans and present with spontaneous tumors with human-like comorbidities, immunocompetency, and heterogeneity. In breast, bladder, blood, and several pediatric cancers, in-depth profiling of dog and human tumors has established the benefits of the dog model. In addition to this clinical and molecular similarity, veterinary studies indicate that domestic dogs have relatively high tumor incidence rates. As a result, there are a plethora of data for analysis, the statistical power of which is bolstered by substantial breed-specific variability. As such, dog tumors provide a unique opportunity to interrogate the molecular factors underpinning cancer and facilitate the modeling of new therapeutic targets. This review discusses the emerging field of comparative oncology, how it complements human and rodent cancer studies, and where challenges remain, given the rapid proliferation of genomic resources. Increasingly, it appears that humanity's best friend is becoming an irreplaceable component of oncology research.
Spatially Resolved Single-Cell Omics: Methods, Challenges, and Future Perspectives
Vol. 7 (2024), pp. 131–153
Overlaying omics data onto spatial biological dimensions has been a promising technology to provide high-resolution insights into the interactome and cellular heterogeneity relative to the organization of the molecular microenvironment of tissue samples in normal and disease states. Spatial omics can be categorized into three major modalities: (a) next-generation sequencing–based assays, (b) imaging-based spatially resolved transcriptomics approaches including in situ hybridization/in situ sequencing, and (c) imaging-based spatial proteomics. These modalities allow assessment of transcripts and proteins at a cellular level, generating large and computationally challenging datasets. The lack of standardized computational pipelines to analyze and integrate these nonuniform structured data has made it necessary to apply artificial intelligence and machine learning strategies to best visualize and translate their complexity. In this review, we summarize the currently available techniques and computational strategies, highlight their advantages and limitations, and discuss their future prospects in the scientific field.
Mapping the Human Cell Surface Interactome: A Key to Decode Cell-to-Cell Communication
Vol. 7 (2024), pp. 155–177
Proteins on the surfaces of cells serve as physical connection points to bridge one cell with another, enabling direct communication between cells and cohesive structure. As biomedical research makes the leap from characterizing individual cells toward understanding the multicellular organization of the human body, the binding interactions between molecules on the surfaces of cells are foundational both for computational models and for clinical efforts to exploit these influential receptor pathways. To achieve this grander vision, we must assemble the full interactome of ways surface proteins can link together. This review investigates how close we are to knowing the human cell surface protein interactome. We summarize the current state of databases and systematic technologies to assemble surface protein interactomes, while highlighting substantial gaps that remain. We aim for this to serve as a road map for eventually building a more robust picture of the human cell surface protein interactome.
Centralized and Federated Models for the Analysis of Clinical Data
Vol. 7 (2024), pp. 179–199
The progress of precision medicine research hinges on the gathering and analysis of extensive and diverse clinical datasets. With the continued expansion of modalities, scales, and sources of clinical datasets, it becomes imperative to devise methods for aggregating information from these varied sources to achieve a comprehensive understanding of diseases. In this review, we describe two important approaches for the analysis of diverse clinical datasets, namely the centralized model and federated model. We compare and contrast the strengths and weaknesses inherent in each model and present recent progress in methodologies and their associated challenges. Finally, we present an outlook on the opportunities that both models hold for the future analysis of clinical data.
Data Science Methods for Real-World Evidence Generation in Real-World Data
Vol. 7 (2024), pp. 201–224
In the healthcare landscape, data science (DS) methods have emerged as indispensable tools to harness real-world data (RWD) from various data sources such as electronic health records, claim and registry data, and data gathered from digital health technologies. Real-world evidence (RWE) generated from RWD empowers researchers, clinicians, and policymakers with a more comprehensive understanding of real-world patient outcomes. Nevertheless, persistent challenges in RWD (e.g., messiness, voluminousness, heterogeneity, multimodality) and a growing awareness of the need for trustworthy and reliable RWE demand innovative, robust, and valid DS methods for analyzing RWD. In this article, I review some common current DS methods for extracting RWE and valuable insights from complex and diverse RWD. This article encompasses the entire RWE-generation pipeline, from study design with RWD to data preprocessing, exploratory analysis, methods for analyzing RWD, and trustworthiness and reliability guarantees, along with data ethics considerations and open-source tools. This review, tailored for an audience that may not be experts in DS, aspires to offer a systematic review of DS methods and to assist readers in selecting suitable DS methods and enhancing the process of RWE generation for addressing their specific challenges.
Harnessing Artificial Intelligence in Multimodal Omics Data Integration: Paving the Path for the Next Frontier in Precision Medicine
Vol. 7 (2024), pp. 225–250
The integration of multiomics data with detailed phenotypic insights from electronic health records marks a paradigm shift in biomedical research, offering unparalleled holistic views into health and disease pathways. This review delineates the current landscape of multimodal omics data integration, emphasizing its transformative potential in generating a comprehensive understanding of complex biological systems. We explore robust methodologies for data integration, ranging from concatenation-based to transformation-based and network-based strategies, designed to harness the intricate nuances of diverse data types. Our discussion extends from incorporating large-scale population biobanks to dissecting high-dimensional omics layers at the single-cell level. The review underscores the emerging role of large language models in artificial intelligence, anticipating their influence as a near-future pivot in data integration approaches. Highlighting both achievements and hurdles, we advocate for a concerted effort toward sophisticated integration models, fortifying the foundation for groundbreaking discoveries in precision medicine.
Disease Trajectories from Healthcare Data: Methodologies, Key Results, and Future Perspectives
Vol. 7 (2024), pp. 251–276
Disease trajectories, defined as sequential, directional disease associations, have become an intense research field driven by the availability of electronic population-wide healthcare data and sufficient computational power. Here, we provide an overview of disease trajectory studies with a focus on European work, including ontologies used as well as computational methodologies for the construction of disease trajectories. We also discuss different applications of disease trajectories from descriptive risk identification to disease progression, patient stratification, and personalized predictions using machine learning. We describe challenges and opportunities in the area that eventually will benefit from initiatives such as the European Health Data Space, which, with time, will make it possible to analyze data from cohorts comprising hundreds of millions of patients.
The Value Proposition of Coordinated Population Cohorts Across Africa
Vol. 7 (2024), pp. 277–294
Building longitudinal population cohorts in Africa for coordinated research and surveillance can influence the setting of national health priorities, lead to the introduction of appropriate interventions, and provide evidence for targeted treatment, leading to better health across the continent. However, compared to cohorts from the global north, longitudinal continental African population cohorts remain scarce, are relatively small in size, and lack data complexity. As infections and noncommunicable diseases disproportionately affect Africa's approximately 1.4 billion inhabitants, African cohorts present a unique opportunity for research and surveillance. High genetic diversity in African populations and multiomic research studies, together with detailed phenotyping and clinical profiling, will be a treasure trove for discovery. The outcomes, including novel drug targets, biological pathways for disease, and gene-environment interactions, will boost precision medicine approaches, not only in Africa but across the globe.
Computational Methods for Predicting Key Interactions in T Cell–Mediated Adaptive Immunity
Vol. 7 (2024), pp. 295–316
The adaptive immune system recognizes pathogen- and cancer-specific features and is endowed with memory, enabling it to respond quickly and efficiently to repeated encounters with the same antigens. T cells play a central role in the adaptive immune system by directly targeting intracellular pathogens and helping to activate B cells to secrete antibodies. Several fundamental protein interactions, including those between major histocompatibility complex (MHC) proteins and antigen-derived peptides as well as between T cell receptors and peptide–MHC complexes, underlie the ability of T cells to recognize antigens with great precision. Computational approaches to predict these interactions are increasingly being used for medically relevant applications, including vaccine design and prediction of patient response to cancer immunotherapies. We provide computational researchers with an accessible introduction to the adaptive immune system, review computational approaches to predict the key protein interactions underlying T cell–mediated adaptive immunity, and highlight remaining challenges.
Privacy-Enhancing Technologies in Biomedical Data Science
Vol. 7 (2024), pp. 317–343
The rapidly growing scale and variety of biomedical data repositories raise important privacy concerns. Conventional frameworks for collecting and sharing human subject data offer limited privacy protection, often necessitating the creation of data silos. Privacy-enhancing technologies (PETs) promise to safeguard these data and broaden their usage by providing means to share and analyze sensitive data while protecting privacy. Here, we review prominent PETs and illustrate their role in advancing biomedicine. We describe key use cases of PETs and their latest technical advances and highlight recent applications of PETs in a range of biomedical domains. We conclude by discussing outstanding challenges and social considerations that need to be addressed to facilitate a broader adoption of PETs in biomedical data science.
Graph Artificial Intelligence in Medicine
Vol. 7 (2024), pp. 345–368
In clinical artificial intelligence (AI), graph representation learning, mainly through graph neural networks and graph transformer architectures, stands out for its capability to capture intricate relationships and structures within clinical datasets. With diverse data, from patient records to imaging, graph AI models process data holistically by viewing modalities and entities within them as nodes interconnected by their relationships. Graph AI facilitates model transfer across clinical tasks, enabling models to generalize across patient populations without additional parameters and with minimal to no retraining. However, the importance of human-centered design and model interpretability in clinical decision-making cannot be overstated. Since graph AI models capture information through localized neural transformations defined on relational datasets, they offer both an opportunity and a challenge in elucidating model rationale. Knowledge graphs can enhance interpretability by aligning model-driven insights with medical knowledge. Emerging graph AI models integrate diverse data modalities through pretraining, facilitate interactive feedback loops, and foster human–AI collaboration, paving the way toward clinically meaningful predictions.
Mapping the Multiscale Proteomic Organization of Cellular and Disease Phenotypes
Vol. 7 (2024), pp. 369–389
While the primary sequences of human proteins have been cataloged for over a decade, determining how these are organized into a dynamic collection of multiprotein assemblies, with structures and functions spanning biological scales, is an ongoing venture. Systematic and data-driven analyses of these higher-order structures are emerging, facilitating the discovery and understanding of cellular phenotypes. At present, knowledge of protein localization and function has been primarily derived from manual annotation and curation in resources such as the Gene Ontology, which are biased toward richly annotated genes in the literature. Here, we envision a future powered by data-driven mapping of protein assemblies. These maps can capture and decode cellular functions through the integration of protein expression, localization, and interaction data across length scales and timescales. In this review, we focus on progress toward constructing integrated cell maps that accelerate the life sciences and translational research.
Employing Informatics Strategies in Alzheimer's Disease Research: A Review from Genetics, Multiomics, and Biomarkers to Clinical Outcomes
Vol. 7 (2024), pp. 391–418
Alzheimer's disease (AD) is a critical national concern, affecting 5.8 million people and costing more than $250 billion annually. However, there is no available cure. Thus, effective strategies are urgently needed to discover AD biomarkers for early disease detection and drug development. In this review, we study AD from a biomedical data scientist's perspective to discuss the four fundamental components in AD research: genetics (G), molecular multiomics (M), multimodal imaging biomarkers (B), and clinical outcomes (O) (collectively referred to as the GMBO framework). We provide a comprehensive review of common statistical and informatics methodologies for each component within the GMBO framework, accompanied by the major findings from landmark AD studies. Our review highlights the potential of multimodal biobank data in addressing key challenges in AD, such as early diagnosis, disease heterogeneity, and therapeutic development. We identify major hurdles in AD research, including data scarcity and complexity, and advocate for enhanced collaboration, data harmonization, and advanced modeling techniques. This review aims to be an essential guide for understanding current biomedical data science strategies in AD research, emphasizing the need for integrated, multidisciplinary approaches to advance our understanding and management of AD.