Annual Review of Biomedical Data Science - Volume 3, 2020
-
Deciphering Cell Fate Decision by Integrated Single-Cell Sequencing Analysis
Sagar and Dominic Grün
Vol. 3 (2020), pp. 1–22
Cellular differentiation is a common underlying feature of all multicellular organisms through which naïve cells progressively become fate restricted and develop into mature cells with specialized functions. A comprehensive understanding of the regulatory mechanisms of cell fate choices during development, regeneration, homeostasis, and disease is a central goal of modern biology. Ongoing rapid advances in single-cell biology are enabling the exploration of cell fate specification at unprecedented resolution. Here, we review single-cell RNA sequencing and sequencing of other modalities as methods to elucidate the molecular underpinnings of lineage specification. We specifically discuss how the computational tools available to reconstruct lineage trajectories, quantify cell fate bias, and perform dimensionality reduction for data visualization are providing new mechanistic insights into the process of cell fate decision. Studying cellular differentiation using single-cell genomic tools is paving the way for a detailed understanding of cellular behavior in health and disease.
-
Knowledge-Based Biomedical Data Science
Vol. 3 (2020), pp. 23–41
Knowledge-based biomedical data science involves the design and implementation of computer systems that act as if they knew about biomedicine. Such systems depend on formally represented knowledge in computer systems, often in the form of knowledge graphs. Here we survey recent progress in systems that use formally represented knowledge to address data science problems in both clinical and biological domains, as well as progress on approaches for creating knowledge graphs. Major themes include the relationships between knowledge graphs and machine learning, the use of natural language processing to construct knowledge graphs, and the expansion of novel knowledge-based approaches to clinical and biological domains.
-
Infectious Disease Research in the Era of Big Data
Vol. 3 (2020), pp. 43–59
Infectious disease research spans scales from the molecular to the global, from specific mechanisms of pathogen drug resistance, virulence, and replication to the movement of people, animals, and pathogens around the world. All of these research areas have been impacted by the recent growth of large-scale data sources and data analytics. Some of these advances rely on data or analytic methods that are common to most biomedical data science, while others leverage the unique nature of infectious disease, namely its communicability. This review outlines major research progress in the past few years and highlights some remaining opportunities, focusing on data or methodological approaches particular to infectious disease.
-
Spatial Metabolomics and Imaging Mass Spectrometry in the Age of Artificial Intelligence
Vol. 3 (2020), pp. 61–87
Spatial metabolomics is an emerging field of omics research that has enabled localizing metabolites, lipids, and drugs in tissue sections, a feat considered impossible just two decades ago. Spatial metabolomics and its enabling technology, imaging mass spectrometry, generate big hyperspectral imaging data that have motivated the development of tailored computational methods at the intersection of computational metabolomics and image analysis. Experimental and computational developments have recently opened doors to applications of spatial metabolomics in life sciences and biomedicine. At the same time, these advances have coincided with a rapid evolution in machine learning, deep learning, and artificial intelligence, which are transforming our everyday life and promise to revolutionize biology and healthcare. Here, we introduce spatial metabolomics through the eyes of a computational scientist, review the outstanding challenges, provide a look into the future, and discuss opportunities granted by the ongoing convergence of human and artificial intelligence.
-
Protein–Protein Interaction Methods and Protein Phase Separation
Vol. 3 (2020), pp. 89–112
In the last decade, newly developed experimental methods have made it possible to highlight that macromolecules in the cell milieu physically interact to support physiology. This has shifted the problem of protein–protein interaction from a microscopic, electron-density scale to a mesoscopic one. Further, nowadays there is increasing evidence that proteins in the nucleus and in the cytoplasm can aggregate in membraneless organelles for different physiological reasons. In this scenario, it is urgent to face the problem of biomolecule functional annotation with efficient computational methods, suited to extract knowledge from reliable data and transfer information across different domains of investigation. Here, we review the present state of the art of our knowledge of protein–protein interaction and the computational methods that differently implement it. Furthermore, we explore experimental and computational features of a set of proteins involved in phase separation.
-
Data Integration for Immunology
Vol. 3 (2020), pp. 113–136
Over the last several years, next-generation sequencing and its recent push toward single-cell resolution have transformed the landscape of immunology research by revealing novel complexities about all components of the immune system. With the vast amounts of diverse data currently being generated, and with the methods of analyzing and combining diverse data improving as well, integrative systems approaches are becoming more powerful. Previous integrative approaches have combined multiple data types and revealed ways that the immune system, both as a whole and as individual parts, is affected by genetics, the microbiome, and other factors. In this review, we explore the data types that are available for studying immunology with an integrative systems approach, as well as the current strategies and challenges for conducting such analyses.
-
Computational Methods for Analysis of Large-Scale CRISPR Screens
Vol. 3 (2020), pp. 137–162
Large-scale CRISPR-Cas pooled screens have shown great promise to investigate functional links between genotype and phenotype at the genome-wide scale. In addition to technological advancement, there is a need to develop computational methods to analyze the large datasets obtained from high-throughput CRISPR screens. Many computational methods have been developed to identify reliable gene hits from various screens. In this review, we provide an overview of the technology development of CRISPR screening platforms, with a focus on recent advances in computational methods to identify and model gene effects using CRISPR screen datasets. We also discuss existing challenges and opportunities for future computational methods development.
-
Computational Methods for Single-Particle Electron Cryomicroscopy
Vol. 3 (2020), pp. 163–190
Single-particle electron cryomicroscopy (cryo-EM) is an increasingly popular technique for elucidating the three-dimensional (3D) structure of proteins and other biologically significant complexes at near-atomic resolution. It is an imaging method that does not require crystallization and can capture molecules in their native states. In single-particle cryo-EM, the 3D molecular structure needs to be determined from many noisy 2D tomographic projections of individual molecules, whose orientations and positions are unknown. The high level of noise and the unknown pose parameters are two key elements that make reconstruction a challenging computational problem. Even more challenging is the inference of structural variability and flexible motions when the individual molecules being imaged are in different conformational states. This review discusses computational methods for structure determination by single-particle cryo-EM and their guiding principles from statistical inference, machine learning, and signal processing, which also play a significant role in many other data science applications.
-
Immunoinformatics: Predicting Peptide–MHC Binding
Vol. 3 (2020), pp. 191–215
Immunoinformatics is a discipline that applies methods of computer science to study and model the immune system. A fundamental question addressed by immunoinformatics is how to understand the rules of antigen presentation by MHC molecules to T cells, a process that is central to adaptive immune responses to infections and cancer. In the modern era of personalized medicine, the ability to model and predict which antigens can be presented by MHC is key to manipulating the immune system and designing strategies for therapeutic intervention. Since the MHC is both polygenic and extremely polymorphic, each individual possesses a personalized set of MHC molecules with different peptide-binding specificities, and collectively they present a unique individualized peptide imprint of the ongoing protein metabolism. Mapping all MHC allotypes is an enormous undertaking that cannot be achieved without a strong bioinformatics component. Computational tools for the prediction of peptide–MHC binding have thus become essential in most pipelines for T cell epitope discovery and an inescapable component of vaccine and cancer research. Here, we describe the development of several such tools, from pioneering efforts to the current state-of-the-art methods, that have allowed for accurate predictions of peptide binding of all MHC molecules, even including those that have not yet been characterized experimentally.
-
Analytic and Translational Genetics
Vol. 3 (2020), pp. 217–241
Understanding the influence of genetics on human disease is among the primary goals for biology and medicine. To this end, the direct study of natural human genetic variation has provided valuable insights into human physiology and disease as well as into the origins and migrations of humans. In this review, we discuss the foundations of population genetics, which provide a crucial context to the study of human genes and traits. In particular, genome-wide association studies and similar methods have revealed thousands of genetic loci associated with diseases and traits, providing invaluable information into the biology of these traits. Simultaneously, as the study of rare genetic variation has expanded, so-called human knockouts have elucidated the function of human genes and the therapeutic potential of targeting them.
-
Mobile Health Monitoring of Cardiac Status
Vol. 3 (2020), pp. 243–263
Cardiovascular diseases (CVDs) are responsible for more deaths than any other cause, with coronary heart disease and stroke accounting for two-thirds of those deaths. Morbidity and mortality due to CVD are largely preventable, through either primary prevention of disease or secondary prevention of cardiac events. Monitoring cardiac status in healthy and diseased cardiovascular systems has the potential to dramatically reduce cardiac illness and injury. Smart technology in concert with mobile health platforms is creating an environment where timely prevention of and response to cardiac events are becoming a reality.
-
Statistical Methods in Genome-Wide Association Studies
Ning Sun and Hongyu Zhao
Vol. 3 (2020), pp. 265–288
Since the initial success of genome-wide association studies (GWAS) in 2005, tens of thousands of genetic variants have been identified for hundreds of human diseases and traits. In a GWAS, genotype information at up to millions of genetic markers is collected from up to hundreds of thousands of individuals, together with their phenotype information. Several scientific goals can be accomplished through the analysis of GWAS data, including the identification of variants, genes, and pathways associated with diseases and traits of interest; the inference of the genetic architecture of these traits; and the development of genetic risk prediction models. In this review, we provide an overview of the statistical challenges in achieving these goals and recent progress in statistical methodology to address these challenges.
-
Biomedical Data Science and Informatics Challenges to Implementing Pharmacogenomics with Electronic Health Records
Vol. 3 (2020), pp. 289–314
Pharmacogenomic information must be incorporated into electronic health records (EHRs) with clinical decision support in order to fully realize its potential to improve drug therapy. Supported by various clinical knowledge resources, pharmacogenomic workflows have been implemented in several healthcare systems. Little standardization exists across these efforts, however, which limits scalability both within and across clinical sites. Limitations in information standards, knowledge management, and the capabilities of modern EHRs remain challenges for the widespread use of pharmacogenomics in the clinic, but ongoing efforts are addressing these challenges. Although much work remains to use pharmacogenomic information more effectively within clinical systems, the experiences of pioneering sites and lessons learned from those programs may be instructive for other clinical areas beyond genomics. We present a vision of what can be achieved as informatics and data science converge to enable further adoption of pharmacogenomics in the clinic.
-
Identifying Regulatory Elements via Deep Learning
Vol. 3 (2020), pp. 315–338
Deep neural networks have been revolutionizing the field of machine learning for the past several years. They have been applied with great success in many domains of the biomedical data sciences and are outperforming extant methods by a large margin. The ability of deep neural networks to pick up local image features and model the interactions between them makes them highly applicable to regulatory genomics. Instead of an image, the networks analyze DNA and RNA sequences and additional epigenomic data. In this review, we survey the successes of deep learning in the field of regulatory genomics. We first describe the fundamental building blocks of deep neural networks, popular architectures used in regulatory genomics, and their training process on molecular sequence data. We then review several key methods in different gene regulation domains. We start with the pioneering method DeepBind and its successors, which were developed to predict protein–DNA binding. We then review methods developed to predict and model epigenetic information, such as histone marks and nucleosome occupancy. Following epigenomics, we review methods to predict protein–RNA binding with its unique challenge of incorporating RNA structure information. Finally, we provide our overall view of the strengths and weaknesses of deep neural networks and prospects for future developments.
-
Computational Methods for Single-Cell RNA Sequencing
Vol. 3 (2020), pp. 339–364
Single-cell RNA sequencing (scRNA-seq) has provided a high-dimensional catalog of millions of cells across species and diseases. These data have spurred the development of hundreds of computational tools to derive novel biological insights. Here, we outline the components of scRNA-seq analytical pipelines and the computational methods that underlie these steps. We describe available methods, highlight well-executed benchmarking studies, and identify opportunities for additional benchmarking studies and computational methods. As the biochemical approaches for single-cell omics advance, we propose coupled development of robust analytical pipelines suited for the challenges that new data present and principled selection of analytical methods that are suited for the biological questions to be addressed.
-
Analysis of MRI Data in Diagnostic Neuroradiology
Vol. 3 (2020), pp. 365–390
Magnetic resonance imaging (MRI) is a noninvasive imaging tool for neuroradiological diagnosis. Numerous concepts of automated MRI analysis and the use of machine learning have been proposed to assist diagnosis and prognosis. While these academic innovations have proven effective in principle within controlled environments, their application to clinical practice has faced unmet requirements, such as the ability to perform reliably across a heterogeneous population, to work robustly in the presence of comorbidities, and to be invariant to scanner hardware and image quality. The lack of realistic confidence bounds and the inability to handle missing data have also reduced the application of most of these methods outside of academic studies. Mastering the complex challenges in the diagnostic process may help researchers discover novel biological constructs in multimodal data and improve stratification for clinical trials, paving the way for precision medicine. This review presents the state of the art of computerized brain MRI analysis for diagnostic purposes. We critically evaluate the current clinical usefulness of the methods and highlight challenges and future perspectives of the field.
-
Supercomputing and Secure Cloud Infrastructures in Biology and Medicine
Vol. 3 (2020), pp. 391–410
The increasing amounts of healthcare data stored in health registries, in combination with genomic and other types of data, have the potential to enable better decision making and pave the path for personalized medicine. However, reaping the full benefits of big, sensitive data for the benefit of patients requires greater access to data across organizations and institutions in various regions. This overview first introduces cloud computing and takes stock of the challenges to enhancing data availability in the healthcare system. Four models for ensuring higher data accessibility are then discussed. Finally, several cases are discussed that explore how enhanced access to data would benefit the end user.
-
Computational Approaches for Unraveling the Effects of Variation in the Human Genome and Microbiome
Vol. 3 (2020), pp. 411–432
The past two decades of analytical efforts have highlighted how much more remains to be learned about the human genome and, particularly, its complex involvement in promoting disease development and progression. While numerous computational tools exist for the assessment of the functional and pathogenic effects of genome variants, their precision is far from satisfactory, particularly for clinical use. Accumulating evidence also suggests that the human microbiome's interaction with the human genome plays a critical role in determining health and disease states. While numerous microbial taxonomic groups and molecular functions of the human microbiome have been associated with disease, the reproducibility of these findings is lacking. The human microbiome–genome interaction in healthy individuals is even less well understood. This review summarizes the available computational methods built to analyze the effect of variation in the human genome and microbiome. We address the applicability and precision of these methods across their possible uses. We also briefly discuss the exciting, necessary, and now possible integration of the two types of data to improve the understanding of pathogenicity mechanisms.
-
Mining Social Media Data for Biomedical Signals and Health-Related Behavior
Vol. 3 (2020), pp. 433–458
Social media data have been increasingly used to study biomedical and health-related phenomena. From cohort-level discussions of a condition to population-level analyses of sentiment, social media have provided scientists with unprecedented amounts of data to study human behavior associated with a variety of health conditions and medical treatments. Here we review recent work in mining social media for biomedical, epidemiological, and social phenomena information relevant to the multilevel complexity of human health. We pay particular attention to topics where social media data analysis has shown the most progress, including pharmacovigilance and sentiment analysis, especially for mental health. We also discuss a variety of innovative uses of social media data for health-related applications as well as important limitations of social media data access and use.