Annual Review of Biomedical Data Science - Volume 2, 2019
Discovering Pathway and Cell Type Signatures in Transcriptomic Compendia with Machine Learning
Vol. 2 (2019), pp. 1–17. Pathway and cell type signatures are patterns present in transcriptome data that are associated with biological processes or phenotypic consequences. These signatures result from specific cell type and pathway expression but can require large transcriptomic compendia to detect. Machine learning techniques can be powerful tools for signature discovery through their ability to provide accurate and interpretable results. In this review, we discuss various machine learning applications to extract pathway and cell type signatures from transcriptomic compendia. We focus on the biological motivations and interpretation for both supervised and unsupervised learning approaches in this setting. We consider recent advances, including deep learning, and their applications to expanding bulk and single-cell RNA data. As data and computational resources increase, there will be more opportunities for machine learning to aid in revealing biological signatures.
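As a concrete illustration of the unsupervised setting, the minimal sketch below factors a toy genes-by-samples expression matrix with non-negative matrix factorization and reads off candidate gene signatures; the matrix, component count, and library choice (scikit-learn) are illustrative assumptions, not methods taken from the review.

```python
# Minimal sketch: extracting candidate expression signatures with
# non-negative matrix factorization (NMF), one common unsupervised approach.
# The expression matrix and number of components are illustrative only.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Toy compendium: 200 genes x 50 samples of non-negative expression values.
expression = rng.gamma(shape=2.0, scale=1.0, size=(200, 50))

# Factor the matrix into gene-level signatures (W) and their
# per-sample activities (H): expression ~ W @ H.
model = NMF(n_components=5, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(expression)   # genes x signatures
H = model.components_                 # signatures x samples

# Genes with the largest weights in a column of W form a candidate signature.
top_genes = np.argsort(W[:, 0])[::-1][:10]
print("Top genes for signature 0:", top_genes)
```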
Genomic Data Compression
Vol. 2 (2019), pp. 19–37. Recently, there has been growing interest in genome sequencing, driven by advances in sequencing technology, in terms of both efficiency and affordability. These developments have allowed many to envision whole-genome sequencing as an invaluable tool for both personalized medical care and public health. As a result, increasingly large and ubiquitous genomic data sets are being generated. This poses a significant challenge for the storage and transmission of these data. Already, it is more expensive to store genomic data for a decade than it is to obtain the data in the first place. This situation calls for efficient representations of genomic information. In this review, we emphasize the need for designing specialized compressors tailored to genomic data and describe the main solutions already proposed. We also give general guidelines for storing these data and conclude with our thoughts on the future of genomic formats and compressors.
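As a toy illustration of why genomic data invite specialized compression (not a description of any particular compressor covered by the review), the sketch below packs an A/C/G/T string into 2 bits per base, a quarter of its plain-text size; real tools go much further with entropy coding, reference-based encoding, and models for quality scores.

```python
# Toy 2-bit packing of a DNA string: 4 bases per byte instead of 1 byte per base.
# Real genomic compressors add entropy coding, reference-based encoding, and
# quality-score models; this only illustrates the redundancy in the alphabet.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def pack(seq: str) -> bytes:
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].ljust(4, "A")  # pad the final, partial group
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, length: int) -> str:
    bases = []
    for byte in data:
        bases.extend(BASE[(byte >> shift) & 3] for shift in (6, 4, 2, 0))
    return "".join(bases[:length])

packed = pack("GATTACA")
assert unpack(packed, 7) == "GATTACA"
print(f"7 bases stored in {len(packed)} bytes")
```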
Molecular Heterogeneity in Large-Scale Biological Data: Techniques and Applications
Vol. 2 (2019), pp. 39–67. High-throughput sequencing technologies have evolved at a stellar pace for almost a decade and have greatly advanced our understanding of genome biology. In these sampling-based technologies, there is an important detail that is often overlooked in the analysis of the data and the design of the experiments, specifically that the sampled observations often do not give a representative picture of the underlying population. This has long been recognized as a problem in statistical ecology and in the broader statistics literature. In this review, we discuss the connections between these fields, methodological advances that parallel both the needs and opportunities of large-scale data analysis, and specific applications in modern biology. In the process we describe unique aspects of applying these approaches to sequencing technologies, including sequencing error, population and individual heterogeneity, and the design of experiments.
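One simple bridge between statistical ecology and sequencing data is the Good–Turing estimate of unseen probability mass: the chance that the next read comes from a molecule not yet observed is roughly the number of singletons divided by the total number of reads. The sketch below applies that estimate to a made-up table of per-molecule read counts; it is only an illustration of the ecological connection, not a method from the review.

```python
# Good-Turing estimate of the unseen probability mass: approximately f1 / n,
# where f1 is the number of molecules seen exactly once and n is the total
# number of reads. Counts below are illustrative only.
from collections import Counter

def unseen_mass(counts_per_molecule) -> float:
    n = sum(counts_per_molecule)                        # total observations
    f1 = sum(1 for c in counts_per_molecule if c == 1)  # singletons
    return f1 / n if n else 0.0

# Toy library: read counts for each distinct molecule observed so far.
counts = Counter({"mol_a": 12, "mol_b": 5, "mol_c": 1, "mol_d": 1, "mol_e": 1})
print(f"Estimated unseen fraction: {unseen_mass(counts.values()):.2f}")
```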
Connectivity Mapping: Methods and Applications
Vol. 2 (2019), pp. 69–92. Connectivity mapping resources consist of signatures representing changes in cellular state following systematic small-molecule, disease, gene, or other forms of perturbation. Such resources enable the characterization of signatures from novel perturbations based on similarity; provide a global view of the space of many themed perturbations; and allow the prediction of cellular, tissue, and organismal phenotypes for perturbagens. A signature search engine enables hypothesis generation by finding connections between query signatures and the database of signatures. This framework has been used to identify connections between small molecules and their targets, to discover cell-specific responses to perturbations and ways to reverse disease expression states with small molecules, and to predict small-molecule mimickers for existing drugs. This review provides a historical perspective and surveys the current state of connectivity mapping resources, with a focus on both methodology and community implementations.
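To make the idea of a signature search engine concrete, the sketch below ranks a database of perturbation signatures against a query by cosine similarity over a shared gene space. The vectors are random placeholders, the 978-gene space is an assumed reduced "landmark" representation, and production resources typically use rank-based connectivity scores rather than plain cosine similarity.

```python
# Minimal signature search: rank stored perturbation signatures by cosine
# similarity to a query signature over a shared gene space. Random vectors
# stand in for real differential expression profiles.
import numpy as np

rng = np.random.default_rng(1)
n_genes = 978                                  # assumed reduced gene space
database = rng.normal(size=(5000, n_genes))    # 5,000 stored signatures
query = rng.normal(size=n_genes)               # signature of a new perturbation

def cosine_scores(db: np.ndarray, q: np.ndarray) -> np.ndarray:
    db_norm = db / np.linalg.norm(db, axis=1, keepdims=True)
    q_norm = q / np.linalg.norm(q)
    return db_norm @ q_norm

scores = cosine_scores(database, query)
top = np.argsort(scores)[::-1][:10]
print("Most similar stored signatures:", top)
```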
Sketching and Sublinear Data Structures in Genomics
Vol. 2 (2019), pp. 93–118. Large-scale genomics demands computational methods that scale sublinearly with the growth of data. We review several data structures and sketching techniques that have been used in genomic analysis methods. Specifically, we focus on four key ideas that take different approaches to achieve sublinear space usage and processing time: compressed full-text indices, approximate membership query data structures, locality-sensitive hashing, and minimizer schemes. We describe these techniques at a high level and give several representative applications of each.
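As one concrete example of these ideas, the sketch below computes the (w, k)-minimizers of a DNA string: the lexicographically smallest k-mer in each window of w consecutive k-mers. Because adjacent windows usually share a minimizer, the sequence is summarized by far fewer positions than k-mers. The window and k-mer sizes are arbitrary choices for illustration.

```python
# (w, k)-minimizer scheme: for every window of w consecutive k-mers, keep the
# lexicographically smallest k-mer and its position.
def minimizers(seq: str, k: int = 5, w: int = 4):
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: window[j])  # smallest k-mer in window
        selected.add((start + offset, window[offset]))
    return sorted(selected)

seq = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
for pos, kmer in minimizers(seq):
    print(pos, kmer)
```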
Computational and Informatic Advances for Reproducible Data Analysis in Neuroimaging
Vol. 2 (2019), pp. 119–138. The reproducibility of scientific research has become a point of critical concern. We argue that openness and transparency are critical for reproducibility, and we outline an ecosystem for open and transparent science that has emerged within the human neuroimaging community. We discuss the range of open data-sharing resources that have been developed for neuroimaging data, as well as the role of data standards (particularly the brain imaging data structure) in enabling the automated sharing, processing, and reuse of large neuroimaging data sets. We outline how the open source Python language has provided the basis for a data science platform that enables reproducible data analysis and visualization. We also discuss how new advances in software engineering, such as containerization, provide the basis for greater reproducibility in data analysis. The emergence of this new ecosystem provides an example for many areas of science that are currently struggling with reproducibility.
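As a small illustration of how a data standard enables automated processing, the sketch below parses BIDS-style filenames (underscore-separated entity-value pairs plus a suffix, e.g., sub-01_task-rest_bold.nii.gz) with plain Python; the filenames are hypothetical, and real pipelines would typically rely on a dedicated library such as PyBIDS rather than hand-rolled parsing.

```python
# Parse a BIDS-style filename into its entity-value pairs and suffix.
# This is only a sketch of why the naming standard helps automation;
# dedicated tooling (e.g., PyBIDS) handles the full specification.
import re

def parse_bids_name(filename: str) -> dict:
    stem = re.sub(r"\.nii(\.gz)?$", "", filename)
    parts = stem.split("_")
    entities = dict(p.split("-", 1) for p in parts if "-" in p)
    entities["suffix"] = parts[-1] if "-" not in parts[-1] else ""
    return entities

print(parse_bids_name("sub-01_ses-02_task-rest_run-1_bold.nii.gz"))
# {'sub': '01', 'ses': '02', 'task': 'rest', 'run': '1', 'suffix': 'bold'}
```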
RNA Sequencing Data: Hitchhiker's Guide to Expression Analysis
Vol. 2 (2019), pp. 139–173. Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a wide range of applications, from basic science studies to agricultural and clinical settings. In the past 10 years or so, much has been learned about the characteristics of RNA-seq data sets, as well as the performance of the myriad methods developed. In this review, we give an overview of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on the quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.
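To give a flavor of the statistical side, the sketch below runs a deliberately simplified per-gene differential expression analysis (log-CPM normalization followed by a two-sample t-test) on simulated counts; dedicated RNA-seq methods of the kind discussed in the review instead use count models such as the negative binomial and share information across genes.

```python
# Deliberately simplified per-gene differential expression: normalize counts
# to log2 counts per million (CPM), then run a two-sample t-test per gene.
# Counts are simulated; real methods use count models with variance shrinkage.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_genes, n_per_group = 1000, 4
counts = rng.poisson(lam=50, size=(n_genes, 2 * n_per_group))  # genes x samples
counts[:20, n_per_group:] *= 3      # make the first 20 genes "differential"

# Library-size normalization to log2 CPM.
lib_sizes = counts.sum(axis=0)
log_cpm = np.log2(counts / lib_sizes * 1e6 + 0.5)

group_a = log_cpm[:, :n_per_group]
group_b = log_cpm[:, n_per_group:]
t_stat, p_values = stats.ttest_ind(group_a, group_b, axis=1)

print("Genes at p < 0.001:", np.sum(p_values < 0.001))
```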
Integrating Imaging and Omics: Computational Methods and Challenges
Vol. 2 (2019), pp. 175–197. Fluorescence microscopy imaging has long been complementary to DNA sequencing- and mass spectrometry–based omics in biomedical research, but these approaches are now converging. On the one hand, omics methods are moving from in vitro methods that average across large cell populations to in situ molecular characterization tools with single-cell sensitivity. On the other hand, fluorescence microscopy imaging has moved from a morphological description of tissues and cells to quantitative molecular profiling with single-molecule resolution. Recent technological developments underpinned by computational methods have started to blur the lines between imaging and omics and have made their direct correlation and seamless integration an exciting possibility. As this trend continues rapidly, it will allow us to create comprehensive molecular profiles of living systems with spatial and temporal context and subcellular resolution. Key to achieving this ambitious goal will be novel computational methods and successfully dealing with the challenges of data integration and sharing as well as cloud-enabled big data analysis.
Biomolecular Data Resources: Bioinformatics Infrastructure for Biomedical Data Science
Vol. 2 (2019), pp. 199–222. Technological advances have continuously driven the generation of biomolecular data and the development of bioinformatics infrastructure, which enables data reuse for scientific discovery. Several types of data management resources have arisen, such as data deposition databases, added-value databases or knowledgebases, and biology-driven portals. In this review, we provide a unique overview of the gradual evolution of these resources and discuss the goals and features that must be considered in their development. With the increasing application of genomics in the health care context and with 60 to 500 million whole genomes estimated to be sequenced by 2022, biomedical research infrastructure is transforming, too. Systems for federated access, portable tools, provision of reference data, and interpretation tools will enable researchers to derive maximal benefits from these data. Collaboration, coordination, and sustainability of data resources are key to ensure that biomedical knowledge management can scale with technology shifts and growing data volumes.
Imaging, Visualization, and Computation in Developmental Biology
Vol. 2 (2019), pp. 223–251. Embryonic development is highly complex and dynamic, requiring the coordination of numerous molecular and cellular events at precise times and places. Advances in imaging technology have made it possible to follow developmental processes at cellular, tissue, and organ levels over time as they take place in the intact embryo. Parallel innovations of in vivo probes permit imaging to report on molecular, physiological, and anatomical events of embryogenesis, but the resulting multidimensional data sets pose significant challenges for extracting knowledge. In this review, we discuss recent and emerging advances in imaging technologies, in vivo labeling, and data processing that offer the greatest potential for jointly deciphering the intricate cellular dynamics and the underlying molecular mechanisms. Our discussion of the emerging area of “image-omics” highlights both the challenges of data analysis and the promise of more fully embracing computation and data science for rapidly advancing our understanding of biology.
Scientific Discovery Games for Biomedical Research
Vol. 2 (2019), pp. 253–279. Over the past decade, scientific discovery games (SDGs) have emerged as a viable approach for biomedical research, engaging hundreds of thousands of volunteer players and resulting in numerous scientific publications. After describing the origins of this novel research approach, we review the scientific output of SDGs across molecular modeling, sequence alignment, neuroscience, pathology, cellular biology, genomics, and human cognition. We find compelling results and technical innovations arising in problem-oriented games such as Foldit and Eterna and in data-oriented games such as EyeWire and Project Discovery. We discuss emergent properties of player communities shared across different projects, including the diversity of communities and the extraordinary contributions of some volunteers, such as paper writing. Finally, we highlight connections to artificial intelligence, biological cloud laboratories, new game genres, science education, and open science that may drive the next generation of SDGs.