Biomolecular Data Resources: Bioinformatics Infrastructure for Biomedical Data Science

Jessica Vamathevan; Rolf Apweiler; Ewan Birney

doi:10.1146/annurev-biodatasci-072018-021321

Annual Review of Biomedical Data Science

Volume 2, 2019

Review Article

Jessica Vamathevan¹, Rolf Apweiler¹, and Ewan Birney¹

Affiliations: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom; email: [email protected]
Vol. 2:199-222 (Volume publication date July 2019) https://doi.org/10.1146/annurev-biodatasci-072018-021321
First published as a Review in Advance on May 15, 2019
Copyright © 2019 by Annual Reviews. All rights reserved

Abstract

Technological advances have continuously driven the generation of biomolecular data and the development of bioinformatics infrastructure, which enables data reuse for scientific discovery. Several types of data management resources have arisen, such as data deposition databases, added-value databases or knowledgebases, and biology-driven portals. In this review, we provide a unique overview of the gradual evolution of these resources and discuss the goals and features that must be considered in their development. With the increasing application of genomics in the health care context and with 60 to 500 million whole genomes estimated to be sequenced by 2022, biomedical research infrastructure is transforming, too. Systems for federated access, portable tools, provision of reference data, and interpretation tools will enable researchers to derive maximal benefits from these data. Collaboration, coordination, and sustainability of data resources are key to ensuring that biomedical knowledge management can scale with technology shifts and growing data volumes.

Keywords: bioinformatics infrastructure, biomedical data management, biomolecular databases, clinical genomics, data deposition databases, knowledgebases
