Abstract

Technological advances have continuously driven the generation of biomolecular data and the development of bioinformatics infrastructure, which enables data reuse for scientific discovery. Several types of data management resources have arisen, such as data deposition databases, added-value databases or knowledgebases, and biology-driven portals. In this review, we provide a unique overview of the gradual evolution of these resources and discuss the goals and features that must be considered in their development. With the increasing application of genomics in health care, and with an estimated 60 to 500 million whole genomes to be sequenced by 2022, biomedical research infrastructure is transforming as well. Systems for federated access, portable tools, provision of reference data, and interpretation tools will enable researchers to derive maximal benefit from these data. Collaboration, coordination, and sustainability of data resources are key to ensuring that biomedical knowledge management can scale with technology shifts and growing data volumes.

/content/journals/10.1146/annurev-biodatasci-072018-021321
2019-07-20
2024-04-15
  • Article Type: Review Article