Abstract

Biomedical data science has experienced an explosion of new data over the past decade. Abundant genetic and genomic data are increasingly available in large, diverse data sets due to the maturation of modern molecular technologies. Along with these molecular data, dense, rich phenotypic data are also available in comprehensive clinical data sets from health care provider organizations, clinical trials, population health registries, and epidemiologic studies. The methods and approaches for interrogating these large genetic/genomic and clinical data sets continue to evolve rapidly as our understanding of the questions and challenges continues to develop. In this review, I discuss state-of-the-art methodologies for genetic/genomic analysis along with complex phenomics. The field is changing and adapting to the novel data types being made available, as well as to technological advances in computation and machine learning. Thus, I also discuss the future challenges in this exciting and innovative space. The promises of precision medicine rely heavily on the ability to marry complex genetic/genomic data with clinical phenotypes in meaningful ways.

/content/journals/10.1146/annurev-biodatasci-080917-013508
2018-07-20
2024-05-22
  • Article Type: Review Article