
Abstract

With the widespread adoption of electronic health records (EHRs), large repositories of structured and unstructured patient data are becoming available to conduct observational studies. Finding patients with specific conditions or outcomes, known as phenotyping, is one of the most fundamental research problems encountered when using these new EHR data. Phenotyping forms the basis of translational research, comparative effectiveness studies, clinical decision support, and population health analyses using routinely collected EHR data. We review the evolution of electronic phenotyping, from the early rule-based methods to the cutting edge of supervised and unsupervised machine learning models. We aim to cover the most influential papers in commensurate detail, with a focus on both methodology and implementation. Finally, future research directions are explored.

/content/journals/10.1146/annurev-biodatasci-080917-013315
2018-07-20
2024-03-29

  • Article Type: Review Article