Abstract

Machine learning approaches to modeling epidemiologic data are becoming increasingly prevalent in the literature. These methods have the potential to improve our understanding of health and of opportunities for intervention far beyond our past capabilities. This article provides a walkthrough for creating supervised machine learning models, with current examples from the literature. The end-to-end approach to machine learning, from identifying an appropriate sample and selecting features through training, testing, and assessing performance, can be a daunting task. We take the reader through each step in the process and discuss novel concepts in machine learning, including identifying treatment effects and explaining the output from machine learning models.
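
The steps named in the abstract (sample, features, training, testing, performance assessment) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the article: the synthetic cohort, the two features ("age" and "exposure"), the coefficients used to simulate the outcome, the 75/25 split, and the choice of a plain logistic regression fitted by gradient descent. Only the Python standard library is used.

```python
import math
import random


def make_synthetic_cohort(n, seed=0):
    """Simulate a hypothetical cohort: two standardized features, one binary outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.gauss(0, 1)        # hypothetical standardized feature
        exposure = rng.gauss(0, 1)   # hypothetical standardized feature
        logit = 1.5 * age + 1.0 * exposure - 0.5  # assumed data-generating model
        outcome = 1 if rng.random() < sigmoid(logit) else 0
        rows.append(([age, exposure], outcome))
    return rows


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    e = math.exp(z)
    return e / (1 + e)


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle once, then hold out a test set that is never used for fitting."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def predict_proba(model, x):
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(train, learning_rate=0.1, epochs=200):
    """Fit logistic regression by batch gradient descent on the training set."""
    n_features = len(train[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in train:
            error = predict_proba((weights, bias), x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / len(train)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(train)
    return weights, bias


def evaluate(model, test, threshold=0.5):
    """Compute accuracy, sensitivity, and specificity on held-out data."""
    tp = fp = tn = fn = 0
    for x, y in test:
        predicted = 1 if predict_proba(model, x) >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return {
        "accuracy": (tp + tn) / len(test),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


cohort = make_synthetic_cohort(1000)        # 1. identify a sample
train, test = train_test_split(cohort)      # 2. hold out a test set
model = fit_logistic(train)                 # 3. train on the training set only
metrics = evaluate(model, test)             # 4. assess performance on unseen data
print(metrics)
```

The point of the sketch is the separation of stages: the model never sees the test rows during fitting, so the reported metrics estimate performance on new data rather than fit to the training sample.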

/content/journals/10.1146/annurev-publhealth-040119-094437
2020-04-01
2024-04-18
  • Article Type: Review Article