The digital world is generating data at a staggering and still increasing rate. While these “big data” have unlocked novel opportunities to understand public health, they hold still greater potential for research and practice. This review explores several key issues that have arisen around big data. First, we propose a taxonomy of sources of big data to clarify terminology and identify threads common across some subtypes of big data. Next, we consider common public health research and practice uses for big data, including surveillance, hypothesis-generating research, and causal inference, while exploring the role that machine learning may play in each use. We then consider the ethical implications of the big data revolution with particular emphasis on maintaining appropriate care for privacy in a world in which technology is rapidly changing social norms regarding the need for (and even the meaning of) privacy. Finally, we make suggestions regarding structuring teams and training to succeed in working with big data in research and practice.






  • Article Type: Review Article