Big Data in Public Health: Terminology, Machine Learning, and Privacy

The digital world is generating data at a staggering and still increasing rate. While these ‘Big Data’ have unlocked novel opportunities to understand public health, they hold still greater potential for research and practice. This review explores several key issues arising around big data. First, we propose a taxonomy of sources of big data in order to clarify terminology and identify threads common across some subtypes of big data. Next, we consider common public health research and practice uses for big data, including surveillance, hypothesis-generating research, and causal inference, while exploring the role that machine learning may play in each use. We then consider the ethical implications of the big data revolution with particular emphasis on maintaining appropriate care for privacy in a world in which technology is rapidly changing social norms regarding the need for (and even the meaning of) privacy. Finally, we make suggestions regarding structuring teams and training to succeed in working with big data in research and practice.


Introduction
As measurement techniques, data storage equipment, and the technical capacity to link disparate datasets develop, increasingly large volumes of information are available for public health research and decision-making.(11) Numerous authors have described and made predictions about the role of this 'big data' in health care, (13; 93) epidemiology, (59; 92) surveillance, (62; 113) and other aspects of population health management. (88; 95) This review first describes types of big data, then describes methods appropriate for core functions of public health: surveillance, hypothesis-generating discovery, and causal inference, and finally addresses maintaining care for privacy and structuring teams and training to succeed in working with big data.

Taxonomy of Big Data in Public Health
Most big data used by public health researchers and practitioners fits one of five descriptions. Big public health datasets usually include one or more of a) measures of participant biology, as in genomic or metabolomic datasets, b) measures of participant context, as in geospatial analyses (84; 91), c) administratively collected medical record data that incorporates more participants than would be feasible in a study limited to primary data collection,(93; 104) d) participant measurements taken automatically at extremely frequent intervals as by a GPS device or Fitbit,(39) or e) measures compiled from the 'data effluent' created by life in an electronic world, such as search term records,(67) social media postings, (7) or cell phone records. (1; 137) While data collection from each of these sources leverages emerging technologies to collect larger volumes of data than were available prior to the technological development, each form of data has fundamentally different implications for public health research and practice, as noted in Table 1. 'Wider' datasets (i.e. datasets in category (a) or (b), measuring many potentially relevant aspects of each subject at each measurement time) typically require reducing the number of dimensions in the dataset to a more interpretable number, either by selecting specific variables of greater interest for further analysis (as in selecting candidate biomarkers from a metabolomics dataset or identifying 'eigengenes'(3)) or by identifying variance patterns within these variables (as by a principal component analysis identifying patterns of gut bacteria). (125) By contrast, 'taller' datasets (i.e. categories (c) and (d)) may require more work to filter out irrelevant or low-quality observations (e.g. health records of clinical visits unrelated to the hypothesis of interest) or to condense observations into a more tractable, yet information-rich summary.(37) Effluent data offers access to constructs that have heretofore been extremely difficult to measure directly, such as social network structure (1; 49) or racial animus. (89) Each subtype of data poses unique challenges. Biological data is subject to lab effects (where one or more observations may be strongly affected by lab procedures hidden from the analyst), geospatial data is subject to autocorrelation (wherein spatial units near each other tend to be more correlated), and electronic health record data is subject to potentially large standardization and quality-related challenges. 'Effluent data', wherein a hypothesis test focuses on analyzing data not originally collected for research purposes, may require substantial attention to the way the data were initially collected (e.g. using 311 records for noise or graffiti complaints as a marker of neighborhood characteristics requires careful understanding of the factors leading residents to call 311, and whether these factors are demographically patterned). (139) Broadly, data collected automatically, as in personal monitoring and effluent data, are often of interest to behavioral researchers, but typically obscure intention, frustrating attempts at truly understanding behavior.
While this taxonomy is intended to categorize sources of big data, a given dataset may of course include more than one, as when a hospital's data warehouse includes not only electronic medical records of a given patient's visits but also the results from sequencing her whole genome. Indeed, such merged datasets may be the key to identifying etiologic links that have heretofore perplexed researchers, such as gene-environment interactions.

Big Data Surveillance using Machine Learning
Public health surveillance systems monitor trends in disease incidence, health behaviors, and environmental conditions in order to allocate resources to maintain healthy populations. (121) While some of the highest profile uses of big data for surveillance relate to effluent data (e.g. Google Flu Trends), all five categories of big data may contribute to informing authorities about the state of public health. However, the scale of these novel sources of data poses analytic challenges as well. Within the data science field, the "curse of dimensionality"(14) associated with wide datasets has been somewhat alleviated through the adoption of machine learning models, particularly in contexts where prediction or hypothesis generation rather than hypothesis-testing is the analytic goal. We review here some inroads machine learning has made in public health, with particular emphasis on surveillance, and provide a glossary of terminology as used in machine learning for public health researchers (Table 2).
Broadly, machine learning is an umbrella term for techniques that fit models algorithmically by adapting to patterns in data. These techniques can be classified as one of: a) supervised learning, b) unsupervised learning, and c) semi-supervised learning. Supervised learning is defined by identifying patterns that relate variables to measured outcomes and maximize accuracy when predicting those outcomes. For example, an automatically fitted regression model (including any form of generalized linear model) is a supervised learning technique. By contrast, unsupervised learning exploits innate properties of the input data set to detect trends and patterns without explicit designation of one column as the outcome of interest. For example, principal component analysis, which identifies underlying covariance structures in observed data, is unsupervised. Semi-supervised learning, a sort of hybrid, is used in contexts where prediction is a goal but the majority of data points are missing outcome information.(148) Semi-supervised and unsupervised methods are often used in the "data mining" phase as precursors to supervised approaches intended for prediction or more rigorous statistical analyses in a follow-up.
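The distinction between supervised and unsupervised learning described above can be made concrete with a minimal sketch. The data, function names, and the choice of a nearest-centroid classifier and a two-cluster k-means routine are illustrative assumptions on our part, not methods drawn from the cited studies. The supervised routine uses the outcome labels to learn one centroid per class; the unsupervised routine groups the same measurements without ever seeing the labels.

```python
from statistics import mean

# Hypothetical toy data: one measurement per subject, plus a known outcome label.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = [0, 0, 0, 1, 1, 1]


def fit_nearest_centroid(xs, ys):
    """Supervised: use the labels to compute one centroid per outcome class."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))


def two_means(xs, iterations=10):
    """Unsupervised: find two clusters without ever seeing the labels."""
    low, high = min(xs), max(xs)  # simple deterministic initialization
    assign = []
    for _ in range(iterations):
        assign = [0 if abs(x - low) <= abs(x - high) else 1 for x in xs]
        low = mean(x for x, a in zip(xs, assign) if a == 0)
        high = mean(x for x, a in zip(xs, assign) if a == 1)
    return assign


centroids = fit_nearest_centroid(values, labels)  # learned from labeled data
clusters = two_means(values)                      # learned from values alone
```

In this toy case the unsupervised clusters happen to coincide with the labels; with real data the cluster numbering is arbitrary and the groupings need not correspond to any measured outcome.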
While machine learning has been more broadly adopted within data science, some public health researchers and practitioners have embraced machine learning as well. For example, unsupervised learning has been used for spatial and spatio-temporal profiling,(4; 134) outbreak detection and surveillance,(38; 146) identifying patient features associated with clinical outcomes (47; 142) and environmental monitoring.(26; 65) Semi-supervised variants of existing learning algorithms (Table 3) have been utilized to build an early warning system for adverse drug reactions from social media data, (145) detect falls from smartphone data (33) and identify outlier air pollutants, (18) among other applications. Supervised learning has been used to predict hospital readmission, (32; 44) tuberculosis transmission, (87) serious injuries in motor vehicle crashes (61) and Reddit users shifting towards suicidal ideation, (28) among many other applications. Table 3 reviews some specific applications of machine learning techniques to address public health problems.

Using Machine Learning for Hypothesis Generation from Big Data
Machine learning has also been used in big data settings for hypothesis generation. Algorithmic identification of the measures associated with an outcome of interest allows researchers to focus on independent validation and interpretation of these associations in subsequent studies. Techniques to identify subsets of more strongly associated covariates, referred to within machine learning as 'feature selection', can broadly be divided into three groups: wrapper methods, filter methods, and embedded methods. Wrapper methods involve fitting machine learning models (such as those used for prediction) on different subsets of variables. Based on differences in how well models fit when variables are included, a final set of variables can be selected as the most predictive. For example, the familiar stepwise regression technique is one such wrapper method.(30; 128) By contrast, filter methods leverage conventional measures such as correlation, mutual information, or P-values from statistical tests to filter out features of lower relevance. Filter methods are often favored over wrapper methods for their simplicity and lower computational costs.(24) Finally, embedded methods embed the variable selection step into the learning algorithm. Embedded methods such as least absolute shrinkage and selection operator (LASSO), (123) elastic nets (150) and regularized trees (29) have been used to select features for the prediction of "successful aging", (54) flu trends (112) and lung cancer mortality, (63) among others. Scalable approaches to feature selection in extremely large feature spaces ("ultra-wide" data sets) constitute an active area of research.(119; 144)
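As a minimal sketch of the filter approach to feature selection described above, the following ranks candidate covariates by the absolute value of their Pearson correlation with the outcome and keeps the top k. The covariate names, the toy values, and the function names are hypothetical and chosen only for illustration; a real analysis would also need to consider multiple testing and correlated features.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def filter_select(features, outcome, k):
    """Rank features by |correlation| with the outcome and keep the top k."""
    scores = {name: abs(pearson(col, outcome)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: three candidate covariates and one outcome.
features = {
    "age": [30, 40, 50, 60, 70, 80],
    "shoe_size": [9, 7, 10, 8, 9, 8],
    "systolic_bp": [110, 118, 135, 128, 150, 155],
}
outcome = [1.0, 1.4, 2.1, 2.4, 3.2, 3.5]

selected = filter_select(features, outcome, k=2)
```

A wrapper method would instead refit a predictive model on each candidate subset of covariates, which is why filter methods are typically much cheaper to compute.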

Analysis of Big Data for Causation
Causal inference from observational data is notoriously challenging,(45) and yet remains a cornerstone of public health research, particularly epidemiology. Within the public health community, it is well known that the conditions under which an observed statistical association in observational data can be explained only as the effect of manipulating the exposure of interest cannot typically be ensured, regardless of the scale of data. (107) Moreover, confounding, selection bias, and measurement error, all common threats to valid causal inference, are independent of sample size. However, there are four key ways big data and the machine learning techniques developed in part to work with big data may improve causally-focused research.
First, novel sources of exposure data increase the availability of potential instrumental variables. In instrumental variable analyses, an upstream exposure that causes an outcome only by manipulating a downstream exposure of interest can be used to estimate the causal effect of the downstream exposure.(46) For example, it is plausible that changes to compulsory schooling laws change all-cause mortality only by affecting years of schooling completed. (79) Under this 'instrumental variable assumption', compulsory schooling laws can be used as an instrument to estimate the effect of education on all-cause mortality. Instrumental variables have been used extensively for Mendelian randomization studies (in which a genetic variant acts as the instrumental variable).(115; 131) Recent developments in analytic techniques combining estimates from multiple genetic variants, which may be considered a form of meta-analysis, are a particularly intriguing use of big data.(19; 55) However, we caution that the instrumental variable assumption for any given instrument must be considered carefully, and that assessing it requires specific background knowledge.(36) As such, proliferation of potential instruments is not in itself beneficial; it is only proliferation of valid instruments that can improve causal research.
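The instrumental variable logic above can be illustrated with a small simulation. This is a sketch under strong assumptions that we impose ourselves: a binary instrument, linear effects, a single unmeasured confounder, and a known true effect of 2.0. The ratio (Wald) estimator shown is one standard instrumental variable estimator; all variable names and parameter values are hypothetical.

```python
import random


def cov(xs, ys):
    """Sample covariance of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def simulate(n, true_effect, seed=0):
    """Simulate instrument Z, exposure X, and outcome Y with an unmeasured confounder U."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.randint(0, 1)          # instrument (e.g. a policy change)
        ui = rng.gauss(0, 1)            # unmeasured confounder
        xi = zi + ui + rng.gauss(0, 1)  # exposure depends on Z and U
        # Z affects Y only through X, which is the instrumental variable assumption.
        yi = true_effect * xi + 3 * ui + rng.gauss(0, 1)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


TRUE_EFFECT = 2.0
z, x, y = simulate(20000, TRUE_EFFECT)

naive_slope = cov(x, y) / cov(x, x)  # confounded regression slope of Y on X
iv_estimate = cov(z, y) / cov(z, x)  # ratio (Wald) instrumental variable estimate
```

In this simulation the naive slope is pulled away from the true effect by the confounder, whereas the ratio estimate is not. If the simulated instrument also affected the outcome directly, the ratio estimate would be biased as well, which is the sense in which only valid instruments help.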
Second, wider datasets with more measured covariates offer opportunities to use negative controls (76) more extensively to estimate the potential magnitude of residual confounding, measurement error, or selection bias.(8) For example, an analyst using electronic medical records to estimate the impact of BMI in early adulthood in relation to risk of adult onset diabetes might be concerned about confounding by socio-economic status (acting as a fundamental cause through health-orientation, health literacy, etc. (75)) and might control for the best available proxy measure of socio-economic status (e.g. median income in reported ZIP code). While this measure is likely imperfect and thus may leave residual confounding, she might take advantage of the breadth of outcomes available in electronic medical records that might act as negative controls by, for example, assessing whether BMI is associated with mammography screening after controlling for the socio-economic proxy. If an association exists before controlling for ZIP code median income but drops close to zero after controlling, the analyst may conclude that residual confounding due to error in her socio-economic status measure is unlikely to result in strong bias in her primary analysis because such error would need to be uncorrelated with screening status (though residual confounding can never be ruled out). The use of negative controls has been described extensively in the epidemiologic methods literature (76; 77) but remains relatively uncommon.
Third, the availability of more covariates may allow for more precise causal mediation estimates, (130) allowing stronger "causal explanation" tests of hypotheses regarding health production.(42) For example, studies exploring residential proximity to fast food as a cause of obesity (e.g. (25)) typically hypothesize that the exposure (proximity) affects the outcome (obesity) as mediated by consuming fast food. Such a study could benefit from linked GPS-based personal monitoring data that allow researchers to consider whether study subjects actually visited the fast food restaurants proximal to their residential location.
Finally, machine learning is increasingly being integrated into causal inference techniques, particularly in contexts where prediction or discovery is a component of an inferential process. For example, analysts using targeted maximum likelihood estimation (TMLE) to estimate causal treatment effects frequently use SuperLearner, an 'ensemble' supervised learning technique (i.e. one that combines estimates from multiple machine learning algorithms), as a portion of the targeting phase. In TMLE, the targeting step requires a predictive model incorporating information from covariates but imposes no functional form on that model; thus, tunable predictive models such as SuperLearner are ideal. (129) Similarly, methodologists have recently proposed techniques using machine learning to identify the strata in which a randomized intervention has the strongest effect. In this case, machine learning is being used for discovery, as an efficient search over a set of potential groupings too large for each one to be tested independently.(2; 132)

Big Data and Privacy
The proliferation and availability of big data, especially effluent data, has already fostered privacy concerns among the general public, and these concerns are expected to grow and diversify. (86) With respect to public health research and practice, big data raises three key issues: 1) the risk of inadvertent disclosure of personally identifying information (e.g. by the use of online tools(10)), 2) the potential for increasing dimensionality of data to make it difficult to determine if a dataset is sufficiently de-identified to prevent 'deductive disclosure' of personally identifying information (Figure 1), and 3) the challenge of identifying and maintaining standards of ethical research in the face of emerging technologies that may shift the generally accepted norms regarding privacy (e.g. GPS, drones, social media, etc.).
Although avoiding disclosure of study participants' private information is a key principle of research ethics mandated in the United States by the Health Insurance Portability and Accountability Act (HIPAA),(69; 97) inadvertent disclosure of personally identifying information by health researchers has occurred repeatedly.(94) Indeed, inadvertent disclosure has become increasingly commonplace as increasing volumes of personally identifying data are stored in massive data warehouses. (81) While such disclosure can result from malicious acts, it may occur more frequently owing to misunderstandings by well-meaning individuals.(94) For example, researchers may be unaware that using online geographic tools such as Google Maps to identify contextual features of subjects' neighborhoods constitutes a violation of typical Institutional Review Board conditions.(10) Similarly, researchers who report pooled counts or allele frequencies in genome-wide association studies may inadvertently reveal the presence of an individual in that study sample to anyone who knows that person's genotype.(20; 48)
Second, increasing numbers of columns of data may create a form of fingerprint such that subjects in de-identified datasets could be re-identified, a process known as deductive disclosure.(110; 126) Institutional review board terms have conventionally treated the 18 columns of data specified by the HIPAA privacy rule as the personally identifying ones (e.g. name, phone number), and often consider data derived from these identifying measures to be anonymous (e.g. mean household income among census respondents living within a 1 km radius of the subject, or a specific variant of a given SNP taken from the whole exome dataset). Formally, however, HIPAA specifies that data is considered identifiable if there is a way to identify an individual, regardless of the columns included. Merged datasets containing many columns of big data from different domains that are themselves de-identified may still combine to make subjects re-identifiable (e.g. neighborhood median income plus ADRB2 Gln27Glu variant may be sufficient to identify a subject who would not be identifiable through neighborhood median income or Gln27Glu variant alone). Figure 1 is a schematic representation of this deductive disclosure that may occur as a result of merging. Techniques to protect confidentiality in the face of data merges (see sidebar on Data Perturbation for one such example) may become a key component of future data sharing agreements, though such techniques induce precision costs.
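One way to make the risk of deductive disclosure after a merge concrete is to count how many records share each combination of values on a set of columns. The sketch below uses hypothetical records and a hypothetical helper, `smallest_group`; the underlying idea is often discussed under the name k-anonymity, a term we introduce here for orientation rather than one taken from the sources cited above.

```python
from collections import Counter


def smallest_group(records, columns):
    """Size of the smallest group of records sharing the same values on `columns`.

    A value of 1 means at least one record is unique on those columns and
    could in principle be singled out by someone who knows those values.
    """
    groups = Counter(tuple(r[c] for c in columns) for r in records)
    return min(groups.values())


# Hypothetical merged dataset: neither column is an explicit identifier.
records = [
    {"income_band": "low", "variant": "Gln/Gln"},
    {"income_band": "low", "variant": "Gln/Gln"},
    {"income_band": "low", "variant": "Gln/Glu"},
    {"income_band": "low", "variant": "Gln/Glu"},
    {"income_band": "high", "variant": "Gln/Gln"},
    {"income_band": "high", "variant": "Gln/Gln"},
    {"income_band": "high", "variant": "Gln/Gln"},
    {"income_band": "high", "variant": "Gln/Glu"},
]

k_income = smallest_group(records, ["income_band"])             # 4
k_variant = smallest_group(records, ["variant"])                # 3
k_merged = smallest_group(records, ["income_band", "variant"])  # 1
```

Each column alone leaves every subject hidden in a group of at least three, but the combination isolates a single record, which mirrors the merging scenario described above.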
Finally, in part because of changing technologies including social media, drone surveillance, and open data in general, some ethicists suggest accepted norms around privacy may change.(143; 149) Changing privacy norms have a long history: formal definitions of privacy have been inconsistent, from "the right to be let alone" (135) in 1890 to the late-1960s idea that privacy amounted to control over the information one produces (138) to more recent notions defining non-intrusion, seclusion, limitation, and control as separate categories of privacy. (120) A recurring theme in discussions of privacy, even prior to the big data era, is that the notion of ownership of information is problematic because nearly all data-producing actions, from clinical visits to social media postings to lab-based gene expression measurement, involve the work of more than one person, each of whom has created and therefore has some rights to the data.(85; 117) If anything, one constant theme regarding privacy is that no single clear definition suffices, (122) and we may expect the waters to get muddier as more people are involved in the data creation and collation process. For public health, there are no prescriptive answers; rather, we must follow and contribute to the societal discussion of privacy norms while remaining true to principles of using fair procedures to determine acceptable burdens imposed by our decisions. (58)

Big Data, Public Health Training, and Future Directions
The use of big data in public health research and practice calls for new skills to manage and analyze these data, though it does not remove the need for the skills traditionally considered part of public health training, such as statistical principles, communication, domain knowledge, and leadership.(124) However, the training and effort required to gain and maintain current knowledge of recent advances in algorithmic and statistical frameworks is non-trivial.
Two specific skills may become important to foster for all big data users. First, it may be important to develop the capacity to 'think like a computer' when working with data. For example, while it is comparatively easy for a person to guess that records showing a "Bob Smith" and "Robert Smirh" living at the same address probably represent the same person, it is a much more complex leap for a simple name-matching algorithm that naively compares one letter at a time to recognize not only that Bob is a common nickname for Robert, but also that t and r look similar in some fonts and are next to each other on a keyboard. Such 'computational thinking', wherein an analyst can recognize which problems pose greater algorithmic challenges, runs deeper than simply knowing how to program, run software, or build hardware, and has been suggested as a supplement to reading, writing, and arithmetic early in a child's life.(141) But even public health trainees without childhood computational education may benefit from being able to "think like a computer" when faced with data sets that are time- and resource-intensive. We refer the reader to important reviews (41; 83) that have concretized the two core principles in computational thinking: abstraction and automation.
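The name-matching example above can be sketched in a few lines. The naive comparison fails on "Bob Smith" versus "Robert Smirh", while a matcher that first maps nicknames to formal names and then tolerates a small number of typos succeeds. The nickname table, the one-edit threshold, and the function names are illustrative assumptions; production record linkage typically relies on probabilistic methods and far richer reference data.

```python
def naive_match(a, b):
    """Compare two strings one character at a time, position by position."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Hypothetical nickname table; a real linkage project would need a far larger one.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}


def normalize(name):
    """Lowercase the name, split it into parts, and expand known nicknames."""
    return [NICKNAMES.get(part, part) for part in name.lower().split()]


def fuzzy_match(a, b, max_edits=1):
    """Expand nicknames, then allow up to `max_edits` typos per name part."""
    parts_a, parts_b = normalize(a), normalize(b)
    return len(parts_a) == len(parts_b) and all(
        edit_distance(x, y) <= max_edits for x, y in zip(parts_a, parts_b)
    )
```

For instance, `naive_match("Bob Smith", "Robert Smirh")` is false, whereas `fuzzy_match("Bob Smith", "Robert Smirh")` is true. The abstraction (names as normalized parts) and the automation (an edit-distance rule applied uniformly) are what the human reader does implicitly.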
Second, quantitative bias analysis and related techniques will likely become a more important part of public health training, especially within epidemiology and biostatistics. As complex public health data sets become more integrated, more studies are expected to use secondary data. However, because systematic biases are harder to rule out in contexts where the investigator was not part of the data collection process, techniques that can explore the probability of incorrect inference under different assumptions of bias will be important to retain confidence in substantive conclusions. (60) Similarly, decisions about choice and evaluation of methods often involve tradeoffs between correctness on specific data points and probabilistic notions of correctness on the whole data set, e.g. gene-specific vs. genome-wide predictive models, (106) and will require deep understanding of probability and statistics.
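A simple form of quantitative bias analysis can be sketched as follows: given an observed 2x2 table and assumed values for the sensitivity and specificity of exposure classification, back-calculate the expected true counts and recompute the odds ratio. The counts and the bias parameters below are hypothetical, and the correction assumes nondifferential misclassification (the same sensitivity and specificity in cases and controls), which is our simplifying assumption rather than a claim about any particular dataset.

```python
def corrected_exposed(observed_exposed, total, sensitivity, specificity):
    """Back-calculate the expected true number exposed given classification error."""
    false_positive_rate = 1 - specificity
    return (observed_exposed - false_positive_rate * total) / (
        sensitivity + specificity - 1
    )


def odds_ratio(exposed_cases, total_cases, exposed_controls, total_controls):
    """Odds ratio from exposed counts and group totals."""
    a, b = exposed_cases, total_cases - exposed_cases
    c, d = exposed_controls, total_controls - exposed_controls
    return (a * d) / (b * c)


# Hypothetical observed 2x2 table from a secondary data source.
cases_exposed, cases_total = 200, 500
controls_exposed, controls_total = 150, 500

observed_or = odds_ratio(cases_exposed, cases_total, controls_exposed, controls_total)

# Assumed (not estimated) bias parameters, applied equally to cases and controls.
results = {}
for se, sp in [(1.0, 1.0), (0.9, 0.95), (0.8, 0.9)]:
    a = corrected_exposed(cases_exposed, cases_total, se, sp)
    c = corrected_exposed(controls_exposed, controls_total, se, sp)
    results[(se, sp)] = odds_ratio(a, cases_total, c, controls_total)
```

Repeating the calculation across a range of plausible bias parameters shows how sensitive a substantive conclusion is to assumptions the analyst cannot verify from the data alone.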
These two core skills are only a subset of the overall data science skills needed to work with public health big data, including an understanding of health informatics, data engineering, computational complexity, and adaptive learning. However, because these skills require substantial investment to master, we submit that training in more advanced data science techniques should be available but not required of public health students, analogous to other optional but important skills such as community-based health assessment.(74) This cultivation of specialized skills will necessitate diverse teams, a model already familiar to public health practitioners but less incorporated in training to date. Sidebar 3 summarizes how specialization in training has shaped bioinformatics education, which may provide a template for public health education. Numerous other perspectives on data science education may also be helpful.(40; 99) As both specialized and generalized big data skills become more common in the public health workforce, these skills should be used to optimize data collection procedures. A biostatistician comfortable with real-time data processing may be more likely to push for data-adaptive trial protocols,(6) for example, or an informatics specialist with experience using natural language processing techniques to extract data from clinician notes might help a clinician understand how to frame her notes to be most efficient for clinical and research use. Epidemiologists comfortable with stepped wedge designs(118) may be more likely to suggest them to policy makers rolling out public health initiatives. Broadly, learning new ways to work with data effectively will and should shape not only which data we will choose to collect but also how we choose to collect it.

Public Health
Appropriate use of both big data and machine learning relies on understanding several key limitations of each. First, we observe that machine learning's capacity to overcome the curse of dimensionality requires tall data sets.(43) Small and/or biased training sets can lead to overfitting (Table 2), which limits the problems that current machine learning methods can address. Second, machine learning models are often described as "black boxes" whose opacity precludes interpretability or sanity-checking of key assumptions by non-experts. (109) While recent work has partially addressed this limitation (Sidebar 4), the problem persists. Third, in some instances, observers assume that models that learn automatically from data are more objective and therefore more accurate than human-constructed models. Although data-driven models frequently can predict outcomes better than theory-driven models, data-driven model building also involves subjective decisions, such as choice of training and evaluation data sets, choice of pre-processing criteria, and choice of learning algorithms and initial parameters. These decisions cumulatively result in biases and prejudices that may be obscured from casual users.(17; 21) Fourth, data quantity often comes at the expense of quality. This is an issue for any big data analysis, but may be especially pernicious in the context of machine learning methods that use a test set to estimate prediction accuracy in the broader world. If data collection artifacts render training and test sets overly similar to each other but different from the data sets to which the model would typically be applied, overfitting may lead to unexpectedly poor prediction accuracy in the real world.(15; 23) Finally, because big data studies often require linking secondary-use data from heterogeneous sources, discrepancies between these data sources can induce biases, including demographically patterned bias (e.g. linking by name more frequently misses women who change surname after marriage).(16)

Conclusions
As the big data revolution continues, public health research and practice must continue to incorporate novel data sources and emerging analytical techniques, while contributing to knowledge, infrastructure, and methodologies and retaining a commitment to the ethical use of data. We feel this is a time to be optimistic: all five sources of big data identified in this review hold considerable potential to answer previously unanswerable questions, perhaps especially with the use of modern machine learning techniques. Such successes may arrive more quickly and more rigorously to the extent that the public health community can embrace a specialized, team science model in training and practice.

Sidebar 1. Measurement Error and Big Data
Although larger sample sizes afforded by big data reduce the probability of bias due to random error, bias due to measurement error is independent of sample size. (50; 56; 92) While some have argued the decrease in random error allows researchers to tolerate more measurement error, (88) this perspective implicitly assumes that hypothesis testing rather than estimation is an analyst's goal, a perspective which has repeatedly been rejected within the public health literature.(35; 103) Indeed, measurement error may be more problematic in big data analyses,(64) because analysts working with secondary or administrative data may not have access to knowledge about potential data artifacts. For example, metabolomic datasets are vulnerable to measurement error related to timing of sample collection,(108; 111) but if the timing of sample collection was not included in the dataset, an analyst will be unable to assess the potential impact of this error. Emerging machine learning techniques accounting for measurement error (known within that literature as 'noisy labels') may also be informative.(52; 53; 90; 96)
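The claim that bias due to measurement error does not shrink with sample size can be illustrated with a short simulation. The sketch assumes classical (independent, additive) error in the exposure measurement and a true regression slope of 1; these assumptions, the parameter values, and the function names are ours. Under them, the estimated slope is attenuated toward the null by a factor that depends on the error variance, not on the number of observations.

```python
import random


def slope(xs, ys):
    """Ordinary least squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def estimated_slope(n, error_sd, seed=0):
    """Regress y on an error-prone measurement of x; the true slope is 1."""
    rng = random.Random(seed)
    true_x = [rng.gauss(0, 1) for _ in range(n)]
    y = [x + rng.gauss(0, 1) for x in true_x]
    measured_x = [x + rng.gauss(0, error_sd) for x in true_x]
    return slope(measured_x, y)


# The estimate becomes more precise as n grows, but it converges to the
# attenuated value rather than to the true slope of 1.
for n in (1_000, 50_000):
    print(n, round(estimated_slope(n, error_sd=1.0), 2))
```

With unit variance in both the true exposure and the error, the expected attenuation factor is one half, so the larger sample simply yields a tighter estimate of the wrong quantity.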

Sidebar 2. Data Perturbation
Data perturbation is a technique in which random noise is added to potentially identifying observed variables in order to prevent study participants from being identified while attempting to minimize information loss.(57) For example, a data perturbation algorithm might replace identifying information (e.g. birthdate) with values sampled from the observed distribution of that variable. This idea has been developed extensively within the computer science data mining literature,(72; 78; 114) but relatively less explored within public health research to date (with some notable exceptions, including the National Health Interview Survey (80)).
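Two simple perturbation strategies consistent with the description above can be sketched as follows: replacing each value with a draw from the observed distribution of that variable, and adding zero-mean noise to a numeric variable. The records, the noise scale, and the function names are hypothetical, and this sketch makes no formal privacy guarantee; it only illustrates the trade-off between masking individual values and preserving aggregate information.

```python
import random


def add_noise(values, noise_sd, rng):
    """Perturb numeric values by adding zero-mean Gaussian noise."""
    return [v + rng.gauss(0, noise_sd) for v in values]


def resample_from_observed(values, rng):
    """Replace each value with a draw from the observed distribution of that variable."""
    return [rng.choice(values) for _ in values]


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical records: birth year is treated as potentially identifying.
birth_years = [1950, 1962, 1962, 1975, 1981, 1990, 1990, 2001]
incomes = [31000.0, 45000.0, 52000.0, 38000.0, 61000.0, 47000.0, 55000.0, 29000.0]

perturbed_years = resample_from_observed(birth_years, rng)
perturbed_incomes = add_noise(incomes, noise_sd=2000.0, rng=rng)
```

Larger noise scales make re-identification harder but also degrade the precision of any estimate computed from the perturbed data, which is the precision cost such techniques induce.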

Sidebar 3. Specialization in Bioinformatics Training
Bioinformatics curricula are typically framed to support three roles: (a) scientists, who use existing tools and domain expertise to develop and test hypotheses, (b) users, who consume information generated through bioinformatics research but typically do not apply the tools directly (e.g. genetic counselors, clinicians, etc.), and (c) engineers, who develop novel bioinformatics tools to address problems that may or may not be specific to a domain.(136) Although many individuals act in more than one of these roles at some point in an informatics career, identifying the core competencies of each role helps to frame the training needed to specialize in each. For example, whereas engineers require strong algorithmic and programming skills, users need only a conceptual understanding of algorithms (but require much stronger interpretive and translational skills).

Sidebar 4. Interpretability of Machine Learning Models
While interpretability is not the primary goal of machine learning, some algorithms (e.g. decision trees) are inherently more interpretable than others. Broadly, interpretation of models is an area of active research, wherein one key idea involves the separation of the predictive model and the interpretation methodology itself. For instance, a naive approach involves the post hoc ranking of features based on empirical P-values calculated against a null distribution for each feature.(71) A modification of this involves ranking features in terms of their actual values in situations where they can be interpreted as probabilities. (102) More sophisticated approaches such as LIME (105) and SHAP (82) provide general yet simple linear explanations of how features are weighted when a prediction is made, irrespective of the underlying model.

Sidebar 5. Future Directions in Machine Learning for Big Data in Public Health
There are three developments in machine learning that may be of interest to public health researchers and practitioners. First, machine learning has recently begun to formally confront outcome measurement error, (52; 53; 90; 96) particularly for datasets with a low-sensitivity outcome measure.(22; 98; 147) Second, several machine learning approaches designed for real-time prediction learn through a penalty-reward system based on feedback on their predictions rather than by fitting a model to a previously collected dataset. (140) This class of approaches, known as reinforcement learning, could be used in online data collection tools and surveillance. Finally, 'deep learning' approaches, which use large volumes of data and computational power to identify common but abstract components for automated classification (without the need for human guidance), have been used extensively in image classification and natural language processing.(68) It is expected that they will gain increased application to health data in the future as computational costs decrease.(66)

Figure 1. A schematic illustration of deductive disclosure: merging two datasets that are each successfully anonymized may result in a dataset in which subjects can be personally identified.