Designing Difference in Difference Studies: Best Practices for Public Health Policy Research

The difference in difference (DID) design is a quasi-experimental research design that researchers often use to study causal relationships in public health settings where randomized controlled trials (RCTs) are infeasible or unethical. However, causal inference poses many challenges in DID designs. In this article, we review key features of DID designs with an emphasis on public health policy research. Contemporary researchers should take an active approach to the design of DID studies, seeking to construct comparison groups, sensitivity analyses, and robustness checks that help validate the method's assumptions. We explain the key assumptions of the design and discuss analytic tactics, supplementary analyses, and approaches to statistical inference that are often important in applied research. The DID design is not a perfect substitute for randomized experiments, but it often represents a feasible way to learn about causal relationships. We conclude by noting that combining elements from multiple quasi-experimental techniques may be important in the next wave of innovations to the DID approach.

Annu. Rev. Public Health 2018. 39:453-469.


INTRODUCTION
Causal inference is a key challenge in public health policy research intended to assess past policies and help decide future priorities. The causal effects of policies and programs related to vaccines, vehicle safety, toxic substances, pollution, legal and illegal drugs, and health behaviors are difficult to measure. But scientific research and sound policy analysis demand information about causal relationships. The standard advice is to implement a randomized controlled trial (RCT) to avoid confounding and isolate treatment effects. But large-scale RCTs are rare in practice. Without an RCT, researchers often seek answers from natural experiments, including regression discontinuity designs, instrumental variables, covariate matching, and synthetic control strategies (for recent methods reviews, see 9,10,13,41). In this article, we focus on the design of quasi-experimental studies that compare the outcomes of groups exposed to different policies and environmental factors at different times. Most people describe the approach as a difference in difference (DID) design, but it is sometimes called a comparative interrupted time series design or a nonequivalent control group pretest design (6,55,92,99,105).
Regardless of nomenclature, the DID design is well established in public health research (45). It has been around since the middle of the nineteenth century, when John Snow published the results of his DID study showing that cholera is transmitted through the water supply rather than the air (97). Since Snow's study, researchers have developed tools and tactics that can strengthen the credibility of DID studies. Our goal in this article is to review principles and tools that researchers can use to design and implement a high-quality DID study. Throughout the article, we point to theoretical work and empirical examples that help clarify important techniques or challenges that are common in health research. Observing a variety of applied examples that implement these techniques is a useful complement to describing the DID challenges in the abstract.

Potential Outcomes Notation
Throughout the article, we use $g = 1, \ldots, G$ to index cross-sectional units and $t = 1, \ldots, T$ to index time periods. In DID studies, $g$ often refers to geographical areas such as states, counties [e.g., when studying the historical rollout of a food stamp program (61)], or census tracts, although it could also refer to distinct groups such as those separated by age [as used in studies of Medicare Part D (e.g., 3,65,101) or the young adult mandate of the Affordable Care Act (e.g., 91)]. Most of the time, $t$ represents years, quarters, or months. In most applications, researchers are concerned with outcomes in two alternative treatment regimes: the treatment condition and the control condition. To make the idea concrete, let $D_{gt} = 1$ if unit $g$ is exposed to treatment in period $t$, and $D_{gt} = 0$ if unit $g$ is exposed to the control condition in period $t$. In public health applications, the set of treatments might consist, for example, of two alternative approaches to the regulation of syringe exchange programs that are adopted in different states in different years (23).
Research on the causal effects of the treatment condition revolves around the outcomes that would prevail in each unit and time period under the alternative levels of treatment. One way to make this idea more tangible is to define potential outcomes that describe the same unit under different (hypothetical) treatment situations. To that end, let $Y(1)_{gt}$ represent an outcome of interest for unit $g$ in period $t$ under a hypothetical scenario in which the treatment was active in $g$ at $t$; $Y(0)_{gt}$ is the outcome of the same unit and time under the alternative scenario in which the control condition was active in $g$ at $t$. The treatment effect for this specific unit and time period is $\Delta_{gt} = Y(1)_{gt} - Y(0)_{gt}$, which is simply the difference in the value of the outcome variable for the same unit across the two hypothetical situations. The notation suggests this would be easily done, but applied researchers cannot observe the identical unit under two different scenarios as one could through a lab experiment; in practice, each unit is exposed to only one treatment condition in a specific time period, and we observe the corresponding outcome. Specifically, for a given unit and time, we observe

$$Y_{gt} = D_{gt} Y(1)_{gt} + (1 - D_{gt}) Y(0)_{gt}.$$

The notation so far describes the counterfactual inference problem that arises in every causal inference study. In a typical study, researchers have access to data on $Y_{gt}$ and $D_{gt}$, and they aim to combine the data with research design assumptions to learn about the average value of $Y(1)_{gt} - Y(0)_{gt}$ in a study population. The DID design is a quasi-experimental alternative to the well-understood and straightforward RCT design, seen for example in the health insurance context in the RAND Health Insurance Experiment in the 1970s and more recently in the Oregon Health Insurance Experiment (12, 67; see 74 for new techniques in external validity).
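The switching between potential outcomes can be sketched in a few lines of Python. This is a minimal illustration with hypothetical values; the helper name `observed` is ours, not from the article.

```python
# Minimal sketch of the switching equation: each unit-period has two
# potential outcomes, but only the one matching the realized treatment
# status is observed. All values are hypothetical.

def observed(y0, y1, d):
    """Y_gt = D_gt * Y(1)_gt + (1 - D_gt) * Y(0)_gt."""
    return d * y1 + (1 - d) * y0

y0, y1 = 10.0, 13.0   # potential outcomes for one unit and period
effect = y1 - y0      # unit-level treatment effect; never directly observed
print(observed(y0, y1, 0))  # 10.0 (control condition realized)
print(observed(y0, y1, 1))  # 13.0 (treatment condition realized)
```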
RCT and DID share some characteristics: Both involve a well-defined study population and set of treatment conditions, where it is easy to distinguish between a treatment group and a control group and between pretreatment and post-treatment time periods. The most important distinction is that treatment conditions are randomly assigned across units in an RCT but not in a DID design. Under random assignment, treatment exposure is statistically independent of any (measured or unmeasured) factor that might also affect outcomes. In a DID design, researchers cannot rely on random assignment to avoid bias from unmeasured confounders and instead impose assumptions that restrict the scope of the possible confounders. Specifically, DID designs assume that confounders varying across the groups are time invariant, and time-varying confounders are group invariant. Researchers refer to these twin claims as a common trend assumption. In the next two sections, we describe the DID design further and explain how the key assumptions of the design lead to a statistical modeling framework in which treatment effects are easy to estimate. We start with the simple two-group two-period DID model and then examine a more general design that allows for multiple groups and time periods.

Two Groups in Two Periods
The simplest form of the DID design is a special case in which there are only two groups ($g = 1, 2$) observed in two time periods ($t = 1, 2$); this situation is often represented by a $2 \times 2$ box. In the first period, both groups are exposed to the control condition. In the second period, the treatment rolls out in group 2 but not in group 1. Let $T_g = 1[g = 2]$ be a dummy variable identifying observations on group 2. $T_g$ has no time subscript because group membership is time invariant. $P_t = 1[t = 2]$ indicates observations from period 2, and $P_t$ has no group subscript because the time period does not vary across the groups. In the simple DID, the treatment variable is the product of these two dummy variables: $D_{gt} = T_g \times P_t$. It is easy to see the connection between the description of the design and the notation. For example, $D_{gt} = 0$ for both groups in the first period because $P_t = 0$, and $D_{gt} = 1$ only for group 2 in period 2 because that is the only way that both $T_g$ and $P_t$ are equal to 1.
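The indicator logic can be written out directly. The sketch below is a hypothetical helper, not code from the article:

```python
# Sketch: encoding the 2x2 DID design. T_g marks group 2, P_t marks
# period 2, and the treatment indicator is their product,
# D_gt = T_g * P_t. The helper name is ours, not from the article.

def did_indicators(g, t):
    """Return (T_g, P_t, D_gt) for a two-group, two-period design."""
    T = 1 if g == 2 else 0
    P = 1 if t == 2 else 0
    return T, P, T * P

# D_gt = 1 only for group 2 in period 2
cells = {(g, t): did_indicators(g, t)[2] for g in (1, 2) for t in (1, 2)}
print(cells)  # {(1, 1): 0, (1, 2): 0, (2, 1): 0, (2, 2): 1}
```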
In the two-group two-period DID design, the common trend assumption amounts to a simple statistical model of the treated and untreated potential outcomes. Under the simple DID, the untreated potential outcome is $Y(0)_{gt} = \beta_0 + \beta_1 T_g + \beta_2 P_t + \epsilon_{gt}$. In the absence of treatment, the average outcome in group 1 is $\beta_0$ in period 1 and $\beta_0 + \beta_2$ in period 2. Likewise, the average untreated outcome in group 2 is equal to $\beta_0 + \beta_1$ in period 1 and $\beta_0 + \beta_1 + \beta_2$ in period 2. Under the common trend assumption, the coefficient on $T_g$ captures the time-invariant difference in outcomes between the two groups. Implicitly, the group coefficient captures the combined effects of all unmeasured covariates that differ systematically between the two groups and that do not change over the course of the study period. In a similar manner, the coefficient on $P_t$ captures the combined effects of any unmeasured covariates that change between the two periods but affect outcomes the same way in both groups. In practice, researchers call $\beta_1$ the group effect and $\beta_2$ the time trend. The model for the treated potential outcome is the untreated outcome plus a treatment effect, which is usually restricted to be constant across observations: $Y(1)_{gt} = Y(0)_{gt} + \beta_3$. The two potential outcome specifications combine with the treatment indicator to produce realized outcomes according to the general formula

$$Y_{gt} = Y(0)_{gt} + D_{gt}[Y(1)_{gt} - Y(0)_{gt}].$$

Replacing the potential outcomes with the model specification gives $Y_{gt} = \beta_0 + \beta_1 T_g + \beta_2 P_t + \epsilon_{gt} + D_{gt}[Y(0)_{gt} + \beta_3 - Y(0)_{gt}]$. In the two-group two-period setting, $D_{gt} = T_g \times P_t$, which means that after canceling the $Y(0)_{gt}$ terms we can rewrite the observed outcome equation in terms of the group and time period indicators to obtain the standard DID estimating equation:

$$Y_{gt} = \beta_0 + \beta_1 T_g + \beta_2 P_t + \beta_3 (T_g \times P_t) + \epsilon_{gt}.$$

The model is easy to estimate with data on outcomes, group membership, and time periods.
The coefficient on the interaction term is an estimate of the treatment effect under the common trend assumption.
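Under the common trend assumption, the interaction coefficient equals the familiar difference of differences in cell means. The sketch below uses hypothetical numbers:

```python
# Sketch: the 2x2 DID estimate as a difference of differences in cell
# means. The data are hypothetical; in the regression form, the same
# number is the coefficient on the interaction T_g x P_t.

# mean outcomes by (group, period); group 2 is treated in period 2
means = {
    (1, 1): 10.0, (1, 2): 12.0,   # control group: trend of +2
    (2, 1): 14.0, (2, 2): 19.0,   # treated group: trend of +2, plus effect
}

def did_estimate(means):
    change_treated = means[(2, 2)] - means[(2, 1)]
    change_control = means[(1, 2)] - means[(1, 1)]
    return change_treated - change_control

print(did_estimate(means))  # 3.0
```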

Multiple Groups and Time Periods
The two-group two-period DID design is intuitive, but it does not accommodate the complexity encountered in applications, which often involve treatment exposures in multiple groups and multiple time periods. An example is the state adoption of medical marijuana laws, which remains an active area of state policy. Research in this area includes a study by Harper et al. (57), who reexamined earlier research that did not include state fixed effects, and one by Anderson et al. (4), who incorporated more DID techniques. Fortunately, the main features of the DID design also apply in a broader set of conditions. With $G \geq 2$ groups and $T \geq 2$ periods, $D_{gt} = 1$ if the treatment is active in group $g$ and period $t$; otherwise, $D_{gt} = 0$. As in the two-group two-period case, the core assumption in the generalized DID is that any unmeasured determinants of the outcomes are either time invariant or group invariant.
The generalized design is easy to analyze using a two-way fixed effects regression model to describe the potential outcomes. The model for the untreated outcome is $Y(0)_{gt} = a_g + b_t + \epsilon_{gt}$. In the model, $a_g$ represents the combined effects of the time-invariant characteristics of group $g$, and $b_t$ represents the combined effects of the time-varying but group-invariant factors.¹ The average untreated outcome for group 3 in period 5 is given by $a_3 + b_5$. Likewise, the average untreated outcome for group 4 in period 5 is $a_4 + b_5$. The two groups have different levels in every period, but any changes over time within a group come from the group-invariant trend terms described by $b_t$. Researchers call $a_g$ a group fixed effect and $b_t$ a time fixed effect. The time fixed effects trace out the common time trend. A key point is that the group effects and time trends stem from underlying differences in unmeasured covariates across groups and time periods. The DID design is meant to control for these unmeasured confounders even though the underlying variables are not measured explicitly.

¹It may be more revealing to think of $a_g = x_g \alpha$, where $x_g$ is a vector of time-invariant covariates associated with group $g$ and $\alpha$ is a coefficient vector. Likewise, we can think of $b_t = z_t \gamma$, where $z_t$ is a vector of time-varying but group-invariant covariates, and $\gamma$ is a coefficient vector. In practice, $x_g$ and $z_t$ are unmeasured, and we do not attempt to estimate each of the covariate-specific coefficients. Instead, we estimate or eliminate the combined effects of all covariates using fixed effects differencing techniques.
Like the two-group two-period design, the generalized DID also specifies that the treated outcome is a shifted version of the untreated outcome, so that $Y(1)_{gt} = Y(0)_{gt} + \delta$. Combining the equations shows that the observed outcome is

$$Y_{gt} = Y(0)_{gt} + D_{gt}[Y(1)_{gt} - Y(0)_{gt}].$$

Substitute the fixed effects structure for the potential outcomes and cancel the remaining $Y(0)_{gt}$ terms to find the generalized DID estimating equation:

$$Y_{gt} = a_g + b_t + \delta D_{gt} + \epsilon_{gt}.$$

The two-way fixed effects parameterization stems from the same common trend assumption involved in the two-group two-period DID, but it accommodates considerably more variation in the details of the research design. In practice, researchers estimate the treatment effect parameter, $\delta$, using fixed effects regression models; they simply regress the observed outcome on the treatment variable and a full set of group and time fixed effects. For an example, see the main specification in Bitler & Carpenter (21).
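For intuition, the fixed effects logic can be sketched without a regression package: in a balanced panel, regressing the outcome on the treatment after removing group means and period means (double demeaning) recovers $\delta$. The simulation below is a minimal, noise-free illustration with made-up numbers, not the authors' code:

```python
# Sketch: the two-way fixed effects estimator via double demeaning.
# With a balanced panel, regressing Y on D plus group and time fixed
# effects is equivalent to OLS on two-way demeaned variables. The data
# are simulated with no error term, so the estimator recovers delta
# exactly. All numbers are hypothetical.

groups, periods = [1, 2, 3], [1, 2, 3, 4]
a = {1: 5.0, 2: 8.0, 3: 2.0}          # group fixed effects
b = {1: 0.0, 2: 1.0, 3: 1.5, 4: 3.0}  # common time trend
delta = 2.5                            # treatment effect

# groups 2 and 3 adopt the policy starting in period 3
D = {(g, t): 1 if (g in (2, 3) and t >= 3) else 0
     for g in groups for t in periods}
Y = {(g, t): a[g] + b[t] + delta * D[(g, t)]
     for g in groups for t in periods}

def twoway_fe(Y, D, groups, periods):
    n_g, n_t = len(groups), len(periods)
    def demean(X):
        gbar = {g: sum(X[(g, t)] for t in periods) / n_t for g in groups}
        tbar = {t: sum(X[(g, t)] for g in groups) / n_g for t in periods}
        grand = sum(X.values()) / (n_g * n_t)
        return {(g, t): X[(g, t)] - gbar[g] - tbar[t] + grand
                for g in groups for t in periods}
    Yd, Dd = demean(Y), demean(D)
    num = sum(Yd[k] * Dd[k] for k in Y)
    den = sum(Dd[k] ** 2 for k in D)
    return num / den

print(round(twoway_fe(Y, D, groups, periods), 6))  # 2.5
```

The group and time effects drop out of the demeaned variables, which is the differencing logic described above in miniature.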

The Common Trends Assumption
Both the simple and generalized DID designs rely on the assumption that the important unmeasured variables are either time-invariant group attributes or time-varying factors that are group invariant. Together, these restrictions imply that the time series of outcomes in each group should differ by a fixed amount in every period and should exhibit a common set of period-specific changes. Loosely speaking, a graph of the time series should look like a set of parallel lines. For an example, see the graphs in Kaestner et al. (64) of the treatment group and synthetic control group trends among low-educated adults prior to state Medicaid expansion, or the figures in other Medicaid expansion studies (e.g., 96). Note that parallel lines do not have to be linear: Time-fixed effects allow for flexible time trends that move up or down from period to period, as they do, for example, in the study of Sommers et al. (100), who examine state Medicaid expansions using low-income adults in nonexpansion states as a control group.
In applied work, the most difficult task is evaluating the credibility of the common trends assumption. Later in the article, we discuss statistical tests and graphical analyses that researchers can use to empirically probe the credibility of the assumption. Researchers, however, must also think carefully about the conceptual reasons for which the common trends assumption might be valid in some settings and not in others. It may be helpful to interpret the common trends assumption as a byproduct of a set of underlying variables that differ across states and change over time. Consider the case of vaccine policy (a topic studied, for example, in 102). Instead of asking the abstract question of whether vaccination rates in two states are apt to follow a common time trend absent the policy, we could ask what sorts of (unmeasured) factors likely explain variation in vaccination rates across states and over time, such as parental attitudes. Next, we would ask whether those factors are likely covered by the DID design: Are they time-invariant group attributes or group-invariant time-varying factors? Naming the unmeasured variables that the fixed effects structure is intended to capture is a good way to assess the quality of a DID design, because it is easier to construct and evaluate arguments for and against specific variables than for abstract trends that arise from unknown origins.
Being specific about unmeasured variables often points the way to stronger research designs as well. Perhaps it makes sense to exclude certain groups from the analysis if they seem likely to differ from the others with respect to the important unmeasured variable. A version of this argument is used in forming synthetic control groups, where groups that differ in past characteristics compared to the treatment groups are excluded or given less weight when forming the control group for a single difference (as is done in 1, which forms a synthetic California from a weighted average of potential control states that do not have tobacco control programs; for longer reviews of synthetic control methods, see 10,47,77). The common trends assumption may hold in a restricted sample of groups or time periods even if it does not hold across all groups and times. This line of thinking is the starting point for combined research designs in which researchers use propensity score matching in a first step and then estimate treatment effects using DID methods on the matched sample [as was done, for example, in studying health effects of employment transitions in Germany (50); for use of DID and synthetic control methods together, see also 54].

Strict Exogeneity
The DID design aims to difference out unmeasured confounders using techniques that eliminate biases from group-or time-invariant factors. For the differencing technique at the core of the method to work, the timing of treatment exposures in the DID design must be statistically independent of the potential outcome distributions, conditional on the group-and time-fixed effects. This aspect of the design is harder to understand. Econometrics textbooks use the term "strict exogeneity" to describe it, pointing out that it is stronger than "contemporaneous exogeneity," which is the foundational assumption in studies based on propensity score matching and cross-sectional regression adjustment.
To better understand the distinction, suppose that $a_g$ and $b_t$ are functions of vectors of the underlying covariates $x_g$ and $z_t$. A researcher who collects data on each covariate might estimate the causal effect of $D_{gt}$ on $Y_{gt}$ under the conditional mean independence assumption that

$$E[Y(0)_{gt} \mid D_{gt}, x_g, z_t] = E[Y(0)_{gt} \mid x_g, z_t].$$

To put this idea into practice, the researcher might form matched pairs of treated and control observations and estimate the treatment effect using the mean difference in outcomes in the matched sample, as, for example, Obermeyer et al. (83) did in studying Medicare's hospice benefit. The situation is different in the DID design. To remove confounding using differencing, the entire sequence of past and future treatment exposures must be independent of unmeasured determinants of the outcome variables. Formally, strict exogeneity requires that

$$E[Y(0)_{gt} \mid D_{g1}, \ldots, D_{gT}, a_g, b_t] = E[Y(0)_{gt} \mid a_g, b_t].$$

The idea is that, after conditioning on the group and period effects, treatment exposures that occur at $t + 1$ are not anticipated by outcomes measured in an earlier period such as $t$. The restriction could fail in practice for many reasons. Perhaps states change their regulations in response to changes in the outcome variable of interest (19), or perhaps companies change their behavior in anticipation of a regulation that seems likely to occur in the near future. Such behavioral patterns almost certainly occur in the real world, and they represent important threats to the validity of DID designs. One way that researchers investigate such effects is to move the policy variable to the left-hand side of the regression and show that the factors of greatest concern do not predict the passage of the law. Some studies use these specifications to show that political variables are influential, and to the extent that they can be considered exogenous, they could be used as instruments for the policy (71).

SENSITIVITY ANALYSIS AND ROBUSTNESS CHECKS OF THE COMMON TRENDS ASSUMPTION
Modern applications of the DID design devote much attention to sensitivity analysis and robustness checks designed to probe the main assumptions that support the internal validity of the research design. Although the specific details involved vary with the context and data limitations of individual studies, this section provides a short summary of the analytical techniques researchers use to shed light on the validity of the common trends assumption and threats to the strict exogeneity condition.

Graphical Evidence
In the simple two-group two-period DID, the common trend assumption is not testable. In settings with multiple pretreatment time periods, however, researchers can partially validate the common trends assumption. For example, researchers often plot the mean outcomes by group and time period and then ask whether the lines appear to be approximately parallel (e.g., see 8, figure 1, for an example related to the young adult mandate of the ACA, where the visual plot serves as a precursor to a statistical test of the parallel trends assumption). When the annual means are precisely estimated and year-to-year volatility is relatively low, it is easy to spot deviations from the common trends assumption in a long time series.
Visual evidence may be less compelling when the data are noisy or the time series is short. In such cases, it may be difficult to distinguish between statistical noise and genuine deviations from common trends. A graph also helps convey the strength of the policy shock, as measured for example by the impact of a health insurance policy on coverage rates. This is important because studies often go on to examine the impact of a policy on downstream outcomes (such as health care use or health status). The interpretability of graphical evidence is related to the broader issue of statistical power in DID designs. Assessing the statistical power of a DID design often requires more than the standard power analysis for simple mean differences and linear regression coefficients covered in standard textbooks, and it is important to consider the size of effects that such studies can reliably detect (see 26, p. 46; 70, 80).
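A minimal numeric version of the visual check is to compute the treated-control gap in each pretreatment period; under common trends the gap should be roughly constant. The means below are hypothetical, and any threshold on the spread is a judgment call rather than a formal test:

```python
# Sketch: a simple numeric check behind the usual pre-trend plot.
# Rather than rendering a figure, compute the treated-control gap in
# each pretreatment period; under common trends the gap should be
# roughly constant. All values are hypothetical.

pre_means = {
    "treated": {1: 14.0, 2: 15.1, 3: 15.9},
    "control": {1: 10.0, 2: 11.0, 3: 12.1},
}

gaps = {t: pre_means["treated"][t] - pre_means["control"][t]
        for t in pre_means["treated"]}
spread = max(gaps.values()) - min(gaps.values())
print(gaps, spread)
```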

Group-Specific Linear Trends
Another strategy for evaluating the common trend assumption in studies with more than two time periods is to fit an augmented DID regression that allows for group-specific linear trends [as done, for example, by Hansen et al. (56) in studying state cigarette taxes]. In practice, this amounts to a regression of the outcome on the treatment variable, group and period effects, and each group dummy interacted with the linear time index: $Y_{gt} = a_g + b_t + \beta_g t + \delta D_{gt} + \epsilon_{gt}$. The common trends model is nested in the group-specific trend model. An F-test of the compound null hypothesis that all the coefficients on the group-specific linear trends are jointly zero is a test of the common trends model. Rejecting the null hypothesis implies that common trends is not a valid assumption. In practice, most researchers interpret the group-specific linear trends model more casually by comparing the treatment effect estimates in the restricted and unrestricted models. If the treatment effect is not sensitive to the alternative specification, most researchers consider the core results more credible.
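A crude pure-Python stand-in for this idea (hypothetical data; a real application would run the joint F-test in a regression package) is to estimate pretreatment slopes separately by group and compare them:

```python
# Sketch: comparing pretreatment linear slopes across groups, a crude
# stand-in for the joint F-test on group-specific trend terms. A
# material difference in slopes is evidence against the common trends
# assumption. All values are hypothetical.

def ols_slope(ts, ys):
    """Slope of a simple least-squares line of y on t."""
    n = len(ts)
    tbar, ybar = sum(ts) / n, sum(ys) / n
    num = sum((t - tbar) * (y - ybar) for t, y in zip(ts, ys))
    den = sum((t - tbar) ** 2 for t in ts)
    return num / den

periods = [1, 2, 3, 4]
pre = {
    "treated": [10.0, 12.0, 14.0, 16.0],   # slope 2 per period
    "control": [8.0, 9.0, 10.0, 11.0],     # slope 1 per period
}
slopes = {g: ols_slope(periods, ys) for g, ys in pre.items()}
print(slopes)  # {'treated': 2.0, 'control': 1.0}
```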

Balancing Tests for Changes in Composition
In RCTs and matching studies, researchers often present evidence that the distribution of covariates is very similar in the treatment and control groups (59,63). The basic goal in this case is to show that the two groups were comparable prior to treatment exposure. In a DID study, the groups are usually nonequivalent prior to treatment exposure, so a simple covariate balancing test is not strictly required, although researchers may be more reassured when covariates are similar. What matters for DID validity is that differences between the two groups are stable over time and that changes in treatment exposure are not associated with changes in the distribution of covariates. One way to examine this aspect of DID validity empirically is to estimate covariate balance regressions (see, for example, 86, which uses covariate balancing to study the productivity of new surgeons). Suppose that in addition to data on $Y_{gt}$ and $D_{gt}$, researchers also have access to data on a covariate $C_{gt}$ associated with group $g$ in period $t$. A simple way to test for problematic compositional changes is to replace the outcome variable with the covariate and fit the standard DID regression model: $C_{gt} = a_g + b_t + \delta D_{gt} + \epsilon_{gt}$. Under the null hypothesis that there are no compositional changes, we expect that $\delta = 0$. Of course, it is sensible to consider the magnitude of the change in composition rather than the pure statistical significance of the coefficient estimate. Researchers can fit the DID regression to data on a large list of available covariates to assess the relevant concept of balance across a broad range of factors.
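In the 2 × 2 case, the balance regression coefficient reduces to a DID applied to the covariate means. A minimal sketch with made-up values:

```python
# Sketch: a compositional balance check that reuses the DID estimator
# with a covariate (here, mean age) in place of the outcome. Values are
# hypothetical; a delta near zero is the reassuring case.

cov_means = {  # mean age by (group, period); group 2 treated in period 2
    (1, 1): 41.0, (1, 2): 41.5,
    (2, 1): 44.0, (2, 2): 44.4,
}

def did(means):
    return (means[(2, 2)] - means[(2, 1)]) - (means[(1, 2)] - means[(1, 1)])

delta_cov = did(cov_means)
print(round(delta_cov, 6))  # -0.1, small relative to the age levels
```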

Granger-Type Causality Tests
To examine the possibility that future treatment exposures are anticipated by current outcomes, researchers can augment the standard DID regression model to include leading values of the treatment variable. For example, researchers might fit a model with $S$ leading values of the treatment variable:

$$Y_{gt} = a_g + b_t + \delta D_{gt} + \sum_{s=1}^{S} \gamma_s D_{g,t+s} + \epsilon_{gt}.$$

Under the strict exogeneity null, we expect that future policy changes will not be associated with current outcomes, so that $\gamma_s = 0$ for $s = 1, \ldots, S$. Decisions about how many leads to examine are somewhat arbitrary and mainly have to do with the total number of periods available for analysis and the timing of the policy changes. Examples of studies that include lead tests are those by Bachhuber et al. (11) (on the relationship between medical cannabis laws and opioid overdose mortality) and Raifman et al. (88) (on the relationship between same-sex marriage laws and adolescent suicide attempts).
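Building the lead indicators is mostly bookkeeping. A sketch with hypothetical adoption dates (group C never adopts):

```python
# Sketch: building lead treatment indicators for a Granger-type test.
# D_lead_s flags observations s periods before adoption; under strict
# exogeneity, coefficients on these leads should be near zero. The
# adoption timing here is hypothetical.

adoption = {"A": 4, "B": 6, "C": None}  # first treated period (None = never)
periods = range(1, 9)

def lead_indicator(group, t, s):
    """1 if the group adopts exactly s periods after t, else 0."""
    a = adoption[group]
    return 1 if (a is not None and t + s == a) else 0

rows = [
    {"g": g, "t": t,
     "D": 1 if (adoption[g] is not None and t >= adoption[g]) else 0,
     "lead1": lead_indicator(g, t, 1),
     "lead2": lead_indicator(g, t, 2)}
    for g in adoption for t in periods
]

# group A adopts at t = 4, so lead1 fires at t = 3 and lead2 at t = 2
print([(r["t"], r["lead1"], r["lead2"])
       for r in rows if r["g"] == "A" and (r["lead1"] or r["lead2"])])
```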

Time-Varying Treatment Effects
In many applications, the effect of the treatment may vary with time since exposure. Researchers can study these effects by including lagged treatment variables in the standard DID model. One common strategy is to use an event study framework examining anticipation effects and phase-in effects in a single regression such as

$$Y_{gt} = a_g + b_t + \delta D_{gt} + \sum_{m=1}^{M} \lambda_m D_{g,t-m} + \epsilon_{gt}.$$

In this specification, $\delta$ captures the immediate effect of the policy, and $\lambda_m$ measures any additional effects of a policy that occur $m$ periods after adoption. If the initial effect of the policy is positive, then negative values of $\lambda_m$ imply that the initial effect of the policy dissipates over time, and positive values of $\lambda_m$ suggest that the policy has larger effects over time. Event study figures are included, for example, in Bellou & Bhatt (16), who study drivers' license laws; Anderson et al. (5), who study medical marijuana laws; Bitler & Carpenter (21), who study mammography mandates; Simon (93), who studies cigarette taxes; Marcus & Siedler (75), who study alcohol policy in Germany; and Paik et al. (84), who study medical malpractice. Some studies, such as the one by Brot-Goldberg et al. (24), specifically look for anticipatory effects, in this case studying the effect of deductibles on health care prices, quantities, and spending. In general, whenever a policy includes a time gap between announcement and effective date, such behaviors are possible. In the context of a well-publicized federal policy change, Alpert (3) examines anticipatory effects before Medicare Part D implementation, exploiting the difference in behaviors observed for chronic versus acute drugs, and Kolstad & Kowalski (66) consider periods before, during, and after treatment.
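Recoding calendar time into event time is the first step in building such a figure. A minimal sketch with hypothetical adoption dates:

```python
# Sketch: recoding calendar time into event time for an event-study
# specification. m = t - adoption period; m = 0 is the adoption period,
# negative m are anticipation periods, positive m are phase-in periods.
# Adoption dates are hypothetical.

adoption = {"A": 4, "B": 6}

def event_time(group, t):
    return t - adoption[group]

def event_dummies(group, t, window=(-2, 2)):
    """Event-time dummies for a window of leads and lags around adoption."""
    m = event_time(group, t)
    lo, hi = window
    return {s: 1 if m == s else 0 for s in range(lo, hi + 1)}

print(event_time("A", 4))        # 0 (adoption period)
print(event_dummies("B", 7)[1])  # 1 (one period after adoption)
```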

Triple Differences
When the core DID assumptions are suspect on conceptual or empirical grounds, researchers sometimes seek to strengthen the research design by adding an additional comparison group and estimating treatment effects using a difference in difference in difference (DDD) design. Suppose that the DID design is questionable because some time-varying confounder changes differentially across the states that make up the study design. A time-varying confounder that is not state invariant is a problem for the DID study because it violates the common trend assumption. To address the problem with a DDD design, researchers need to find a new within-state comparison group that is not exposed to treatment but is exposed to the problematic time-varying confounder. With the two groups in hand, researchers can estimate the standard DID specification separately on the original data and on the new comparison group data. The DID estimate from the comparison group represents an estimate of the effect of the state-specific time-varying confounder that is free from any treatment effect. The DID estimate from the original data represents the combined effect of the confounder and the treatment. By subtracting one DID estimate from the other, forming a triple difference, researchers can remove the bias from the confounder and isolate the treatment effect [see Atanasov & Black (9, pp. 254-58) for a careful treatment of DDD designs].
Suppose that some states impose a tax on large hospitals but not small hospitals, and we wish to study its impact on the wages of nurses. The treatment states experience some of the same spurious shocks that affect control states, but suppose also that the tax-adopting states are mostly from geographic areas that faced a different set of regional economic booms and busts over time. The standard DID estimate might conflate the changes in the hospital tax policy with the regional economic conditions; that is, the DID model might fail to meet the common trends assumption. A DDD strategy might start by reasoning that small hospitals are subject to the same regional economic conditions as large hospitals but are not subject to the large hospital tax.
Nationwide, there are also some small-hospital shocks and some large-hospital shocks. Thus, either a DID that compares small and large hospitals within treated states or a DID that compares large hospitals across treatment and control states would be compromised. However, a DDD that compares changes over time in large hospitals in states with and without the policy against the same difference for small hospitals would produce an unbiased result. In other words, the common trends assumption should hold in the DDD, whereas it would not hold in either of the two possible DID comparisons separately. Researchers almost always present triple difference specification results as a supplement to a main DID specification; recent examples of use in health include the studies by Chatterjee and colleagues (36) and Heim & Lin (58), both of which examine the labor market outcomes of health insurance reform. It is fairly rare to find an article presenting a parallel trends test of a DDD, but Paik and colleagues (85) offer an example of such a test and show the importance of conducting such tests.
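The hospital tax example can be sketched as a difference of two DID estimates, using hypothetical mean wages:

```python
# Sketch: the triple-difference (DDD) estimate as a difference of two
# DID estimates, using hypothetical mean wages for large and small
# hospitals in tax-adopting ("tax") and non-adopting ("no_tax") states.
# The regional boom inflates all tax-state wages; only large hospitals
# also receive the tax effect.

means = {
    # (state group, hospital size, period): mean wage
    ("tax", "large", 1): 30.0, ("tax", "large", 2): 35.0,
    ("tax", "small", 1): 28.0, ("tax", "small", 2): 31.0,  # boom alone: +3
    ("no_tax", "large", 1): 29.0, ("no_tax", "large", 2): 30.0,
    ("no_tax", "small", 1): 27.0, ("no_tax", "small", 2): 28.0,  # trend: +1
}

def did(state):
    """Within-state DID: large-hospital change minus small-hospital change."""
    large = means[(state, "large", 2)] - means[(state, "large", 1)]
    small = means[(state, "small", 2)] - means[(state, "small", 1)]
    return large - small

ddd = did("tax") - did("no_tax")
print(ddd)  # 2.0, the tax effect net of the regional boom
```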

STATISTICAL INFERENCE IN DIFFERENCE IN DIFFERENCE
So far we have focused on the assumptions and conceptual threats to the validity of DID studies. However, a substantial literature makes it clear that statistical inference is also an important challenge in DID studies. The core message is that standard errors estimated under the assumption that errors are independent across observations are often biased downward, which leads to overrejection of the null hypothesis. Moulton (79) considers statistical inference for regression coefficients on variables that do not vary within aggregate groups. His examples involve models that link micro data on labor market outcomes with aggregate geographical information. The problem is that these factors do not vary within groups (or are correlated within groups), and the groups may also have a shared error structure. Moulton uses a parametric random effects model to show that standard errors are biased downward and that the magnitude of the bias depends positively on group size, intraclass correlations of the regression errors, and intraclass correlations of the regressors included in the model. Bertrand and colleagues (18) point out that many DID studies involve large group sizes and are apt to exhibit high levels of intraclass correlation of both errors and key independent variables. They use Monte Carlo simulations to assess the performance of several different methods of performing statistical inference in clustered data designed to mimic many DID studies. They find that many methods of inference fare poorly, especially when the number of clusters is relatively small. However, they also find that collapsing the data down to group-level cells, using cluster-robust standard errors, and using clustered bootstraps work relatively well.
Since the article by Bertrand and colleagues (18), there has been a small boom in research on alternative approaches to statistical inference in DID studies. Cameron & Miller (30) provide a helpful review of the literature. By our reading, the literature has not reached a consensus on the best way to perform inference in DID models. However, several themes have emerged. In most cases, it makes sense to aggregate the data so that outcomes are measured at the same level as the treatment variable [as is done by Bedard & Kuhn (14), who study healthy food nudging messages in a restaurant chain]. The standard cluster-robust variance estimator (72) should perform well in studies based on a large number of clusters. For studies with smaller numbers of clusters [this applies to geographical variation in countries like Germany, which has 16 states (69), or Sweden, which has 4 (2)], three broad families of methods have emerged. One set of methods performs inference using cluster-level randomization distributions (38,90). Another pursues various forms of the cluster bootstrap (28). A third approach performs finite sample corrections based on bias-reduced linearization (15,46,62,87). Cameron et al. (29) provide a method for adjusting for multiway clustering; Solon et al. (98) discuss the role of sampling weights. In addition, recent work by Abadie et al. (1) revisits the rationale for cluster standard error adjustments and emphasizes that the decision to adjust for clustering should flow from the treatment assignment rule embedded in the research design and the data collection method.

POLICY VARIATION AND HETEROGENEITY
Many US health policies in the last century have been decided at the state level, reflecting principles of federalism and efforts to find locally tailored solutions (60,82). State policies, however, often display a high degree of standardization across states, making it possible to generalize from the experience of several states in one study. If each state were to adopt a highly idiosyncratic legislative solution to public health challenges, the result would be a series of one-state DID studies, which would make it difficult to develop consensus and to provide evidence to aid future policy making. This is not to downplay the importance of single-geographic-unit studies when health policies such as indoor tobacco bans are introduced nationally, as they were in Ireland and China (51,106), or when one US state or locality enacts a policy that is unique in its time [e.g., Massachusetts health reform, in Kolstad (35)]. However, researchers are sometimes still able in such cases to compare policies across countries, as in the case of health care privatization in Latin America (27). Researchers are also able to use synthetic control methods to construct comparison groups using a weighted average of other countries, as done by Rieger and colleagues (89) in studying the effects of universal health insurance in Thailand.
One reason for this relative standardization across US state laws is the proliferation of model laws by policy organizations. For example, when states regulate access to controlled substances, they are able to consider sample legislation available through the National Alliance for Model State Drug Laws. Standardized versions of state laws for policies like the medicinal use of marijuana allow researchers to conduct studies using categorizations of states, exploiting variations in the year of adoption to implement a study with a DID design (22,81).
Despite the forces acting toward standardization in state laws, policies do tend to differ in important ways that reflect local political marketplaces (105). Researchers often separate state laws into a reasonably small number of meaningfully different categories, but it is important to understand the degree of detail that is sacrificed in this approach. Researchers often investigate the characteristics of the policies themselves or borrow classifications from other studies or policy organizations. In the area of state small-group insurance market reforms, for example, state laws may be characterized as strong or weak depending on whether regulations apply to all or some insurance policies (94). Considering alternative classifications of state policies and testing for sensitivity to the removal of states with particularly ambiguous policy status are both useful additions to analyses with policy heterogeneity. However, the availability of multiple analyses using the same classification systems facilitates comparisons across studies, and providing enough detail for replication is good practice.
Another way in which policy heterogeneity commonly presents itself in public health settings is a tax rate, for example in the area of regulating health behaviors (e.g., cigarette or alcohol taxes). Because each state tends to set an individualized rate, there is heterogeneity in the policy; however, the policy is linear, and its intensity can be measured continuously. Carpenter & Cook (33) advanced the study of cigarette tax effects on youth by implementing a DID model with state fixed effects, which the prior literature had omitted. Non-tax-rate examples of such linear policy measures include Medicaid physician fees or minimum wage laws, the public health impacts of which have been discussed in several recent articles (25,40,103). Linear measures of policy variation can be placed into the DID framework directly, but researchers may also explore nonlinearity in policy impacts using quadratic terms or by creating dummy variables for ranges of policy values (such as classifying tax rates as under or over certain values, or entering the values as a spline).
Even when using linear measures, researchers are faced with decisions as to whether the values should be entered in logs if the distribution of values is skewed across states, whether policy values should be measured in real or nominal terms, and whether the values should be normalized to the cost of some outside option (for example, studies of Medicaid fees often measure them relative to Medicare or private insurance fees, using a ratio as the key policy measure: e.g., 43,44). If there are nuances in these linear forms of laws, for example if health insurance regulations only apply to large firms, or alcohol taxes apply to beer but not wine, some may use the excluded group as a within-state control (e.g., 76) or may test for unintended spillover effects onto those groups; others may prefer to simply exclude those other groups.
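As a sketch of how a linear policy measure enters the DID framework directly, the following simulation estimates a two-way fixed effects model with a continuous tax variable and clusters standard errors at the state level. All tax levels, the true effect of -4, and the noise structure are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_states, n_years = 20, 10
df = pd.DataFrame(
    [(s, t) for s in range(n_states) for t in range(n_years)],
    columns=["state", "year"],
)

# Hypothetical cigarette tax: a state-specific level plus state-specific
# growth over time; the within-state changes identify the effect.
base_tax = rng.uniform(0.5, 2.0, n_states)
df["tax"] = base_tax[df["state"]] + rng.uniform(0, 1, len(df)) * df["year"] / n_years

# Outcome: a smoking rate with state effects, a common time trend, and a
# true effect of -4 percentage points per dollar of tax.
state_fe = rng.normal(30, 3, n_states)[df["state"]]
df["smoking"] = (
    state_fe - 0.5 * df["year"] - 4.0 * df["tax"] + rng.normal(0, 0.5, len(df))
)

# Two-way fixed effects DID with a continuous policy measure; standard
# errors clustered at the state level, where the policy is assigned.
fit = smf.ols("smoking ~ tax + C(state) + C(year)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["state"]}
)
print(f"estimated tax effect: {fit.params['tax']:.2f}")
```

Nonlinearity could be explored by adding a quadratic term such as `I(tax**2)` to the formula, or by replacing `tax` with dummy variables for ranges of the tax rate.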
The use of a within-state control group is especially helpful when diagnostic tests indicate that the DID is problematic due to a violation of the common trends assumption; if a credible within-state control group can be found that trends similarly to the treatment group absent the policy, then researchers may be able to implement a DDD as well. An example that is often used to explain DDD is the case of maternity coverage mandates and wages; Gruber (52) shows that because men should not be affected by the policy, they form a convincing within-state control group. Sometimes researchers report two separate DIDs rather than explicitly estimate a DDD [e.g., Simon & Kaestner (95) estimate the effects of minimum wages for low-educated and high-educated persons, using the high-educated group as a close-to-placebo group]. This way of observing effects on different groups differs from the approach taken by studies that examine policy heterogeneity (for example, researchers wishing to examine differences in the effects of cigarette taxes on smoking rates among youth versus adults would run two DIDs and report them separately, rather than run a DDD). Similarly, several health insurance studies use baseline county characteristics to examine whether the intended effects are greater in counties that are likely to benefit more from the policy (20,48,78). Cook & Durrance (39) take advantage of state variation in the degree to which federal alcohol taxes should be binding to construct an identification strategy.
Multidimensional policy heterogeneity can also be transformed into a linear measure, a technique that has proven popular in cases in which a formula can be created to measure the strength of the overall policy based on the fraction of people affected. Measures of Medicaid eligibility expansions in the 1980s and 1990s (42,53) and the literature on the long-term impacts of these expansions (e.g., 37) represent a prominent example. Medicaid eligibility is determined by a formula that counts some but not other forms of income and deducts certain expenses, with different rules depending on the number and ages of children in the family. Rather than create separate variables to measure each aspect of the policy, which leads to a cumbersome interpretation of parameters, or separate states into strong versus weak expansions, researchers collect the parameters that determine eligibility and boil down the variation into a single index. Taking a nationally representative population, one could compute the percentage of the population that would be eligible under the rules in place in a certain state and year, leading to an index that increases with generosity.
Using this variable as the policy term leads to a DID format whereby researchers can interpret how the outcome changes as generosity is increased, so that, for example, 10% more of a representative population may become eligible for the policy. This linear policy measure can then be used as the sole policy measure, although one criticism of this approach is that policy makers may want to know the effect of each actual policy lever they control (55). No matter how the policy variable is created, it can be used as an instrument for eligibility (e.g., when asking whether being eligible due to policy variation causes a reduction in private coverage) or as a reduced form (e.g., when answering how policy generosity affects the outcome).
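The simulated eligibility index described above can be sketched as follows. The sample, income distribution, and eligibility rules here are all hypothetical; the key design feature is that every state-year's rules are applied to the same fixed national sample, so the index moves only with policy generosity, not with local demographics:

```python
import numpy as np

rng = np.random.default_rng(2)

# One fixed, nationally representative sample of families (hypothetical
# draws): income as a multiple of the federal poverty line, and the
# youngest child's age.
n = 10_000
income_fpl = rng.lognormal(mean=0.6, sigma=0.6, size=n)
child_age = rng.integers(0, 18, size=n)

def simulated_eligibility(cutoff_young, cutoff_older, age_break=6):
    """Fraction of the fixed national sample eligible under one state-year's
    rules: an income cutoff (as a multiple of the poverty line) that is
    more generous for families with young children."""
    cutoff = np.where(child_age < age_break, cutoff_young, cutoff_older)
    return float(np.mean(income_fpl <= cutoff))

# Applying different state-year rules to the same sample yields the index.
pre_expansion = simulated_eligibility(1.00, 0.50)
post_expansion = simulated_eligibility(1.85, 1.00)
print(pre_expansion, post_expansion)
```

In a DID, the coefficient on such an index answers questions of the form "what happens when the rules change so that 10% more of a representative population becomes eligible"; the index can also serve as an instrument for individual eligibility, as noted in the text.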

DISCUSSION
Quasi-experimental research designs can be an effective way to learn about causal relationships that are important for public health science and public health policy. Recent innovations allow researchers to approach the design of quasi-experimental studies in much the same way that they would approach the design of a fully randomized experimental study. Quasi-experiments are apt to work best when researchers actively decide which of the possible imperfect comparison groups is likely to best satisfy the assumptions of a particular technique. A study will be most convincing when researchers have thought carefully about the substantive meaning of key assumptions for their specific study. Given that the modern technical literature is large and complex, care is needed to identify and employ the tools and techniques that are most relevant to a given study.
This article examined DID designs in detail not because DID designs are the best approach to quasi-experimental research design, but because DID designs are often feasible in public health research in large federal or decentralized countries that collect data through a wide range of surveys and administrative databases. In the United States, for instance, a wide range of regulations and environmental conditions vary across geographic areas and over time, providing many opportunities to learn about causal effects. However, there are several cases in which methods other than DID are best for evaluating state policy: when data prior to state policy variation were not available, researchers have used age-based regression discontinuities to understand the impact of Medicare (32) or alcohol policy (34). DID designs are also applied to nongeographic units, such as in studying Medicare Part D and the ACA's young adult mandate, where groups are compared over time by age; moreover, DID designs can be applied with neither time nor geography (e.g., access to insurance and health are the two dimensions used in 73). Honing skills at designing and implementing high-quality DID studies that can make the best of the available data is a valuable part of the public health research toolkit.
Although it is beyond the scope of our review, we anticipate that future methodological advances will often involve hybrid research designs that exploit multiple quasi-experimental design elements. For example, Wing & Cook (104) use design elements from DID and matching studies to strengthen the external validity of the regression discontinuity design [ (7) and (17) also aim to expand external validity of regression discontinuity design], and Kreif et al. (68) compare the results of a synthetic control approach to those of a DID approach to evaluate the effects of hospital pay-for-performance programs. The advances in DID methods surveyed in this article, together with these future possibilities for further innovation, suggest that the DID framework will continue to be one of the workhorse models used in public health policy research.

DISCLOSURE STATEMENT
The authors are not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.