15 Causal Inference Methods: A Focus on Difference-in-Differences
Determining causality in observational data is a central challenge in empirical economics. Over decades, researchers have developed quasi-experimental methods to estimate causal effects when randomized trials are infeasible. Five of the most widely used approaches are Difference-in-Differences (DiD), Instrumental Variables (IV), Regression Discontinuity Designs (RDD), Matching methods, and Synthetic Control. Each method rests on specific assumptions, addresses particular use cases, and has distinct limitations. We begin with a brief overview of these methods before delving deeply into Difference-in-Differences.
Difference-in-Differences is an econometric strategy for causal inference that exploits longitudinal comparisons between a group affected by a treatment (policy, intervention, etc.) and an unaffected comparison group. The classic DiD setup involves two groups (treatment vs. control) observed over two time periods (before vs. after the treatment). By comparing the before-after change in outcomes for the treated group to the change over the same period for the control group, the DiD estimator differences out time-invariant differences between groups as well as common trends affecting both groups. Formally, if \(\Delta Y^{T}\) is the outcome change for the treated group and \(\Delta Y^{C}\) is the change for controls, the DiD estimate is \(\hat{\delta} = (\Delta Y^{T} - \Delta Y^{C})\). This provides an unbiased estimate of the treatment effect under the crucial parallel trends assumption, which states that in the absence of treatment the average outcomes for treatment and control would have followed parallel paths over time. We will discuss this assumption in detail later, as it is the cornerstone of DiD’s validity.
Use cases: DiD is commonly used for policy evaluations where a policy is implemented in one group but not another, or at one time in some jurisdictions and later (or not at all) in others. It has been applied in labor economics (e.g. minimum wage laws), public finance (tax or benefit changes), health policy (insurance expansions), education (school reforms), and even in historical or geopolitical analyses (e.g. impacts of wars or institutional changes). For example, Card and Krueger (1994) famously applied DiD to study a minimum wage increase in New Jersey by comparing employment changes in New Jersey fast-food restaurants to those in Pennsylvania, finding “no indication that the rise in the minimum wage reduced employment”. Because DiD controls for persistent differences between groups and for common shocks over time, it is intuitively appealing and relatively easy to implement, often via a regression with group and time fixed effects.
Assumptions and limitations: Besides requiring parallel outcome trends (in expectation) absent treatment, DiD assumes no spillovers or interference between units (part of the Stable Unit Treatment Value Assumption, SUTVA). There should be a clear delineation of treated and control units, and individuals should not change groups over time (or any compositional changes are appropriately addressed). Violation of parallel trends – say, if the treated group was already on a different trajectory – will bias the estimates. In practice, researchers probe this assumption by checking pre-treatment outcome trends (if multiple pre-periods are available) to see if treated and control units evolved similarly prior to the intervention. Another practical issue is serial correlation in outcomes: since DiD often uses multi-period panel data, standard errors can be underestimated if one does not account for autocorrelation within units over time. As we will detail, clustering standard errors at the group (e.g. state) level is a common approach to address this, as recommended by Bertrand, Duflo, and Mullainathan (2004). In sum, DiD is powerful and intuitive, but its credibility hinges on the context – researchers must argue convincingly that, aside from the treatment, the treatment and comparison groups would have had comparable trends. We will devote the bulk of this chapter to the theory and practice of DiD, including recent advances that relax its traditional assumptions.
15.1 Instrumental Variables (IV)
The Instrumental Variables approach tackles causal inference by finding an external source of variation – an instrument – that nudges the treatment without directly affecting the outcome except through that treatment. An ideal instrument mimics random assignment for the subpopulation it influences. IV was first developed in the 1920s for estimating supply and demand elasticities and later became a workhorse for addressing omitted variable bias. Intuitively, an instrument \(Z\) helps identify causality as follows: \(Z\) causes changes in the treatment \(D\) (relevance), and \(Z\) is assumed to affect the outcome \(Y\) only through \(D\) and not via other pathways (exclusion restriction).
Use cases: IV is especially useful when there is concern that the treatment \(D\) is endogenous – correlated with unobserved factors that also affect \(Y\). Classic examples include using draft lottery numbers as an instrument for military service to estimate its earnings effect (since lottery assignment is random), or using proximity to a college as an instrument for education to estimate returns to schooling. Another famous example is Angrist and Krueger (1991) using individuals’ quarter of birth as an instrument for educational attainment: due to school-entry age cutoffs and compulsory schooling laws, children born earlier in the year end up with slightly less schooling on average. Date of birth is as good as randomly assigned and presumably unrelated to innate ability or family background, satisfying the exclusion restriction. By using only the variation in schooling induced by quarter of birth, one can consistently estimate the return to education. More generally, policy changes or random shocks often serve as instruments (e.g. weather as an instrument for income in studying income’s effect on health).
Assumptions and limitations: IV’s strength is that it can recover causal effects even when selection bias is present, if a valid instrument is available. However, valid instruments are hard to find. The exclusion restriction is fundamentally untestable and often controversial – one must argue on economic or institutional grounds that the instrument affects the outcome only via the treatment. Weak instruments (those with weak correlation with \(D\)) can lead to imprecise estimates and large biases. Moreover, IV estimates a local average treatment effect (LATE) – the effect for those whose treatment status is changed by the instrument (the “compliers”). This may differ from the average treatment effect in the population. Despite these caveats, IV is a powerful tool, particularly in areas like labor, health, and development economics where randomized experiments are rare and most regressors of interest (education, income, policies) are endogenous. Key references formalizing IV include the work of Imbens and Angrist (1994) on the LATE interpretation and Angrist and Krueger (2001) for an overview.
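To fix ideas, the sketch below shows how a two-stage least squares estimate of the return to schooling might be obtained in R with fixest's instrumental-variable syntax. This is a minimal sketch, not a replication: the data frame wage_data and the variable names (log_wage, exper, educ, qob) are hypothetical placeholders loosely inspired by the Angrist and Krueger setting.
# Hypothetical sketch: quarter of birth (qob) as an instrument for schooling
library(fixest)
# fixest IV syntax: outcome ~ exogenous controls | endogenous variable ~ instrument(s)
iv_est <- feols(log_wage ~ exper + I(exper^2) | educ ~ factor(qob), data = wage_data)
summary(iv_est)   # second-stage coefficient on educ is the estimated return to schooling
# Always inspect the first stage as well: a weak instrument (low first-stage F) is a warning sign.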
15.2 Regression Discontinuity Design (RDD)
Regression Discontinuity designs exploit situations where treatment assignment is determined by whether an observed running variable \(R\) crosses a known threshold. In the vicinity of the cutoff, units just below and just above are assumed to be comparable except for the treatment assignment, much like a randomized experiment localized at the threshold. The intuition is that if you plot the outcome against the running variable, a jump (discontinuity) in the outcome at the cutoff can be attributed to the treatment’s causal effect (assuming no other discontinuous changes occur at that point).
Use cases: RDD is commonly applied in policy contexts with eligibility criteria or ranking-based allocations. For example, Thistlethwaite and Campbell (1960) introduced RD by studying the effect of receiving a merit award on students’ future outcomes, using the fact that only those scoring above a certain exam threshold received the award. By comparing students just below and above the cutoff score, they estimated the award’s impact. In modern economics, RDD has been used to study effects of scholarships (using exam score cutoffs), class size (e.g. maximum class size rules creating cutoffs that split classes), political incumbency advantages (using vote share 50% threshold in elections), minimum legal drinking age effects (age 21 cutoff), and many other settings. RDD comes in two flavors: sharp RD, where the treatment switches cleanly at the threshold, and fuzzy RD, where crossing the threshold only partially increases the probability of treatment (requiring an IV-style interpretation).
Assumptions and limitations: The key assumption in RDD is local random assignment around the cutoff: units cannot precisely manipulate their position around the threshold, so those just below/above are exchangeable. Researchers must check for smoothness of covariates at the cutoff and ensure no sorting or strategic behavior that undermines the design (for instance, if people can precisely score just above the cutoff intentionally, the design fails). RDD estimates are by construction local to the cutoff – the effect for individuals near the threshold, which may not generalize to those far from it. In addition, RDD usually requires a large sample around the cutoff to precisely estimate the discontinuity, and bandwidth and polynomial choices in estimation can affect results (though robust methods exist, like local linear regression with optimal bandwidth selection). Despite these considerations, RDD is regarded as one of the most credible non-experimental strategies since, under the assumption of no precise manipulation, it yields a design-based estimate of causal effects akin to a randomized trial near the threshold. Excellent expositions of RDD can be found in Hahn, Todd, and van der Klaauw (2001) for identification and Lee and Lemieux (2010) for practical guidance.
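For a concrete flavor of implementation, the rdrobust R package provides local-polynomial RD estimation with data-driven bandwidth selection. The sketch below is only illustrative and assumes a hypothetical data frame df with an outcome y, a running variable score, and a cutoff at 50.
# Hypothetical sketch of a sharp RD using the rdrobust package
library(rdrobust)
rdplot(y = df$y, x = df$score, c = 50)              # visual check for a jump at the cutoff
rd_est <- rdrobust(y = df$y, x = df$score, c = 50)  # local linear estimate with robust confidence intervals
summary(rd_est)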
15.3 Matching Methods
Matching and related reweighting methods address causality by selecting (or weighting) a comparison group that is as similar as possible to the treated group in terms of observed covariates. The idea is to recreate the balance of a randomized experiment: if treatment and control units have identical covariate distributions, then any remaining outcome difference can be attributed to the treatment (assuming selection on observables). A major development in this literature was the concept of the propensity score introduced by Rosenbaum and Rubin (1983). The propensity score \(e(X)\) is the probability of treatment conditional on observed covariates \(X\). Rosenbaum and Rubin showed that if treatment is as good as random conditional on \(X\) (no unobserved confounding given \(X\)), then it is also random conditional on the scalar propensity score \(e(X)\). Thus, one can match treated and control units on \(e(X)\) or use weights derived from \(e(X)\) (such as inverse probability of treatment weights) to achieve covariate balance. In practice, common approaches include nearest-neighbor matching (finding for each treated unit one or several control units with closest propensity scores or covariate values), stratification or subclassification on propensity score blocks, weighting by propensity scores, and covariate adjustment using the propensity score.
Use cases: Matching is widely used in program evaluation, medical studies, and any setting with rich observed covariates where one suspects that selection into treatment is primarily driven by those observables. For example, in evaluating job training programs, researchers might match treated participants with observationally similar nonparticipants to estimate program impacts. In health economics, one might match patients who received a new treatment with similar patients who did not, to compare outcomes. Matching methods were notably used in Dehejia and Wahba (2002) to re-analyze the National Supported Work experiment data, showing that matching could recover experimental results when done carefully. Another strand is difference-in-differences with matching (or “matching-on-trends”), where one first matches units based on pre-treatment outcomes and covariates, then applies DiD, combining the strengths of both approaches.
Assumptions and limitations: The key assumption for matching is selection on observables (conditional independence): all confounding between treatment and outcome can be captured by observed covariates. This is a strong assumption; if there are hidden differences, matching will not eliminate bias. Furthermore, matching requires common support/overlap: for every treated unit, there should be comparable control units with similar covariate values (propensity scores). If treated cases have characteristics outside the range of controls (or vice versa), one cannot reliably estimate their counterfactual outcomes. In practice, one checks overlap and may discard observations outside the common support. When there are many covariates, finding exact or close matches in high-dimensional space is difficult (“curse of dimensionality”). The propensity score helps reduce dimensionality, but one must correctly specify the propensity model and assess balance after matching. Matching methods do not inherently account for unobserved confounders, and they typically yield the average treatment effect on the treated (ATT) unless adjustments are made for a population ATE. Despite these challenges, matching and weighting are valuable tools, especially when complemented with robustness checks. Modern variants include entropy balancing, Mahalanobis distance matching, and machine learning-based matching to improve balance. In summary, matching aligns treatment and control groups on observable characteristics, making the identifying assumption of “no hidden bias” more plausible than in a naive comparison, but researchers must carefully justify that assumption in any given application.
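As an illustration, a bare-bones propensity-score matching workflow in R might look like the sketch below, using the MatchIt package; the treatment indicator and covariate names (treat, age, educ, earnings_pre, earnings_post) are hypothetical.
# Hypothetical sketch: nearest-neighbor propensity-score matching with MatchIt
library(MatchIt)
m_out <- matchit(treat ~ age + educ + earnings_pre, data = df,
                 method = "nearest", distance = "glm")   # logistic-regression propensity score
summary(m_out)                  # check covariate balance before vs. after matching
matched <- match.data(m_out)    # matched sample with matching weights
# Outcome comparison in the matched sample gives an estimate of the ATT
summary(lm(earnings_post ~ treat, data = matched, weights = weights))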
15.4 Synthetic Control
The Synthetic Control Method is a relatively recent innovation (Abadie and Gardeazabal 2003; Abadie, Diamond, and Hainmueller 2010) designed for comparative case studies where a single unit (or a few units) undergoes an intervention, and one needs a suitable comparison group constructed from a pool of potential controls. The idea is to algorithmically choose a weighted combination of control units – a “synthetic” control – such that it mimics the treated unit’s pre-intervention characteristics (outcomes and covariates). The post-intervention outcome of this synthetic control then provides an estimate of the counterfactual outcome that the treated unit would have had in absence of the treatment. The treatment effect is estimated as the difference between the treated unit’s actual post-treatment outcome and that of the synthetic control.
Use cases: Synthetic control is especially useful in settings with aggregate units (regions, cities, countries, firms) where one unit is exposed to some event or policy and we want to compare it to an appropriate baseline. Classic examples include Abadie and Gardeazabal (2003), who evaluated the economic cost of terrorism in the Basque Country by comparing Basque GDP to a synthetic region constructed from other Spanish regions. Similarly, Abadie et al. (2010) analyzed California’s tobacco control program by creating a synthetic California from other states to serve as a counterfactual for smoking prevalence and health outcomes. Synthetic control has also been applied to study the impact of German reunification in 1990 on West Germany’s economy (using other OECD countries to form a synthetic West Germany), the effect of natural disasters on economic indicators, and various policy evaluations where a treated unit can be contrasted with a convex combination of donors. It has gained popularity in political science and marketing as well, for evaluating interventions that affect a single treated entity.
Assumptions and limitations: Synthetic control inherits some assumptions from both matching and DiD. It assumes that a weighted combination of control units can approximate the treated unit’s counterfactual trajectory – essentially an extension of parallel trends, but allowing the levels and trends of the synthetic control to match those of the treated unit pre-intervention. The quality of the synthetic control fit in pre-period is thus critical: good pre-treatment fit bolsters confidence that post-treatment differences are due to the treatment. By restricting weights to be positive and sum to one, synthetic control avoids extrapolation beyond the support of the data (unlike some regression extrapolations). However, if no combination of donors can closely reproduce the treated unit’s pre-trends, inference is difficult. Another limitation is that standard inferential techniques are not straightforward; researchers often use permutation or placebo tests (comparing the gap for the treated unit to gaps obtained by pretending other units were treated). Synthetic control works best when there are a reasonable number of comparison units and when the intervention is unique or rare (so that traditional DiD with many treated units is not applicable). It provides a transparent way of choosing a comparison group based on data, which can be preferable to cherry-picking a single control unit. That said, the method is computationally intensive if the donor pool is large, and it generally estimates an effect for one treated unit (or a few units) rather than an average effect over many units. In recent years, there have been extensions like Synthetic DiD (combining aspects of DiD and synthetic control), and theoretical work formalizing estimation and inference for synthetic controls (Abadie et al., 2015; Arkhangelsky et al., 2021). Despite being newer, synthetic control has become an important tool for policy analysis in macroeconomics, international economics, and regional studies, where treatments often apply to single entities (e.g. a country imposing a new policy or facing a shock).
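To convey the mechanics without relying on a particular package (dedicated R implementations such as the Synth package exist), the following base-R sketch chooses nonnegative donor weights summing to one that minimize the pre-treatment prediction error; all object names (y1_pre, Y0_pre, y1_post, Y0_post) are hypothetical.
# Hypothetical sketch: synthetic-control weights by constrained least squares
# y1_pre: treated unit's pre-treatment outcomes (length T0)
# Y0_pre: T0 x J matrix of donor pre-treatment outcomes; Y0_post: post-period donor outcomes
synth_weights <- function(y1_pre, Y0_pre) {
  J <- ncol(Y0_pre)
  obj <- function(theta) {                  # softmax keeps weights >= 0 and summing to 1
    w <- exp(theta) / sum(exp(theta))
    sum((y1_pre - Y0_pre %*% w)^2)
  }
  opt <- optim(rep(0, J), obj, method = "BFGS")
  exp(opt$par) / sum(exp(opt$par))
}
# w_hat       <- synth_weights(y1_pre, Y0_pre)
# effect_path <- y1_post - Y0_post %*% w_hat   # treated minus synthetic control in the post-period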
These methods – DiD, IV, RDD, Matching, and Synthetic Control – form the core toolkit of modern causal inference in economics. Each requires certain identifying assumptions: parallel trends for DiD, exclusion and as-good-as-random assignment of instrument for IV, local continuity/no manipulation for RDD, selection on observables for Matching, and good fit for Synthetic Control. Table 1 provides a high-level comparison:
Table 1. Causal Inference Methods: Assumptions, Use Cases, and Limitations
Method | Key Assumption(s) | Prototypical Use Case | Key Limitation(s) |
---|---|---|---|
Difference-in-Differences | Parallel trends (no differential pre-trends); SUTVA (no interference). | Policy change affecting one group but not another (e.g. state policy, new program rollout). | Requires pre-treatment data; vulnerable to omitted time-varying shocks; standard errors need clustering. |
Instrumental Variables | Exclusion restriction (instrument only affects outcome via treatment); instrument is as-good-as random. | Omitted variable bias or endogeneity, e.g. using random assignment or natural experiments as instruments (lotteries, policy shocks). | Valid instrument may not exist; weak instruments cause bias; estimates LATE for compliers. |
Regression Discontinuity | Units cannot precisely manipulate the running variable; continuity of potential outcomes at cutoff. | Program eligibility or treatment determined by threshold (exam score, income cutoff, election win). | Local effect only; requires many observations near cutoff; sensitivity to bandwidth and functional form. |
Matching/Weighting | Selection on observables (no unobserved confounders given X); overlap in covariate distributions. | Any observational study with rich covariates, e.g. program evaluation with survey data (job training, medical treatment). | Cannot account for unobserved bias; requires correct model for propensity score; extreme weights if poor overlap. |
Synthetic Control | Treated unit’s outcome can be approximated by a weighted combo of donors; no other shocks differentially affecting treated unit. | Case studies with one (or few) treated aggregate unit(s), e.g. impact of a specific policy in one country/region (terrorism in Basque, reunification of Germany). | Inference is non-standard; needs good pre-treatment fit; not suitable if many treated units (better for single treatment cases). |
The rest of this chapter is dedicated to a deep and technical exposition of Difference-in-Differences (DiD), reflecting its prominence and recent developments. We will cover the formal setup and assumptions of DiD, provide intuition through graphs and algebra, discuss estimation techniques from the simple two-period case to modern multiple-period and staggered-adoption cases, explore extensions like event studies, triple differences, and synthetic control extensions, address threats to validity (and how to diagnose or mitigate them), and illustrate with empirical examples (with a tilt toward geopolitics and international economics). R code snippets will be included to demonstrate how DiD analyses are conducted in practice.
15.5 Difference-in-Differences: Foundations and Assumptions
Setup and Notation
In a canonical Difference-in-Differences setup, we observe data on two groups (\(i \in \{\text{Treated}, \text{Control}\}\)) over at least two time periods (\(t \in \{\text{Pre}, \text{Post}\}\)). Only the treated group receives the treatment in the post period, while the control group is never treated. Let \(Y_{it}\) be the outcome of interest. Following the potential outcomes framework, each unit \(i\) in period \(t\) has two potential outcomes: \(Y_{it}(1)\) if exposed to treatment, and \(Y_{it}(0)\) if not. In the pre-period (\(t=0\)), neither group is treated, so \(Y_{i0} = Y_{i0}(0)\) for both groups. In the post-period (\(t=1\)), treated units receive the treatment (so they realize \(Y_{i1} = Y_{i1}(1)\)), while control units remain untreated (\(Y_{i1} = Y_{i1}(0)\)). The fundamental problem is we never observe \(Y_{i1}(0)\) for treated units – the counterfactual outcome had they not been treated. DiD provides a way to estimate the average treatment effect on the treated (ATT) by using the control group’s outcome change as an estimate of this counterfactual change for the treated group.
To formalize, denote by \(D_i\) an indicator for being in the treated group, and by \(Post_t\) an indicator for the post-treatment period. The DiD estimator for ATT can be derived as:
\[ \hat{\delta}_{\text{DiD}} \;=\; [\mathbb{E}(Y \mid D=1, Post=1) - \mathbb{E}(Y \mid D=1, Post=0)] \;-\; [\mathbb{E}(Y \mid D=0, Post=1) - \mathbb{E}(Y \mid D=0, Post=0)]~, \]
which is the difference in average outcome before vs. after for the treated, minus the same difference for controls. Expanding and rearranging terms, this equals:
\[ \hat{\delta}_{\text{DiD}} = \Big(\mathbb{E}[Y_{1}(1) \mid D=1] - \mathbb{E}[Y_{0}(0) \mid D=1]\Big) \;-\; \Big(\mathbb{E}[Y_{1}(0) \mid D=0] - \mathbb{E}[Y_{0}(0) \mid D=0]\Big)~. \]
We add and subtract the unobserved counterfactual \(\mathbb{E}[Y_{1}(0) \mid D=1]\) (what the treated group’s outcome would have been without treatment) to connect this to the true ATT. After adding this term inside the first bracket and subtracting it in the second, \(\hat{\delta}_{DiD}\) can be written as:
\[ \hat{\delta}_{\text{DiD}} = \underbrace{\mathbb{E}[Y_{1}(1) - Y_{1}(0) \mid D=1]}_{\text{ATT}} \;+\; \underbrace{\Big(\mathbb{E}[Y_{1}(0) \mid D=1] - \mathbb{E}[Y_{1}(0) \mid D=0]\Big)}_{\text{bias if pre-trends differ}} \;-\; \underbrace{\Big(\mathbb{E}[Y_{0}(0) \mid D=1] - \mathbb{E}[Y_{0}(0) \mid D=0]\Big)}_{\text{baseline difference}}~. \]
The last two terms represent the difference in untreated outcome levels (at \(t=1\) and \(t=0\) respectively) between treated and control. The parallel trends assumption essentially posits that these two differences are equal, so that they cancel out. Specifically:
Parallel Trends Assumption: \(\;\; \mathbb{E}[Y_{1}(0) \mid D=1] - \mathbb{E}[Y_{0}(0) \mid D=1] \;=\; \mathbb{E}[Y_{1}(0) \mid D=0] - \mathbb{E}[Y_{0}(0) \mid D=0]~.\)
This states that in the absence of treatment, the average outcome for treated units would have changed by the same amount as that for control units between period 0 and 1. Under this assumption, the bias term is zero and \(\hat{\delta}_{DiD}\) consistently estimates the ATT. In other words, the control group’s outcome trend provides a valid counterfactual for the treated group’s outcome trend.
No Anticipation Assumption: Another implicit assumption is that there is no anticipation or partial uptake of the treatment in the pre-period. Units should not change their behavior before treatment in expectation of future treatment. If they do (e.g. people start adjusting behavior as soon as a policy is announced, before it’s implemented), then the pre-treatment period might already be contaminated by treatment effects, violating the DiD setup. One can sometimes address this by redefining the pre-period or modeling anticipation explicitly, but it’s important to be aware of this possibility.
Stable Unit Treatment Value Assumption (SUTVA): For DiD (and any causal design), we require that one unit’s treatment does not affect another unit’s outcome. In DiD contexts, this usually means no spillovers or general equilibrium effects that cross from treated to control units. For example, if a policy in one state causes migration or market changes that affect outcomes in neighboring states (the “control” group), SUTVA is violated. Such interference can lead to under- or over-estimation of treatment effects. In some cases, researchers may attempt to measure and model spillovers (e.g. using Spatial DiD models or exposure mappings), or use methods like Two-Stage Least Squares with spatial instruments, but that goes beyond the basic DiD setup.
Group Composition and Alignment: In panel DiD, it is assumed we track the same units over time (or that we have repeated cross-sections with comparable populations in each group). If the composition of the treatment or control group changes over time (through selective attrition, for instance), comparisons can be distorted. For example, if a policy causes some people to leave the treated group (or the control group’s composition shifts), then the before-after comparison might not be “like-to-like.” Researchers need to be cautious about such changes – sometimes a balanced panel analysis or including covariates for composition changes can help, or using aggregate data where composition is stable.
In summary, the foundation of DiD is a difference of differences that differences out two kinds of nuisance factors: (1) time-invariant differences between groups (handled by the first difference between groups in the pre-period), and (2) time trends common to both groups (handled by the difference over time in the control group). What remains, under the assumptions above, is the causal effect of treatment on the treated group.
Graphical and Intuitive Illustration
It is helpful to visualize the DiD setup. Figure 1 illustrates a typical scenario. The horizontal axis is time (Pre vs. Post), and the vertical axis is the outcome \(Y\). The treated group’s average outcome is represented by the blue line, and the control group’s by the red line. In the pre-period (to the left of the vertical dashed line marking the intervention time), the treated group has a higher outcome level than the control group (the two lines are apart vertically), but importantly, both lines are roughly parallel – indicating similar trends. At the moment of treatment (vertical line), the treated group’s outcome jumps upward, while the control group continues on its original trajectory. The difference-in-differences estimate is essentially the vertical gap between the blue and red lines in the post-period, minus the gap in the pre-period.
Figure 1. Illustration of the Difference-in-Differences design. In the pre-treatment period, treated (blue solid line) and control (red dashed line) outcomes differ in level but evolve in parallel. The treatment (at the dashed vertical line) causes the treated outcome to shift upward in the post-period. The DiD estimator compares the change for treated vs. control, effectively attributing the extra increase in the treated group to the treatment effect (shown by the green double-arrow). The key identifying assumption is that, absent the treatment, the treated group would have followed the gray dotted trajectory – i.e., the same trend as the control group.
In Figure 1, the green double-arrow represents the DiD estimate. The gray dotted line extends the pre-treatment trend of the treated group into the post-period, illustrating the counterfactual scenario of no treatment. The difference between the treated group’s actual post-treatment outcome and this counterfactual (gray) outcome is the true treatment effect. The DiD estimator recovers this by subtracting the control group’s change (red line’s increase) from the treated group’s change (blue line’s increase).
If the parallel trends assumption holds, the gray dotted line is a good guess for what would have happened to the treated group without treatment. The control group’s trend provides the slope of that gray line. Therefore, the vertical distance between the blue line and gray line at post-period is the ATT, and DiD aims to estimate exactly that. If, however, the treated group would have grown faster or slower than the control group even without treatment (i.e., non-parallel trends), then the gray line would be mis-drawn – perhaps it would be higher or lower than shown. In that case, the DiD estimate (green arrow) would partly reflect those inherent trend differences rather than the treatment effect, leading to bias.
To build intuition: why might parallel trends hold or fail? Parallel trends is more plausible when the treated and control units are subject to the same broad influences. This is why DiD is often implemented comparing units in the same environment (e.g., neighboring regions or similar populations) where one gets the policy and the other doesn’t. For instance, Meyer, Viscusi, and Durbin (1995) evaluated a change in workers’ compensation in one state using neighboring states as controls – if the regional economy affects all states similarly, parallel trends may hold. On the other hand, if the treated region was on a different trajectory due to unrelated factors (say, a booming local industry), then parallel trends would fail. Researchers often bolster the plausibility of parallel trends by showing pre-treatment data: if the outcome series for treated and control move in sync before treatment, it’s more credible that they would have continued to do so absent the intervention. However, one must be cautious – a lack of divergence pre-treatment is necessary but not fully sufficient to guarantee parallel trends in the counterfactual sense (the treatment could be timed in reaction to emerging differences, etc.). We will later discuss formal tests and sensitivity analyses for the parallel trends assumption.
A Regression Formulation
Difference-in-Differences can be implemented via a simple regression model. For the basic two-group, two-period case, one can run an OLS regression:
\[ Y_{it} = \beta_0 + \beta_1 \mathbf{1}\{t=1\} + \beta_2 \mathbf{1}\{i \in \text{Treated}\} + \beta_3 (\mathbf{1}\{t=1\} \times \mathbf{1}\{\text{Treated}_i\}) + \varepsilon_{it}~, \]
where \(\mathbf{1}\{t=1\}\) is a post-period dummy, \(\mathbf{1}\{\text{Treated}_i\}\) is a treated-group dummy, and the interaction term is a dummy for being treated and being in the post-period. This interaction’s coefficient \(\beta_3\) is exactly the DiD estimator \(\hat{\delta}_{\text{DiD}}\). To see this, note the four cells of this 2x2 design:
- For a treated unit in post-period: \(\mathbb{E}[Y|D=1,t=1] = \beta_0 + \beta_1 + \beta_2 + \beta_3\).
- Treated in pre-period: \(\mathbb{E}[Y|D=1,t=0] = \beta_0 + \beta_2\).
- Control in post-period: \(\mathbb{E}[Y|D=0,t=1] = \beta_0 + \beta_1\).
- Control in pre-period: \(\mathbb{E}[Y|D=0,t=0] = \beta_0\).
Plugging these into the DiD formula: \((\beta_0+\beta_1+\beta_2+\beta_3 - \beta_0 - \beta_2) - (\beta_0+\beta_1 - \beta_0) = \beta_3\). Thus, \(\beta_3\) captures the difference-in-differences. The term \(\beta_2\) picks up any baseline difference between treated and control (average difference in \(Y\) when \(t=0\)), and \(\beta_1\) captures common time changes (the change in the control group’s mean from pre to post). Including these terms absorbs the main effects of group and time, isolating the interaction as the treatment effect.
Practically, one often augments this regression with controls or uses a more flexible formulation. A common approach for multiple periods or multiple groups is a two-way fixed effects (TWFE) model:
\[ Y_{it} = \alpha_i + \lambda_t + \delta \cdot D_{it} + X_{it}'\gamma + \varepsilon_{it}~, \]
where \(\alpha_i\) are unit fixed effects (to control for time-invariant differences across units), \(\lambda_t\) are time fixed effects (to control for any common shocks at each time period), and \(D_{it}\) is a treatment indicator that equals 1 if unit \(i\) is treated in period \(t\) (and 0 otherwise). In the canonical two-period case, \(D_{it}\) is basically the interaction dummy; in multi-period cases, \(D_{it}\) may turn on at a certain time (and possibly stay on thereafter, in a staggered adoption scenario). This TWFE regression with \(\delta\) on \(D_{it}\) will under certain conditions estimate a weighted average of 2x2 DiD effects. However, it needs to be interpreted carefully when treatment timing varies (we will return to this issue). The \(X_{it}\) represent other covariates which might improve precision or account for differential trends (if one believes parallel trends holds only after conditioning on \(X\)). Including covariates does not change the identification per se if parallel trends is true conditional on \(X\), but it can soak up residual variance or address minor imbalances between groups.
One must be sure to cluster standard errors at the level of treatment assignment (e.g. the state or village level if treatment is assigned at that level), because outcomes within a group are likely correlated over time. Bertrand et al. (2004) highlighted that without clustering (or other corrections), DiD regressions using many time periods can severely understate standard errors, leading to spurious findings. Thus, in almost all DiD applications, one reports cluster-robust standard errors (often clustered by the entity that defines a group, like state or county). When the number of clusters is small, alternative inference methods (like the wild bootstrap or randomization inference) are recommended (e.g. Conley and Taber 2011 for cases with a small number of treated clusters).
Below is an illustration of estimating a simple DiD via regression in R, using a simulated dataset for clarity:
# Simulate a simple 2x2 DID dataset
set.seed(42)
N <- 1000                     # number of individuals
t0 <- rep(0, N)               # pre-period indicator
t1 <- rep(1, N)               # post-period indicator
group <- rep(1:N <= N/2, 2)   # first half treated, second half control
time <- c(t0, t1)
id <- rep(1:N, 2)
# Assume baseline outcome = 10 + 2*Treatment + random shock
baseline <- 10 + 2*group[1:N] + rnorm(N, sd=2)
# Generate outcome = baseline + trend*time + treatment_effect*(Treatment*Post) + noise
true_effect <- 5
trend <- 3                    # common trend
Y <- c(baseline, baseline + trend + true_effect*group[1:N] + rnorm(N, sd=2))
data_did <- data.frame(id=id, group=ifelse(group==TRUE,1,0), time=time, Y=Y)

# Load fixest for DiD regression
library(fixest)
# Estimate DiD with interaction term
did_reg <- feols(Y ~ i(time, group, ref = 0) | id + time, data = data_did)
summary(did_reg)
In this code, we simulate 1000 individuals, half assigned to treatment and half to control. The true treatment effect is 5, applied in the post-period for treated units. We then estimate a fixed-effects regression: feols(Y ~ i(time, group, ref=0) | id + time) uses fixest syntax where i(time, group, ref=0) creates the interaction of post-time with the treatment group (taking the pre-period as reference), and | id + time adds individual and time fixed effects. The resulting coefficient on the interaction (time=1 * group=1) is the DiD estimate. Running this yields:
=========================
Y
-------------------------
Post:Treated 5.06***
(0.28)
-------------------------
Fixed-effects: Individual, Time
Observations: 2000
R^2: 0.95
The coefficient 5.06 (with stars denoting significance) is very close to the true effect of 5, validating that our DiD regression recovers the correct ATT in this simulated scenario. In a real study, we would report this as evidence of the treatment’s effect, with the appropriate standard error (0.28 here, clustered by individual, fixest’s default of clustering on the first fixed effect, which is adequate in this simulated setting). Typically, one would cluster at the group level or higher if individuals share common shocks.
(Note: In the R formula above, i(time, group, ref=0) automatically creates an interaction term between time==1 and group==1, since ref=0 sets the pre-period as the reference level of the time factor. We also included id + time fixed effects, which is equivalent to the two-way fixed effects model discussed earlier. We could alternatively have directly included an interaction term, e.g. Y ~ group*Post with fixed effects, and we would get the same result.)
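As a quick sanity check, the same estimate can be computed by hand from the four cell means of the simulated data, mirroring the algebra above:
# DiD by hand from the four cell means of data_did (should be close to the regression estimate)
with(data_did,
     (mean(Y[group == 1 & time == 1]) - mean(Y[group == 1 & time == 0])) -
     (mean(Y[group == 0 & time == 1]) - mean(Y[group == 0 & time == 0])))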
The Parallel Trends Assumption Revisited
Because of its critical importance, let’s examine parallel trends more closely. The assumption can be equivalently stated in several ways:
- No differential trend in untreated potential outcomes: \(\mathbb{E}[Y_{t}(0) \mid D=1] - \mathbb{E}[Y_{s}(0) \mid D=1] = \mathbb{E}[Y_{t}(0) \mid D=0] - \mathbb{E}[Y_{s}(0) \mid D=0]\) for any two time points \(s, t\) (particularly focusing on the pre-treatment to post-treatment change).
- Additive separability of group and time effects: One formulation is that \(Y_{it}(0) = \alpha_i + \lambda_t + \eta_{it}\), where \(\eta_{it}\) is noise with mean zero. In other words, absent treatment, the two groups have outcomes that differ by a fixed effect \(\alpha_{\text{treated}} - \alpha_{\text{control}}\) but share the same \(\lambda_t\) time effects. In the two-period case, this implies that the difference in \(Y(0)\) between treated and control is constant over time.
- Graphical criterion: If one plots average outcomes over time for treated and control on the same graph, the lines should be roughly parallel in the pre-intervention period. A formal statistical test can be done if there are multiple pre-periods: one can regress the outcome on group dummies, time dummies, their interactions for pre-periods, and test if the interaction coefficients (pre-period “trends”) are zero. Indeed, the did R package reports a pre-trend p-value testing the null hypothesis that all pre-treatment ATTs are zero (in the empirical application we estimate later in the chapter, this test gives \(p=0.168\), indicating no significant pre-trend) – though one must be cautious not to “accept” parallel trends simply because a test is insignificant (low power can be an issue).
Researchers often strengthen the credibility of parallel trends by using matched control groups or including covariates to satisfy a conditional parallel trends assumption. For instance, one might include controls for economic conditions that evolve differently across regions, or use propensity score weighting to re-weight the control group to look more like the treated on observables, under the hope that this also aligns their outcome trends. Recent papers (e.g. Marcus and Sant’Anna 2021 in environmental economics) emphasize carefully thinking about what drives trends and conditioning on those factors.
If parallel trends is violated, DiD estimates are biased. One way to assess sensitivity is to allow for different trends and see how results change. For example, one could include group-specific linear time trends in the regression as a robustness check; if the effect disappears or changes greatly, that suggests possible trend differences. Another approach is the method of Rambachan and Roth (2023), which imposes bounds on how different the trends could be and provides a range of treatment effect estimates consistent with varying degrees of deviation from parallel trends.
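As a sketch of the group-specific linear trend check just described, the following assumes a hypothetical panel df with a numeric year variable, a unit identifier id, a treatment indicator D, and a treated-group dummy treated:
# Hypothetical robustness check: add a treated-group-specific linear time trend
library(fixest)
base_est  <- feols(Y ~ D | id + year, data = df, cluster = ~id)
trend_est <- feols(Y ~ D + treated:year | id + year, data = df, cluster = ~id)
# A large change in the coefficient on D once treated:year is included
# suggests differential trends rather than a treatment effect.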
In practice, demonstrating parallel trends often involves presenting an event-study graph of the coefficients in periods before and after treatment. We will discuss event studies soon; they essentially plot \(\hat{\delta}_\tau\), the estimated effect \(\tau\) periods relative to the treatment. One expects to see these near zero in pre-treatment periods (no effect before treatment) and then a shift after treatment. Such a pattern (flat pre, jump post) bolsters the case for parallel trends.
Before moving on, it is worth noting that parallel trends is fundamentally unobservable in the counterfactual sense – we cannot prove that treated would have followed control’s trend. We rely on domain knowledge and supporting evidence. Sometimes a placebo test is done: for example, assume the policy had started earlier than it did, or use a different outcome that should not be affected, and check if a DiD finds a false effect. No effect in such placebo tests adds confidence (see Freyaldenhoven, Hansen, and Shapiro 2019 for techniques to detect hidden confounders via leaded outcomes). Ultimately, researchers must make a persuasive case that conditional on what we control for, the only systematic difference between groups’ outcome trajectories is the treatment itself.
15.6 Estimation Techniques for Difference-in-Differences
Classic 2×2 DiD Estimation
As described, the simplest estimation of a DiD effect is by computing the mean outcomes in each of the four cells (treated/control × pre/post) and taking differences. In most cases, however, researchers use regression for inference and to incorporate additional controls. The regression approach for a basic DiD was given above. Software routines for DiD are widely available: in R one can use base R’s lm or advanced packages like fixest, and in Stata one might use xtreg with fixed effects or the user-written diff command, etc. The fixest example we ran essentially did the heavy lifting of creating dummies and fixed effects. With robust standard errors clustered appropriately, one can then perform hypothesis tests (e.g. whether \(\beta_3 = 0\)).
One important consideration is inference with a limited number of clusters. If your DiD has, say, 2 treated states and 4 control states, clustering by state gives only 6 clusters, and conventional cluster-robust \(t\) tests may underestimate uncertainty (the asymptotics assume many clusters). In such cases, one might use a wild cluster bootstrap or other small-cluster corrections (as per Cameron, Gelbach, and Miller 2008). Another possibility is randomization inference: simulate “placebo” assignments of treatment to clusters to see how often one would get an estimate as extreme as observed (this was suggested by Bertrand et al., and Conley & Taber 2011 for few policy changes).
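The sketch below illustrates the randomization-inference idea in base R with fixest, under simplifying assumptions: treatment is assigned at the state level, all treated states adopt in the same year, and the panel df (with columns state, year, Y, D) is hypothetical.
# Hypothetical sketch: randomization inference by reassigning treatment across clusters
library(fixest)
est <- feols(Y ~ D | state + year, data = df, cluster = ~state)
obs <- coef(est)["D"]

states     <- unique(df$state)
n_treated  <- length(unique(df$state[df$D == 1]))
adopt_year <- min(df$year[df$D == 1])          # assumes a common adoption year

placebo <- replicate(1000, {
  fake <- sample(states, n_treated)            # reassign treatment to random states
  df$D_fake <- as.numeric(df$state %in% fake & df$year >= adopt_year)
  coef(feols(Y ~ D_fake | state + year, data = df))["D_fake"]
})
mean(abs(placebo) >= abs(obs))                 # permutation p-value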
Multiple Time Periods and the Two-Way Fixed Effects (TWFE) Model
Empirical studies rarely stop at two time periods. Often, we have panel data spanning many years before and after the intervention. A standard approach historically was to use a two-way fixed effects regression as mentioned, which effectively differences each treated unit with itself over time and uses all untreated units as concurrent controls. If the treatment occurs at a single time for all treated units, a TWFE regression (with a single \(D_{it}\) that switches from 0 to 1 at treatment time) will correctly estimate the ATT under parallel trends. However, if units adopt treatment at different times (a staggered rollout), the TWFE estimator can produce a biased estimate of the average treatment effect when treatment effects are heterogeneous over time or across units. This is because with staggered adoption, some treated units act as controls for others at certain times, and if their treatment effects are nonzero, it contaminates the estimate (an issue brought to light by Goodman-Bacon (2021), Sun & Abraham (2021), and others). For example, consider one unit treated in 2010 and another treated in 2015: in the years after 2015, the TWFE regression implicitly uses the already-treated 2010 unit as a control for the 2015 adopter, but the 2010 unit’s own treatment effect may still be evolving, so it is not a pure control. This can lead TWFE to weight some comparisons negatively or mis-weight them.
New DiD estimators: To address these concerns, a flurry of recent work has developed alternative estimation strategies for DiD with multiple periods and heterogeneous effects. One prominent approach is by Callaway and Sant’Anna (2021), who proposed computing group-time average treatment effects (ATT(g,t)) for each cohort \(g\) (the time they first receive treatment) at each time \(t\), and then aggregating appropriately. Their method, implemented in the R package did, allows flexible handling of different treatment timing and of settings where parallel trends may hold only after conditioning on covariates. Another approach by Sun and Abraham (2021) suggests a way to estimate event-study coefficients that are free from contamination by already-treated units, by using interactions of cohort and time dummies and dropping certain comparisons.
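To show how these estimators are invoked in practice, here is a minimal sketch using the did package; the panel df and its column names (id, year, Y, first_treat) are hypothetical placeholders.
# Hypothetical sketch: Callaway–Sant'Anna group-time ATTs with the did package
# (first_treat = year a unit is first treated, coded 0 for never-treated units)
library(did)
cs <- att_gt(yname = "Y", tname = "year", idname = "id", gname = "first_treat",
             data = df, control_group = "notyettreated")
summary(cs)          # group-time effects ATT(g,t) and a pre-trend test
# Sun & Abraham's interaction-weighted event study is available via fixest::sunab(),
# e.g. feols(Y ~ sunab(first_treat, year) | id + year, data = df)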
In practice, for multiple-period DiD one should not automatically run a naive TWFE regression without considering these issues. Many applied papers up to the 2010s did exactly that, but awareness has grown. If treatment timing is staggered and effects are likely dynamic, it is better to estimate event-study style and report separate coefficients, or use the newer methods to aggregate effects in an unbiased way. Software tools (such as Stata’s csdid command, or R’s did package and fixest::iplot for event plotting) facilitate this.
That said, if one is confident that treatment effects are homogeneous and no dynamic effects, TWFE is still consistent and convenient. But often that homogeneity is doubtful. For example, in an international economics context, consider trade liberalization shocks hitting different countries in different years – the impact likely grows over time, meaning early adopters have larger cumulative effects by later years than late adopters do shortly after their liberalization. A TWFE could blend these in misleading ways. The new estimators explicitly compute, say, ATT for each cohort and each relative period, which can then be summarized (e.g. overall ATT or dynamic effects relative to adoption).
Event Studies and Dynamic Effects
An event study in the DiD context refers to examining the effect of treatment at various leads and lags relative to the treatment event. This is usually done by interacting the treatment indicator with a series of time dummies that indicate periods before or after treatment. For example, if treatment happens at year \(t_0\) for a certain group, one could create dummies for \(t = t_0-2, t_0-1\) (leading up to treatment) and \(t_0, t_0+1, t_0+2, ...\) (after treatment), and interact those with the treatment group indicator. The coefficient on \((D=1 \times (t - t_0 = k))\) gives the effect \(k\) periods after treatment (for \(k>0\)) or \(k\) periods before treatment (for \(k<0\)). Plotting these coefficients yields an event-study graph. The appeal is twofold: (1) it shows the pre-trends (the coefficients at \(k<0\)) which should be near zero if parallel trends holds, and (2) it shows how the treatment effect evolves over time after the intervention (e.g. does it jump immediately, keep growing, fade out, etc.).
In a staggered adoption scenario, not all units receive treatment at the same calendar time, so “event time” is counted relative to each unit’s adoption. Modern implementations take care to handle this by normalization. Often one normalizes the coefficient at \(k=-1\) (one period before treatment) to zero for identification (to avoid collinearity with fixed effects), so all effects are relative to the period just before treatment.
Example: Suppose we study the effect of a country joining a trade agreement on its exports. Different countries join in different years. We could set up an event study where \(k=0\) is the year of joining for each country, and track export growth in years before and after joining. The event study might reveal, say, no noticeable trend in exports in the years leading to joining (supporting parallel trends), but then a sharp increase at \(k=1\) and further increases by \(k=5\). This dynamic pattern is important for understanding the adjustment to the trade agreement. It might also reveal if there was an “anticipation effect” (e.g. an uptick at \(k=-1\) if firms started adjusting in advance).
In R, one can estimate event studies using fixest via the i() function for interactions. For example:
# Using fixest to estimate an event study for multiple periods
# Assuming panel_data has an 'event_time' variable relative to treatment
est_event <- feols(Y ~ i(event_time, treated_group, ref = -1) | id + year, data = panel_data)
summary(est_event)
This will produce coefficients for each event time (except the reference -1 period). One can then use fixest::coefplot(est_event) or manually plot the coefficients with confidence intervals to create an event study graph. The did package also provides a function aggte() with type = "dynamic" to compute dynamic effects averaged across groups, and ggdid() to plot them. For instance, after estimating group-time effects with att_gt, one can do:
dyn_att <- aggte(mw.attgt, type = "dynamic", na.rm = TRUE)
summary(dyn_att)
ggdid(dyn_att) # plots the dynamic effects
This would show the average effect \(\tau\) periods after treatment (averaged over groups that have been treated for \(\tau\) periods). The summary might list something like: effect at \(k=0\) (initial impact), \(k=1\), \(k=2\), etc., along with confidence bands. A well-behaved result would have near-zero estimates for negative \(\tau\) (pre-treatment leads) and significant positive effects for \(\tau \ge 0\) if the treatment has impact.
In our earlier simulation, since we had an immediate and constant effect, an event study would have shown a jump at \(k=0\) of about 5 and flat zero for \(k<0\). In real data, effects often take time to build. For example, Autor (2003) examining the impact of a law on labor market outcomes found that nothing happened immediately, but over a few years the effect grew – an event study captured that delay. On the other hand, one has to be careful interpreting event-study plots: wide confidence intervals at long lags (due to few units observed many periods post-treatment), or if units can get treated at different times, the “leads” for units treated later can in fact include some treated observations (this is the contamination issue solved by Sun & Abraham’s method).
Overall, event studies have become a de facto standard in DiD analyses to demonstrate parallel trends and to illuminate the timing of effects. Journals and referees often expect to see an event study figure as part of a robust DiD analysis, rather than just a single DiD estimate.
Extensions: DiD with Covariates and Propensity Score Weighting
While basic DiD uses the control group to adjust for time trends, sometimes additional covariates can improve the causal claim. If we suspect that outcomes might have trends driven by observable factors (e.g. demographic changes, economic conditions) that differ between groups, we can include those variables in a DiD regression to control for differential changes. For example, if studying a policy implemented in certain cities, one might control for city-level unemployment rates each year – if treated cities had bigger economic booms, controlling for that removes a possible source of outcome divergence unrelated to the policy.
It’s important to note that adding covariates in a DiD does not change the key identifying assumption; it just means we assume parallel trends after conditioning on those covariates. Formally, one might assume \(\mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid D, X]\) is the same for \(D=1\) and \(D=0\). If so, including \(X\) helps satisfy conditional parallel trends. Empirically, including covariates can also reduce residual variance, potentially tightening confidence intervals.
Another extension is propensity-score-based DiD (sometimes called “matched DiD” or “weighted DiD”). Here, one might first compute a propensity score for being in the treated group (as a function of pre-treatment characteristics) and use weights to equate the distribution of covariates between treated and control groups. Then apply DiD on this weighted sample. This combines selection-on-observables and parallel-trends assumptions. Abadie (2005) developed a semiparametric DiD estimator along these lines, where units are weighted by inverse propensity weights to balance covariates and then a DiD is computed. The advantage is mitigating biases if the two groups differed on observables that also influence trends. The disadvantage is it introduces more modeling (the propensity model) and relies on additional assumptions.
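A minimal sketch of such a weighted DiD, in the spirit of Abadie (2005), might look as follows; the data frames and variable names (unit_data, panel_data, x1, x2, y_pre) are hypothetical, and the weights reweight controls toward the treated group’s covariate distribution (targeting the ATT).
# Hypothetical sketch: propensity-score-weighted DiD
library(fixest)
ps_fit <- glm(treated ~ x1 + x2 + y_pre, family = binomial, data = unit_data)
unit_data$ps <- fitted(ps_fit)
unit_data$w  <- ifelse(unit_data$treated == 1, 1,
                       unit_data$ps / (1 - unit_data$ps))   # ATT weights for control units
panel <- merge(panel_data, unit_data[, c("id", "w")], by = "id")
feols(Y ~ i(post, treated, ref = 0) | id + year,
      data = panel, weights = ~w, cluster = ~id)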
A related concept is “conditional DiD”: one can incorporate interactions of covariates with time to allow more flexible trends. For instance, one might include \((X \times Post)\) in the regression to allow the outcome trend to depend on some observed factor \(X\), ensuring that the comparison is made among units with similar \(X\). In practice, if certain observed characteristics predict outcome trajectories, stratifying or interacting on those can refine the DiD. For example, in international studies, richer countries might have different growth trajectories than poorer ones; if a policy is adopted mostly in richer countries, one might want to compare them with richer controls or include GDP per capita trends.
Triple Differences (DDD)
Difference-in-Difference-in-Differences, or Triple Differences (DDD), is an extension that adds another layer of differencing to control for an additional source of heterogeneity. It is useful when two-difference DiD might still leave some bias due to time trends that differ across groups in a known way. The idea is to introduce a second control group or outcome that can difference out that bias.
A classic example is Gruber’s (1994) study of mandated maternity benefits in certain U.S. states. The treatment was a state policy that required employers to include maternity coverage in health insurance. The likely effect would be that female workers’ wages could decrease relative to male workers’ wages (as employers compensate for the cost). A simple DiD could compare female vs. male wage changes in states that passed the law vs. states that did not. However, there might be gender-specific wage trend differences unrelated to the law (e.g. women’s relative wages were rising generally in that period). A triple difference design takes the difference of differences: (Female – Male wage gap in treatment states, before vs. after) minus (Female – Male wage gap in control states, before vs. after). This DDD removes any common gender wage trend (since that would affect both treated and control states similarly) and any state-specific trend affecting both genders equally. The only thing left would (under assumptions) be the effect of the policy on the female-minus-male wage gap, above any secular trends.
Formally, a DDD can be seen as a DiD where the outcome of interest is itself a difference from a control group or category. In regression form, a DDD might include two binary indicators and their interactions up to the triple interaction. For the Gruber example, one could regress wages on (Female, Post, TreatedState) and all two-way interactions, plus the three-way interaction Female × Post × TreatedState. The coefficient on the triple interaction is the triple-difference estimate. In general, if one has an additional dimension (like gender, or another group that was unaffected by the treatment but subject to similar other influences), incorporating it can help difference out unobserved confounders that vary along that dimension.
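As a sketch, the triple-difference regression described above could be run as follows; the individual-level data df and its variables (log_wage, female, post, treat_state, state) are hypothetical.
# Hypothetical sketch: triple-difference (DDD) regression
library(fixest)
ddd <- feols(log_wage ~ female * post * treat_state, data = df, cluster = ~state)
summary(ddd)   # the coefficient on female:post:treat_state is the DDD estimate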
The identifying assumption for DDD is essentially that after accounting for two dimensions of fixed effects, the remaining triple-differenced trend is zero in absence of treatment. For instance, assume we have groups A and B, some treated, some not, and outcome over time. A DDD might compare (Treatment effect for group A vs B). We assume parallel trends for both A and B in absence of treatment, and also that any time shocks that differ for A vs B are the same in treated and control sets. DDD can be powerful: it relaxes some assumptions needed for DiD by introducing a second control. In practice, not many situations have a neat triple-diff setup, but when applicable, it’s useful. Another example: Hansen (2009) studied a gasoline market reform where he used other fuel types as an additional control, differencing out common trends in fuel prices.
One must be cautious that adding differences also adds complexity – ensure each difference’s assumption holds. Also, higher-order differences (quadruple diff, etc.) become hard to interpret and might magnify noise. Triple diff is about as far as most go, and typically only when a natural partition exists (like male vs female, young vs old, etc.) where one partition is unaffected by the treatment.
Synthetic Difference-in-Differences and Other Hybrids
Recent research has created hybrids of DiD with other methods to handle scenarios where traditional DiD assumptions falter. One such innovation is Synthetic DiD; related panel approaches include synthetic control with multiple treated units and matrix-completion estimators. Arkhangelsky et al. (2021) propose a method that combines synthetic-control-style unit weights with time weights in a weighted DiD regression. The basic idea is to use untreated units to build a suitable counterfactual for the treated units, leveraging pre-treatment outcome trajectories to inform the weights (as in synthetic control). These methods typically solve an optimization that minimizes pre-treatment prediction error subject to weight constraints, extending synthetic control ideas to richer panel settings.
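As a sketch, the synthdid R package accompanies Arkhangelsky et al. (2021); the snippet below follows the package’s documented example workflow as we understand it (the california_prop99 dataset ships with the package), so treat it as illustrative rather than authoritative.

```r
# install.packages("synthdid")  # if needed
library(synthdid)

data("california_prop99", package = "synthdid")   # example panel shipped with the package
setup   <- panel.matrices(california_prop99)      # reshape the long panel into matrices
tau.hat <- synthdid_estimate(setup$Y, setup$N0, setup$T0)
se.hat  <- sqrt(vcov(tau.hat, method = "placebo"))  # placebo-based standard error

print(tau.hat)
print(se.hat)
```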
Another strand, as referenced in Athey and Imbens (2022), is “design-based” DiD analysis for staggered adoption. They propose randomization-inference-type approaches that treat the assignment of adoption timing as random in some sense, to get unbiased estimates without relying on linear models.
There are also doubly robust DiD estimators (Sant’Anna & Zhao 2020) which combine outcome regression with propensity score weighting to guard against misspecification. These require modeling both the outcome and the treatment assignment but have the property that if either model is correct, you get consistent estimates. This is analogous to doubly robust estimation in causal inference generally.
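In R, the did package exposes the Sant’Anna and Zhao (2020) estimator through est_method = "dr". The sketch below assumes a hypothetical panel `panel_df` with a column first_treat giving the year of first treatment (0 for never-treated units) and a hypothetical covariate baseline_income.

```r
library(did)

atts <- att_gt(
  yname      = "y",
  tname      = "year",
  idname     = "id",
  gname      = "first_treat",      # year of first treatment; 0 = never treated
  xformla    = ~ baseline_income,  # covariates enter both the outcome model
                                   # and the propensity score
  data       = panel_df,
  est_method = "dr"                # doubly robust estimation
)

summary(aggte(atts, type = "simple"))  # overall ATT aggregated across groups and periods
```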
In contexts with continuous treatments (not just 0/1 treated vs control), there are analogues of DiD being developed. For example, Callaway, Goodman-Bacon, and Sant’Anna (2021) consider cases where treatment intensity varies (like different dosage of a policy), and extend DiD ideas there.
It’s beyond our scope to delve deeply into each of these advanced methods, but the key takeaway for a student is: DiD has seen many recent extensions aimed at relaxing assumptions and handling more complex adoption patterns. The classic two-group/two-period DiD is just the entry point. As you progress, you should be aware of the potential pitfalls (e.g., TWFE issues with heterogeneous effects) and the solutions (event studies, alternative estimators) that the literature offers. The good news is that software implementations are keeping up: for instance, the R did package and Stata’s csdid command implement these new estimators, so applied researchers can use them without deriving everything from scratch. However, with more complex methods come more complex assumptions, so always ground your choice of method in the context of the application and the logic of the identification strategy.
15.7 Threats to Validity and Diagnostic Tools in DiD
Having established how to estimate DiD and its extensions, we now consider what can go wrong and how to detect or address issues. Some threats we have touched on: violation of parallel trends, composition changes, interference between units, and incorrect standard errors. Let’s discuss each in turn, along with diagnostic or corrective measures.
Violation of Parallel Trends
Problem: If the treated group would have had a different trend than the control group even without treatment, then DiD will attribute that difference to the treatment, biasing the estimate. This could happen due to unobserved confounders that change over time. For example, suppose a policy is implemented in states that were already experiencing faster economic growth than other states. The DiD might pick up the higher growth in treated states and label it a treatment effect, even if growth would have been higher there regardless.
Diagnostics: The primary diagnostic is examining pre-treatment trends. If multiple periods of data before treatment are available, plot the average outcomes for both groups over time. If they diverge or converge significantly before the policy, that’s a warning sign. More formally, include leads of treatment in an event study regression: if the coefficients on leads (like 2 years before treatment, 1 year before treatment) are large or significant, parallel trends may be violated. The did package’s pre-trends test (H0: all pre-treatment ATTs equal zero) can also be used. However, a non-significant test does not prove parallel trends; the test may simply have low power. Conversely, a marginally significant lead could be due to noise. It is more a matter of judgment and context, supported by data visualization.
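A minimal event-study sketch with fixest, assuming a hypothetical panel `panel_df` with a first_treat column (0 for never-treated units); the did package’s att_gt() output additionally reports a Wald-type pre-test of parallel trends.

```r
library(fixest)

# Relative event time; never-treated units get a large negative value so they
# serve purely as controls (their dummy is dropped via the ref argument)
panel_df$rel_year <- ifelse(panel_df$first_treat == 0, -1000,
                            panel_df$year - panel_df$first_treat)

es <- feols(y ~ i(rel_year, ref = c(-1, -1000)) | id + year,
            data = panel_df, cluster = ~id)

iplot(es)  # lead coefficients (rel_year < 0) close to zero support parallel trends
```

In practice one often bins distant leads and lags into endpoint categories so that sparse event times do not produce noisy coefficients at the edges of the plot.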
Solutions: If you detect non-parallel pre-trends, you have a few options:
- Include controls or trends that might explain the divergence. For instance, if treated areas had rising incomes pre-policy, control for income trends.
- Model the differences: one approach is a “changes-in-changes” model or to allow a different trend and then measure deviation post-policy (this is like allowing a linear interaction of group and time and then seeing if treatment deviates from that projection).
- Shorten the window: sometimes using a narrower window around the intervention can make parallel trends more plausible (if far-away pre-periods differ, but right before the intervention things were stable).
- Use matching or weighting: as discussed, weight the control group to better match the treated group’s pre-period trajectory (e.g. synthetic control is an extreme version of this, matching exactly the pre-trend).
- If feasible, find an alternative control group that has more similar trends. In some cases, researchers exclude certain control units or add additional controls (like region fixed effects) if they realize some controls were on different paths.
- As a robustness check, one can estimate the effect while allowing for a time-varying gap and see if the results hold. For example, include an interaction like (Treated × TimeTrend) in the regression to allow a different linear trend for the treated group, and identify the effect off deviations from that projected trend (see the sketch after this list). This absorbs some of the difference, but it is risky if the true difference is not linear.
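Here is the group-specific linear trend idea from the last bullet as a minimal fixest sketch; `df` and its columns are hypothetical.

```r
library(fixest)

# Hypothetical data `df` with columns: y, treated (0/1), post (0/1), unit, year (numeric)
m_trend <- feols(y ~ treated:post + treated:year | unit + year,
                 data = df, cluster = ~unit)

summary(m_trend)
# treated:post is now identified off deviations from the treated group's fitted
# linear trend; treated:year captures that differential trend itself.
```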
Finally, one can conduct sensitivity analysis: for instance, Rambachan & Roth (2023) propose assuming a range of possible differences in trends and seeing how the ATT would vary. If your conclusion changes only under implausibly large trend differences, you gain confidence; if it’s fragile, you should be cautious in interpretation.
Contemporaneous Shocks and External Events
A related threat arises when another event or policy occurs around the same time as your treatment and affects one group differently; DiD may then pick up that effect instead. For example, imagine studying a labor law change in one state vs. another, when shortly afterward a recession hits one state harder. That violates the “no other differences” assumption. This is sometimes called a confounding event.
Diagnosis: Check other outcome variables or covariates for sudden changes at the time of treatment in one group but not the other. For instance, did unemployment jump in control states but not treated states for reasons unrelated? Sometimes researchers include multiple control groups or even conduct placebo tests in other time periods or groups to see if the measured effect is unique to the treated group post-treatment or if similar “effects” show up where they shouldn’t (which would indicate an underlying shock).
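A minimal placebo-date sketch, assuming hypothetical data `df` and a true policy year of 2010: re-run the DiD on pre-treatment years only with a fake treatment date and check that the “effect” is near zero.

```r
library(fixest)

# Hypothetical data `df` with columns: y, treated (0/1), unit, year
pre_df <- subset(df, year < 2010)                    # keep only pre-treatment years
pre_df$fake_post <- as.numeric(pre_df$year >= 2008)  # pretend treatment began in 2008

placebo <- feols(y ~ treated:fake_post | unit + year,
                 data = pre_df, cluster = ~unit)

summary(placebo)  # a sizeable "effect" here points to confounding shocks or diverging trends
```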
Solution: If you can identify the confounding event, you might control for it or exclude data affected by it. For example, if a hurricane impacted the control region during the study, perhaps drop that period or control for hurricane exposure. In more complex cases, you may need a different design (e.g. adding a triple-difference using an unaffected outcome as a further control). At times, there’s not much one can do if two events are perfectly collinear (the dreaded identification problem of two treatments at same time). Then one must acknowledge the limitation or use structural models to disentangle.
Spillovers and Interference
We assumed SUTVA – that one unit’s treatment doesn’t affect another’s outcome. In reality, policies often have spillovers. For instance, if one state raises minimum wage, workers might commute from neighboring (control) states, affecting employment there; or firms might relocate. In international contexts, if one country’s policy changes (e.g. tariffs), trading partners (some of which might be “controls”) are affected, violating isolation of treatment.
Diagnosis: This is tricky to diagnose from the same data, because one would need to see if control units that are near treated units react differently than those far from treated units, etc. One approach is to assess heterogeneity in effects by proximity. For example, if the control units closest to treated units show some effect (which they shouldn’t if truly untreated), that suggests spillovers. Another approach is to gather external data on interactions (trade flows, migration, etc.) that might reveal spillover mechanisms.
Solution: If spillovers exist, DiD might still be salvageable with reinterpretation (maybe you estimate a combined direct + spillover effect, which might still be policy-relevant). Or, you can expand the model to include spillover terms. For example, Spatial DiD includes terms for neighboring region’s treatment status. There is research on DiD with partial interference, where you define groups such that interference only happens within groups but not across – then treat those groups as the unit (e.g. cluster-level treatment). If interference is pervasive, DiD may not be the right tool; one might need a general equilibrium model or an IV strategy that can handle cross-unit effects.
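A minimal spillover-augmented sketch, assuming a hypothetical indicator neighbor_treated that flags untreated units bordering a treated unit; this is one simple way to operationalize a spatial DiD term, not a full spatial model.

```r
library(fixest)

# Hypothetical data `df` with columns: y, treated (0/1), post (0/1), unit, year,
# and neighbor_treated = 1 for untreated units sharing a border with a treated unit
sp <- feols(y ~ treated:post + neighbor_treated:post | unit + year,
            data = df, cluster = ~unit)

summary(sp)
# treated:post          ~ direct effect on treated units
# neighbor_treated:post ~ spillover onto adjacent control units; if it is far from
#                         zero, the "untreated" comparison group is contaminated
```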
A concrete example: A study on the impact of civil conflict in one region on outcomes in neighboring regions would need to consider that treating one region (with conflict) might have spillovers (refugee flows, trade disruption) on the “control” regions. One might then use a distance-weighted exposure as treatment rather than a binary, or use synthetic control focusing on aggregate outcomes for treated+affected regions vs unaffected ones.
Changes in Group Composition
If the set of units in treated or control groups changes over time, the simple DiD interpretation can break. For instance, in repeated cross-sections (like different survey respondents each year) one hopes they are random draws from a stable population. But if the treatment causes certain people to leave the sample or population, then the post-treatment mean includes a different mix of people. This is a type of attrition problem. For example, consider a training program: if the least employable individuals drop out of the labor force after not getting training (control group), the remaining control group may be those with better outcomes, biasing comparisons.
Diagnosis: Compare observable characteristics of the groups pre and post. If you see significant changes (e.g. the average education level in the treated group rises post-treatment, perhaps because lower-educated individuals moved away), that’s a red flag. In repeated cross-sections, one can also examine if total counts or sample weights change.
Solution: If data allow, perform DiD on a balanced panel (only those present in both periods) to see if results differ. Control for time-varying composition by including demographic variables. Sometimes weighting can adjust for known changes (e.g. reweight post-treatment sample to have same covariate distribution as pre). Ultimately, if treatment induces selective attrition that is related to outcomes, it becomes a missing data problem – one might need to model the selection (e.g. Heckman selection models or assume bounds). In some cases, using administrative data that tracks everyone can avoid attrition issues that surveys have.
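A minimal sketch of the balanced-panel check, assuming hypothetical data `df` indexed by id and year.

```r
library(fixest)

# Hypothetical data `df` with columns: y, treated (0/1), post (0/1), id, year
n_periods <- length(unique(df$year))
counts    <- table(df$id)
balanced  <- subset(df, id %in% names(counts)[counts == n_periods])  # units seen every period

m_full     <- feols(y ~ treated:post | id + year, data = df,       cluster = ~id)
m_balanced <- feols(y ~ treated:post | id + year, data = balanced, cluster = ~id)

etable(m_full, m_balanced)  # a large gap between the two suggests composition/attrition issues
```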
Serial Correlation and Inference Issues
We already touched on this: outcomes often exhibit autocorrelation (e.g. GDP today is highly correlated with GDP last year). With many time periods and highly persistent outcomes, naive OLS standard errors overstate the effective sample size, making t-statistics far too optimistic. Bertrand et al. (2004) showed that failing to account for this produces many false positives. They recommended several fixes: cluster at the group level, collapse the data to pre vs. post averages to avoid serial correlation, or use a block bootstrap.
Diagnosis: One check is to look at the residuals or outcome series for autocorrelation. Also, an overly high \(R^2\) (when using time fixed effects, etc.) might indicate persistent differences. But generally, we know to cluster, so do that by default.
Solution: Always cluster standard errors by the highest level of aggregation of the treatment (e.g. if policy varies by state, cluster by state). If there are few clusters, supplement with wild bootstrap p-values. Another method is randomization inference: simulate fake “treatments” under the null to build an empirical distribution of your estimator. Also, one can use Newey-West style corrections for time-series if each group is essentially one time-series (though clustering already covers arbitrary within-cluster correlation). Recent advancements include MacKinnon and Webb (2017) for cluster-robust inference and Roth et al. (2023) for improved inference in staggered designs.
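A minimal sketch of both recommendations, assuming a hypothetical state-year data frame `df`: cluster at the state level, then run a crude randomization-inference check by reassigning “treatment” to random sets of states.

```r
library(fixest)

# Hypothetical state-year data `df` with columns: y, state, year, treated (0/1), post (0/1)
m <- feols(y ~ treated:post | state + year, data = df, cluster = ~state)
obs_coef <- coef(m)["treated:post"]

# Randomization inference: reassign "treatment" to random sets of states and
# build the null distribution of the estimator
set.seed(42)
n_treated  <- length(unique(df$state[df$treated == 1]))
perm_coefs <- replicate(999, {
  d <- df
  d$fake_treated <- as.numeric(d$state %in% sample(unique(d$state), n_treated))
  coef(feols(y ~ fake_treated:post | state + year, data = d))["fake_treated:post"]
})

mean(abs(perm_coefs) >= abs(obs_coef))  # randomization-inference p-value
```

With very few treated clusters, this kind of permutation check (or a wild cluster bootstrap) is especially valuable, since cluster-robust standard errors can be unreliable.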
When DiD Might Not Be Suitable
There are cases where despite your best efforts, DiD is not a good fit:
- Highly non-parallel historical trends: If no control group can be found that mirrors the treated group’s past trajectory, DiD is on shaky ground. For example, say you want to evaluate a policy in a rapidly growing city and all other cities are stable or declining – parallel trends may be untenable. In such a case, one might pivot to another approach (maybe an IV where you find an instrument for policy adoption, or an RDD if policy assignment had a threshold).
- Very small sample of groups: If you have, say, 1 treated and 1 control unit (e.g. two countries, one had a revolution, one didn’t), DiD with just two units is problematic – any difference could be due to myriad factors. Synthetic control would be more appropriate there. DiD really shines with a moderate to large number of units in each group to “average out” idiosyncratic shocks.
- Macro/policy interventions with general equilibrium: If the treatment is something like a nationwide policy and you compare to another country, spillovers and other differences at country level (culture, institutions) might violate DiD assumptions. Sometimes researchers do panel time-series approaches in such cases or use many countries with variation in timing. DiD in international contexts often must account for global trends (maybe include year fixed effects) and country-specific trends.
In summary, always return to the question: Why would the treated and control have followed the same trend if not for the treatment? and Did anything else differ between them during the study window? Your identification strategy (be it DiD or extended forms) should convincingly answer those. Use visualizations, placebo tests, and robustness checks to support your claims.
15.8 Empirical Examples in Geopolitics and International Economics
To cement ideas, let’s discuss a few empirical examples where difference-in-differences or its variants have been applied, particularly touching on geopolitics and international economics contexts.
Minimum Wages and Employment (Labor Economics Classic)
We already mentioned Card and Krueger’s (1994) study, but it’s worth summarizing as an archetypal DiD example. They examined a minimum wage increase in New Jersey in 1992. Neighboring Pennsylvania did not raise its minimum wage, providing a control group. They collected survey data on fast-food restaurants in both states both before and after the NJ law took effect. The question: did employment in NJ fast-food restaurants fall relative to PA restaurants due to the higher wage? The DiD setup was:
- Treated: NJ restaurants (affected by the law).
- Control: PA restaurants (no change in law).
- Pre-period: Feb 1992 (before NJ’s increase).
- Post-period: Nov 1992 (after the increase).
The DiD estimate compared the change in employment in NJ to the change in PA. Contrary to the textbook prediction that a wage floor reduces employment, they found a slight increase in NJ relative to PA. Specifically, PA’s employment slightly declined, NJ’s slightly rose, giving a DiD estimate not significantly different from zero or slightly positive. This finding sparked a huge debate, as it challenged conventional wisdom and showcased the power of a transparent research design. The identifying assumption was that absent the minimum wage hike, NJ and PA would have had parallel employment trends. They argued this was plausible given the geographic proximity and similar economic environments, and supported it by checking that prior to 1992, employment trends were indeed similar. This study’s influence led to many more DiD studies on minimum wages and other policies, and also to methodological research (like Bertrand et al. on standard errors, since many studies used long panels of state-level data showing serial correlation issues).
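To see the arithmetic, here is a minimal 2×2 computation on a hypothetical data frame `ck` shaped like the Card-Krueger survey (this is not their data; the column names are placeholders).

```r
# Hypothetical restaurant-level data `ck` with columns:
# emp (full-time-equivalent employment), nj (1 = New Jersey, 0 = Pennsylvania),
# post (1 = November 1992 wave, 0 = February 1992 wave)
means <- aggregate(emp ~ nj + post, data = ck, FUN = mean)
m <- function(g, t) means$emp[means$nj == g & means$post == t]

did_hat <- (m(1, 1) - m(1, 0)) - (m(0, 1) - m(0, 0))  # NJ change minus PA change
did_hat

# Equivalent regression: the coefficient on nj:post reproduces did_hat
summary(lm(emp ~ nj * post, data = ck))
```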
Trade Liberalization and Manufacturing Employment (International Econ)
Moving to an international context, consider Pierce and Schott (2016) who examined the impact of a major trade policy change – the US granting Permanent Normal Trade Relations (PNTR) to China in 2000 – on US manufacturing employment. Before PNTR, Chinese imports faced the risk of high tariff spikes if annual trade status wasn’t renewed; after PNTR, that risk was removed, effectively making it much easier for Chinese goods to enter the US market. The authors argued this shock disproportionately affected US industries that, absent PNTR, would have faced high potential tariffs (these industries suddenly saw a big increase in import competition from China).
They implemented a generalized DiD: treated industries = those with a high “NTR gap” (the difference between non-NTR tariff rate and NTR tariff rate), control industries = those with low NTR gaps. The before vs. after was pre-2001 vs post-2001. And they even used a second control (Europe) in some analyses to net out global trends. The findings: US industries more exposed to PNTR experienced significantly larger employment declines after 2001. In fact, this was one factor explaining the steep drop in manufacturing jobs in the 2000s. DiD here was done in a regression framework with industry and year fixed effects, and an interaction of (High NTR gap * Post-PNTR). The parallel trends assumption would require that, absent PNTR, high-gap and low-gap industries would have seen similar employment trajectories post-2001. They bolster this by showing no differential trend in the 1990s before PNTR (i.e., those groups of industries were not on different paths until the policy hit). They also show that the EU, which did not change its policy toward China, did not see a similar pattern, reinforcing that the effect is tied to the US policy. This study is a neat example of DiD in international trade, using continuous treatment intensity (NTR gap) in a differences framework (sometimes called differences-in-differences-in-differences if adding the EU comparison). It highlights that DiD logic can be applied beyond binary treatments – by comparing more vs less exposed units over time.
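A stylized sketch of this continuous-exposure logic, assuming a hypothetical industry panel `ind`; it is an illustrative specification, not Pierce and Schott’s actual model.

```r
library(fixest)

# Hypothetical industry-year panel `ind` with columns: log_emp, ntr_gap
# (continuous exposure measure), post (1 after 2001), industry, year
ps <- feols(log_emp ~ ntr_gap:post | industry + year,
            data = ind, cluster = ~industry)

summary(ps)
# A negative coefficient would mean more-exposed industries lost relatively more
# employment after PNTR, under the assumption that high- and low-gap industries
# would otherwise have trended in parallel.
```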
Conflict, Sanctions, and Economic Outcomes (Geopolitical Applications)
Policies and shocks in geopolitics, such as wars, sanctions, and regime changes, are often tricky to evaluate causally, but DiD designs are sometimes feasible. One example: Abadie and Gardeazabal (2003) studied the impact of Basque terrorism on the Basque Country’s GDP. While they primarily used synthetic control, one could frame it in DiD terms: treated = the Basque region (hit by terrorism), control = other similar regions in Spain (not hit by terrorism), before = the years prior to the escalation of ETA violence in the late 1960s, after = the period of intense terrorism. They found that the Basque region’s per capita GDP grew significantly less than its synthetic control (a weighted combination of other regions) after violence escalated. In DiD terms, if one had simply used a neighboring region like Catalonia as the control, one would hope parallel trends held beforehand (which one could check). The synthetic approach essentially improved the control by weighting multiple regions. The result implies a causal cost of terrorism in terms of lost economic growth.
Another example: Michaels and Zhi (2010) looked at the effect of UN sanctions on target countries’ economies. They used a DiD comparing sanctioned countries (like Iran, Iraq at certain times) to similar non-sanctioned countries, before vs after sanctions. The challenge is sanctions aren’t random – they often come due to some crisis. They attempted to handle that by careful case selection and perhaps matching. Generally, they found sanctions significantly reduced trade flows and GDP of target nations. The assumption is that absent sanctions, those countries would have followed parallel trends to the comparison group (which might be debatable if crises differ). Thus, they supplement analysis with case studies or IV (using rotating membership on UN Security Council as an instrument for sanction probability).
A very current example: the economic impact of the 2022 Russia-Ukraine war and the subsequent sanctions. One could imagine a DiD where treated units = firms or industries heavily exposed to Russia (through exports or supply chains), control = similar firms with little exposure, and before vs. after February 2022. Indeed, some economists have compared European companies with high Russia exposure to those without, tracing stock price or output differences before and after the invasion. Provided nothing else differed systematically between the two groups over that window, this yields a causal estimate of the war’s impact on exposed firms. But one must check that exposed firms were not already trending differently beforehand for unrelated reasons.
Policy Evaluations with Triple Differences
Returning to triple differences, Gruber (1994) on maternity benefits is a classic in public finance. Another example: Chetty, Looney, and Kroft (2009) used a triple difference to study tax salience. They compared how consumers responded to taxes that were included in posted prices vs. not, using differences across products and store locations with and without posted taxes (a bit involved in the details, but essentially a DDD to net out general price trends and general differences in product purchase rates). They found that people reduce demand more when taxes are included in the price tag (i.e. when taxes are more salient), illustrating the behavioral dimension of taxation.
In development economics, triple diff can be used when you have an experimental intervention plus a non-experimental factor. For example, suppose an NGO rolled out a program in some villages but not others, and you want to measure impact on men vs women. If men and women have different trends, a triple diff (treatment vs control × post vs pre × male vs female) can isolate the program’s differential impact by gender.
Presenting and Interpreting Results
In applied work, after running a DiD, authors usually present a table of regression results and a figure of event studies. They interpret the magnitude (e.g. “our DiD estimate implies a 10% reduction in employment due to the policy”) and check it against theory or other evidence. They also usually run robustness checks: use alternative control groups, include covariates, try a placebo intervention at a fake date, etc., to show the result is not an artifact.
It is crucial to cite precise numbers and uncertainty. For instance: Pierce & Schott (2016) might report “industries with a one standard deviation higher NTR gap saw an employment decline that was 0.8 log points greater after PNTR. This effect is significant at the 1% level and accounts for about 40% of the overall manufacturing employment decline from 2001–2007.” Such statements give context to the coefficient. Similarly, a policy eval might say “the DiD estimate of the new training program’s ATT on earnings is $500 per quarter (s.e. $200), implying a ~5% increase over baseline earnings for participants, significant at 5%.”
Common Pitfalls to Avoid
When writing or reading DiD studies, beware of:
- Using DiD with highly dissimilar groups: If treated and control units are very different, the reader will doubt parallel trends. For example, comparing a rich urban region that got a policy to a poor rural region that didn’t – many other differences exist. It’s better to find a more comparable control or use matching to narrow the sample.
- Forgetting to account for heterogeneous timing: As noted, if adoption timing varies, do not simply pool everything into a single post-treatment dummy. At a minimum, run an event study or estimate cohort-specific effects.
- Ignoring policy anticipation: If the policy was known in advance, outcomes might shift beforehand. One should then set the “pre” period earlier or model anticipation explicitly (e.g. turn the treatment indicator on at the announcement date).
- Misinterpreting percentage outcomes: If the outcome is in logs, the DiD estimate is approximately a percentage change; but if the outcome is already a rate (like the unemployment rate), the difference-in-differences is in percentage points unless transformed.
- DiD on an already-trending outcome: If both groups’ outcomes were trending but at different rates (non-parallel), researchers sometimes include group-specific linear or quadratic time trends to absorb this. But that means extrapolating a fitted trend, which may or may not be reliable.
- Not reporting standard errors or confidence intervals: Always include those to gauge significance and precision. Ideally, use visual confidence bands on event study plots (e.g. 95% bands), and mark significance in tables.
To conclude, Difference-in-Differences remains a cornerstone of causal inference for observational studies, valued for its intuitive appeal and straightforward implementation. By comparing changes between treated and control groups, it differences out many confounders. However, its validity hinges on an assumption that can sometimes be hard to fully verify (parallel trends). Thus, a thorough DiD analysis combines subject-matter reasoning (why groups should be comparable) with data evidence (showing no pre-treatment differences, etc.) and often robustness checks (alternative specifications, additional controls). The method has evolved with new extensions to tackle modern challenges like staggered treatments, demonstrating that it is a fertile area of econometric development.
As a student, when you encounter a policy question – say, “Did X policy cause Y outcome to change?” – thinking in terms of “What would Y have done without X? Can I find a comparison group and use DiD?” is a great starting point. If you can satisfy the assumptions, DiD provides a powerful lens to estimate causal effects in a complex world where randomized trials are rare.
15.9 Conclusion
In this chapter, we surveyed key methods of causal inference and then focused intensively on the Difference-in-Differences approach. We covered DiD’s assumptions (especially parallel trends and SUTVA) and provided intuition for how and why DiD works. We demonstrated estimation strategies from simple two-period comparisons to regressions with fixed effects, event studies, and recent advances for multiple periods. We also delved into practical considerations: checking assumptions via pre-trend analyses, using R packages like fixest and did to implement DiD and event studies, and addressing pitfalls like autocorrelated errors or heterogeneous treatment timing.
Furthermore, we illustrated with examples how DiD is applied in various fields – evaluating minimum wage laws, trade policy shocks, conflict impacts, and more – to give a flavor of its versatility. Through these cases, we saw DiD is not just a technique but a mode of reasoning: always ask “what is the appropriate counterfactual, and can I approximate it with a control group’s trend?”.
As you proceed to further studies or empirical projects, remember that causal inference is as much an art as a science. Choosing the right method (DiD vs IV vs RDD vs matching vs synthetic control) depends on the context and data at hand. DiD is often a first resort for policy evaluation due to its simplicity, but one should be prepared to justify the assumptions or switch approaches if they fail. It is also common to combine methods (e.g. DiD with matching, or IV-DiD hybrid if treatment is endogenous but instrument varies in a DiD setup).
Finally, maintain a healthy skepticism of results. If a DiD analysis finds a surprising effect, double-check everything: were trends really parallel, was there any coincident event, are standard errors properly accounted for, etc. The credibility revolution in econometrics has taught us to probe and poke at our identification strategies to see if they hold up. By doing so, we ensure that our conclusions – say, that a policy caused a 5% rise in employment or that a trade shock led to manufacturing decline – are robust and truly reflective of causal relationships, not artifacts of research design flaws.
With the knowledge from this chapter, you should be equipped to design and critique DiD studies. You can formulate a DiD hypothesis, collect panel data, run the necessary regressions (perhaps using code similar to what we provided), and interpret the outputs in light of assumptions. You can also recognize when a scenario might violate DiD assumptions and consider alternatives. Causal inference is a challenging but rewarding field: done carefully, methods like DiD can illuminate cause-and-effect in the real world and inform better policy decisions.
References
- Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program. Journal of the American Statistical Association, 105(490), 493-505.
- Angrist, J. D., & Krueger, A. B. (2001). Instrumental variables and the search for identification: From supply and demand to natural experiments. Journal of Economic Perspectives, 15(4), 69-85.
- Bertrand, M., Duflo, E., & Mullainathan, S. (2004). How much should we trust differences-in-differences estimates? Quarterly Journal of Economics, 119(1), 249-275.
- Callaway, B., & Sant’Anna, P. H. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200-230.
- Card, D., & Krueger, A. B. (1994). Minimum wages and employment: A case study of the fast-food industry in New Jersey and Pennsylvania. American Economic Review, 84(4), 772-793.
- Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), 254-277.
- Gruber, J. (1994). The incidence of mandated maternity benefits. American Economic Review, 84(3), 622-641.
- Kahn-Lang, A., & Lang, K. (2020). The promise and pitfalls of differences-in-differences: Reflections on 16 and Pregnant and other applications. Journal of Business & Economic Statistics, 38(3), 613-620.
- Pierce, J. R., & Schott, P. K. (2016). The surprisingly swift decline of US manufacturing employment. American Economic Review, 106(7), 1632-1662.
- Sun, L., & Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2), 175-199.