Introduction about design-based studies

Another case where data visualizations can be of particular utility is in the context of describing average levels or change over time in a design-based study of some type.


How do we get there?

We will be focusing on making plots that visualize either trends over time for the treated and control groups of a design based study, or just the difference in trends between these two focal groups.

For relatively simple designs, it’s straightforward to estimate the models and make predictions using the fixest library, which handles incorporating fixed effects and clustering errors quite conveniently. However, once we move towards more complex designs where we model treatment heterogeneity and dynamic treatment effects, in such cases there is a very useful package called did that provides a convenient means for estimating these models.


Setup

Software dependencies

library(tidyverse)   #dplyr, ggplot2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(fixest)      #for simpler DD models
library(did)         #for event study models
library(broom)       #quickly process model output


Data

Fake two-period DD data

#first let's simulate some simple two period DD data to model
set.seed(20220418)

#make the 2 period data
sim_2_period_param <- reset.sim(time.periods = 2)
sim_2_period <- build_sim_dataset(sim_2_period_param)

#look at the simulated data
glimpse(sim_2_period)
## Rows: 6,816
## Columns: 7
## $ G       <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ X       <dbl> 0.3252394, 0.3252394, -0.2484243, -0.2484243, 0.2038793, 0.203…
## $ id      <int> 1, 1, 2, 2, 4, 4, 7, 7, 8, 8, 10, 10, 13, 13, 15, 15, 17, 17, …
## $ cluster <int> 15, 15, 23, 23, 6, 6, 49, 49, 3, 3, 20, 20, 25, 25, 13, 13, 49…
## $ period  <dbl> 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1,…
## $ Y       <dbl> 3.0012710, 5.3261926, 2.4214177, 2.9075177, 3.0204852, 5.21777…
## $ treat   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…


Real-world event study data

The other data that we will look at are from a recent study about using difference-in-difference designs where there are multiple tiem periods of observation and treatment heterogeneity among the group receiving the intervention. These data are a strongly-balanced panel covering 500 U.S. counties observed across 5 time periods (2003-2007). The study these data are drawn from analyzes the effect of minimum wage policies on teen employment, and since the laws were passed at different times there is a need for an event study design that considers these differences in exposure timing and length.

Callaway, B., & Sant’Anna, P. H. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200-230.

mpdta <- read_csv("./data/mpdata.csv")
## Rows: 2500 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): year, countyreal, lpop, lemp, first.treat, treat
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(mpdta)
## Rows: 2,500
## Columns: 6
## $ year        <dbl> 2003, 2004, 2005, 2006, 2007, 2003, 2004, 2005, 2006, 2007…
## $ countyreal  <dbl> 8001, 8001, 8001, 8001, 8001, 8019, 8019, 8019, 8019, 8019…
## $ lpop        <dbl> 5.896761, 5.896761, 5.896761, 5.896761, 5.896761, 2.232377…
## $ lemp        <dbl> 8.461469, 8.336870, 8.340217, 8.378161, 8.487352, 4.997212…
## $ first.treat <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
## $ treat       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…


Dependent variable

The outcome is the log of teen employment from 2004 to 2007, taken from the Quarterly Workforce Indicators (QWI).

summary(mpdta$lemp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.099   4.868   5.697   5.773   6.684  10.443


What are the average treatment effects?

Two time periods

Since a two period design produces a single estimate for the ATT, this is really a case where graphics don’t offer much beyond conveying this information through a table if our focus is on a single model.

#estimate TWFE
model_2period <- feols(Y ~ period * treat | id, sim_2_period)
## The variable 'treat' has been removed because of collinearity (see $collin.var).
#give the summary info
summary(model_2period)
## OLS estimation, Dep. Var.: Y
## Observations: 6,816 
## Fixed-effects: id: 3,408
## Standard-errors: Clustered (id) 
##              Estimate Std. Error t value  Pr(>|t|)    
## period       0.709973   0.040592 17.4905 < 2.2e-16 ***
## period:treat 1.569548   0.057843 27.1348 < 2.2e-16 ***
## ... 1 variable was removed because of collinearity (treat)
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.84385     Adj. R2: 0.786127
##                 Within R2: 0.496142


Now it is common to have multiple models due to including covariates, looking at particular subgroups or other reasons, and in these cases there could be value to presenting the focal information about the ATT provided by the different models’ DD coefficients in a visualization (with the relevant model table in appendix).

Here we arbitrarily model the odd and even ID’ed cases separately to illustrate what a multi-model graphic might look like using the DD coefficients and their SEs.

model_2period_odds <- feols(Y ~ period * treat | id, sim_2_period %>% filter(id %% 2 == 1))
## The variable 'treat' has been removed because of collinearity (see $collin.var).
model_2period_evens <- feols(Y ~ period * treat | id, sim_2_period %>% filter(id %% 2 == 0))
## The variable 'treat' has been removed because of collinearity (see $collin.var).
coef_2period <- tidy(model_2period) %>%
  as_tibble() %>%
  mutate(model = "Overall")

coef_2period_odds <- tidy(model_2period_odds) %>%
  as_tibble() %>%
  mutate(model = "Odds")

coef_2period_evens <- tidy(model_2period_evens) %>%
  as_tibble() %>%
  mutate(model = "Evens")

model_coefs <- bind_rows(coef_2period, coef_2period_odds, coef_2period_evens) %>%
  filter(str_detect(term, "treat"))

ggplot(model_coefs, aes(x = estimate, y = model, 
                        xmin = estimate - 1.96 * std.error,
                        xmax = estimate + 1.96 * std.error)) +
  geom_errorbarh() +
  geom_point() +
  geom_vline(xintercept = 0) +
  theme_bw() +
  labs(x = "\nDD Coefficient", y = "Model Specification\n")


Multiple time periods

Nonetheless, in instances where we have greater complexity to the study design, a figure illustrating the average treatment effects–whether group-time specific or averaged over these groups–can be useful.

Following the vignette from the did package website, we can use the following code to estimate the group-time average treatment effects without covariates.

# estimate group-time average treatment effects without covariates
mw.attgt <- att_gt(yname = "lemp",
                   gname = "first.treat",
                   idname = "countyreal",
                   tname = "year",
                   xformla = ~ 1,
                   data = mpdta,
                   )

# summarize the results
summary(mw.attgt) 
## 
## Call:
## att_gt(yname = "lemp", tname = "year", idname = "countyreal", 
##     gname = "first.treat", xformla = ~1, data = mpdta)
## 
## Reference: Callaway, Brantly and Pedro H.C. Sant'Anna.  "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015> 
## 
## Group-Time Average Treatment Effects:
##  Group Time ATT(g,t) Std. Error [95% Simult.  Conf. Band]  
##   2004 2004  -0.0105     0.0243       -0.0777      0.0567  
##   2004 2005  -0.0704     0.0328       -0.1612      0.0204  
##   2004 2006  -0.1373     0.0403       -0.2487     -0.0258 *
##   2004 2007  -0.1008     0.0341       -0.1952     -0.0064 *
##   2006 2004   0.0065     0.0235       -0.0585      0.0716  
##   2006 2005  -0.0028     0.0188       -0.0548      0.0493  
##   2006 2006  -0.0046     0.0181       -0.0548      0.0456  
##   2006 2007  -0.0412     0.0208       -0.0988      0.0164  
##   2007 2004   0.0305     0.0145       -0.0096      0.0706  
##   2007 2005  -0.0027     0.0160       -0.0470      0.0416  
##   2007 2006  -0.0311     0.0181       -0.0812      0.0190  
##   2007 2007  -0.0261     0.0163       -0.0713      0.0191  
## ---
## Signif. codes: `*' confidence band does not cover 0
## 
## P-value for pre-test of parallel trends assumption:  0.16812
## Control Group:  Never Treated,  Anticipation Periods:  0
## Estimation Method:  Doubly Robust


There is a default plot method for the did package that enables you to avoid making the plot for yourself.

# plot the results
# set ylim so that all plots have the same scale along y-axis
ggdid(mw.attgt, ylim = c(-.3, .3), type = "calendar")


A note about using color in figures

There are some accessibility issues with the default plot that may motivate modifications. Most important is that the default ggplot red and blue colors collapse to very similar shades of gray when printed in black and white. This is emblematic of a general problem that has to be grappled with when producing visualizations for published research articles.

See the interpretability of the figure (cropped for size) from the Callaway and Sant’anna (2021) article in color:

Compared to in grayscale:


It’s not impossible, but takes a lot more scrutiny to see the residual differences in shade once the blue and red hues are taken away.

While you do occasionally see folks punt and add a “you need to be looking at a PDF to see the color version and only this color version makes sense” caption to figures in an article, this is a case where we can easily make a change to have a color rich version for digital viewers, but a still comprehensible solution for print viewers.

Enter: viridis. You have probably already seen this palette in use and maybe you are even already a user yourself, but for any who are not familiar this is a wonderful tool for effective data visualization with color. All variations on the color palette change in their hue (i.e. RGB values) but also value (grayscale). This means that, by default, the colors flatten to grays that should still perceptibly differ. Using discrete or binned values is a surefire way to ensure this is the case, since 4 or 5 levels of gray tend to be reasonably decipherable.

Thankfully, you can also pass theming = FALSE to ggdid() to have it produce an ATT plot but let you handle the theming yourself. There does not seem to be an easy way to manipulate the aesthetics of this plot (e.g., add a shape aesthetic for pre/post), unfortuantely, but you can make some useful modifications to your liking.

# plot the results
# set ylim so that all plots have the same scale along y-axis
ggdid(mw.attgt, ylim = c(-.3, .3), type = "calendar", theming = FALSE) +
  scale_color_viridis_d(begin = .75, end = .25, labels = c("Pre-treatment", "Post-treatment")) +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(title = "", subtitle = "")
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.


and, what would folks viewing our figure in print see?


How do the treatment effects vary by length of exposure?

#aggregate the group-time average treatment effects
mw.dyn <- aggte(mw.attgt, type = "dynamic")

#print the summary output
summary(mw.dyn)
## 
## Call:
## aggte(MP = mw.attgt, type = "dynamic")
## 
## Reference: Callaway, Brantly and Pedro H.C. Sant'Anna.  "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015> 
## 
## 
## Overall summary of ATT's based on event-study/dynamic aggregation:  
##      ATT    Std. Error     [ 95%  Conf. Int.]  
##  -0.0772        0.0204    -0.1173     -0.0372 *
## 
## 
## Dynamic Effects:
##  Event time Estimate Std. Error [95% Simult.  Conf. Band]  
##          -3   0.0305     0.0142       -0.0044      0.0654  
##          -2  -0.0006     0.0136       -0.0340      0.0328  
##          -1  -0.0245     0.0145       -0.0600      0.0111  
##           0  -0.0199     0.0122       -0.0500      0.0101  
##           1  -0.0510     0.0172       -0.0933     -0.0086 *
##           2  -0.1373     0.0385       -0.2318     -0.0427 *
##           3  -0.1008     0.0358       -0.1888     -0.0128 *
## ---
## Signif. codes: `*' confidence band does not cover 0
## 
## Control Group:  Never Treated,  Anticipation Periods:  0
## Estimation Method:  Doubly Robust
#use the default plot method
ggdid(mw.dyn, ylim = c(-.3, .3))


# plot the results
# set ylim so that all plots have the same scale along y-axis
ggdid(mw.dyn, ylim = c(-.3, .3), type = "dynamic", theming = FALSE) +
  scale_color_viridis_d(begin = .75, end = .25, labels = c("Pre-treatment", "Post-treatment")) +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(title = "", subtitle = "")
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.


Hands-on: Use the injury data or your own data to make graphics for a DD model

The following hands-on practice uses data from the following article.

B.D. Meyer, W.K. Viscusi, and D.L. Durbin (1995), “Workers’ Compensation and Injury Duration: Evidence from a Natural Experiment,” American Economic Review 85, 322-340.

Since these are data from an R library, you can view the codebook here.

Model the data for Kentucky (ki) and Michigan (mi) separately. Make log(durat) the dependent variable, afchnge the policy change dummy variable (i.e., 1 = after policy change) and highearn the dummy variable for high earners (i.e., 1 = high earner).

injury <- read_csv("./data/injury.csv")
## Rows: 7150 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (30): durat, afchnge, highearn, male, married, hosp, indust, injtype, ag...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(injury)
## Rows: 7,150
## Columns: 30
## $ durat    <dbl> 1, 1, 84, 4, 1, 1, 7, 2, 175, 60, 29, 30, 100, 4, 2, 1, 1, 2,…
## $ afchnge  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ highearn <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ male     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
## $ married  <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
## $ hosp     <dbl> 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
## $ indust   <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1…
## $ injtype  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ age      <dbl> 26, 31, 37, 31, 23, 34, 35, 45, 41, 33, 35, 25, 39, 27, 24, 2…
## $ prewage  <dbl> 404.9500, 643.8250, 398.1250, 527.8000, 528.9375, 614.2500, 5…
## $ totmed   <dbl> 1187.57324, 361.07855, 8963.65723, 1099.64832, 372.80188, 211…
## $ injdes   <dbl> 1010, 1404, 1032, 1940, 1940, 1425, 1110, 1207, 1425, 1010, 1…
## $ benefit  <dbl> 246.8375, 246.8375, 246.8375, 246.8375, 211.5750, 176.3125, 2…
## $ ky       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mi       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ldurat   <dbl> 0.0000000, 0.0000000, 4.4308167, 1.3862944, 0.0000000, 0.0000…
## $ afhigh   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ lprewage <dbl> 6.003764, 6.467427, 5.986766, 6.268717, 6.270870, 6.420402, 6…
## $ lage     <dbl> 3.258096, 3.433987, 3.610918, 3.433987, 3.135494, 3.526361, 3…
## $ ltotmed  <dbl> 7.079667, 5.889095, 9.100934, 7.002746, 5.921047, 5.351953, 4…
## $ head     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ neck     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ upextr   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ trunk    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ lowback  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ lowextr  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ occdis   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ manuf    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1…
## $ construc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ highlpre <dbl> 6.003764, 6.467427, 5.986766, 6.268717, 6.270870, 6.420402, 6…