Lecture 21 Difference in Differences

Nick Huntington-Klein

March 12, 2019

Recap

Last time we discussed the concept of identifying a causal effect by selecting a comparable untreated group (aka “control group”) that is the same EXCEPT for the treatment, so any differences are because of treatment
We picked a control group explicitly by matching on a set of variables
Of course, this is the same as controlling for those variables - if the variables we match on block all back door paths, we’re good. But if not?

Today

Today we’re going to look at one of the most commonly used methods in causal inference, called Difference-in-Differences
The basic idea is to combine an untreated comparison group with fixed effects
We have a treated group that we observe both before and after they’re treated
And we have an untreated group
The treated and control groups probably aren’t identical - there are back doors! So… we control for group like with fixed effects

The Basic Problem

What kind of setup lends itself to being studied with difference-in-differences?
Crucially, we need to have a group (or groups) that receives a treatment
And, we need to observe them both before and after they get their treatment
Observing each individual (or group) multiple times, kind of like we did with fixed effects

The Basic Problem

So one obvious thing we might do would be to just use fixed effects
Using variation within group, comparing the time before the policy to the time after
But!

The Basic Problem

Unlike with fixed effects, the relationship between time and treatment is very clear: early = no treatment. Late = treatment
So if anything else is changing over time, we have a back door!

The Basic Problem

Ok, time is a back door, no problem. We can observe and measure time. So we’ll just control for it and close the back door!
But we can’t!
Why not?
Because in this group, you’re either before treatment and D = 0, or after treatment and D = 1. If we control for time, we’re effectively controlling for treatment
“What’s the effect of treatment, controlling for treatment” doesn’t make any sense

The Basic Problem

#Load the tidyverse for tibble(), mutate(), and the %>% pipe
library(tidyverse)
#Create our data
diddata <- tibble(year = sample(2002:2010,10000,replace=T)) %>%
  mutate(D = year >= 2007) %>% mutate(Y = 2*D + .5*year + rnorm(10000))
#Now, control for year
diddata <- diddata %>% group_by(year) %>% mutate(D.r = D - mean(D), Y.r = Y - mean(Y))
#What's the difference with and without treatment?
diddata %>% group_by(D) %>% summarize(Y=mean(Y))

## # A tibble: 2 x 2
##   D         Y
##   <lgl> <dbl>
## 1 FALSE 1002.
## 2 TRUE  1006.

#And controlling for time?
diddata %>% group_by(D.r) %>% summarize(Y=mean(Y.r))

## # A tibble: 1 x 2
##     D.r        Y
##   <dbl>    <dbl>
## 1     0 1.84e-15
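The same problem shows up if we try to control for time in a regression. Here is a minimal sketch in base R, assuming simulated data like the above (the seed and the numeric coding of D are my own choices for illustration). Because D is an exact combination of the 2007-2010 year dummies, lm() has nothing left to estimate it with and reports NA.

```r
# Simulated data mirroring the slide: everyone is treated from 2007 onward
set.seed(1)  # assumed seed, only so the sketch is reproducible
n <- 10000
year <- sample(2002:2010, n, replace = TRUE)
D <- as.numeric(year >= 2007)
Y <- 2 * D + 0.5 * year + rnorm(n)

# Control for time with a dummy for each year, then try to estimate the effect of D
m <- lm(Y ~ factor(year) + D)

# D equals the sum of the 2007-2010 dummies, so it is perfectly collinear with
# the year controls and its coefficient cannot be estimated
coef(m)["D"]
```

Once we've controlled for year there is no variation in treatment left, which is the regression version of "the effect of treatment, controlling for treatment."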

The Difference-in-differences Solution

We can add a control group that did not get the treatment
Then, any changes that are the result of time should show up for that control group, and we can get rid of them!
The change for the treated group from before to after is because of both treatment and time. If we measure the time effect using our control group, and subtract that out, we’re left with just the effect of treatment!

Once Again

#Create our data
diddata <- tibble(year = sample(2002:2010,10000,replace=T),
                  group = sample(c('TreatedGroup','UntreatedGroup'),10000,replace=T)) %>%
  mutate(after = (year >= 2007)) %>%
  #Only let the treatment be applied to the treated group
  mutate(D = after*(group=='TreatedGroup')) %>%
  mutate(Y = 2*D + .5*year + rnorm(10000))
#Now, get before-after differences for both groups
means <- diddata %>% group_by(group,after) %>% summarize(Y=mean(Y))

#Before-after difference for untreated, has time effect only
bef.aft.untreated <- filter(means,group=='UntreatedGroup',after==1)$Y - filter(means,group=='UntreatedGroup',after==0)$Y
#Before-after for treated, has time and treatment effect
bef.aft.treated <- filter(means,group=='TreatedGroup',after==1)$Y - filter(means,group=='TreatedGroup',after==0)$Y

#Difference-in-Difference! Take the Time + Treatment effect, and remove the Time effect
DID <- bef.aft.treated - bef.aft.untreated
DID

## [1] 1.976004

The Difference-in-Differences Solution

This is our way of controlling for time
Of course, we’re NOT accounting for the fact that our treatment and control groups may be different from each other

The Difference-in-Differences Solution

Except that we are! We’re comparing each group to itself over time (controlling for group, like fixed effects), and then comparing those differences between groups (controlling for time). The Difference in the Differences!
Let’s imagine there is an important difference between groups. We’ll still get the same answer

Once Again

#Create our data
diddata <- tibble(year = sample(2002:2010,10000,replace=T),
                  group = sample(c('TreatedGroup','UntreatedGroup'),10000,replace=T)) %>%
  mutate(after = (year >= 2007)) %>%
  #Only let the treatment be applied to the treated group
  mutate(D = after*(group=='TreatedGroup')) %>%
  mutate(Y = 2*D + .5*year + (group == 'TreatedGroup') +  rnorm(10000))
#Now, get before-after differences for both groups
means <- diddata %>% group_by(group,after) %>% summarize(Y=mean(Y))

#Before-after difference for untreated, has time effect only
bef.aft.untreated <- filter(means,group=='UntreatedGroup',after==1)$Y - filter(means,group=='UntreatedGroup',after==0)$Y
#Before-after for treated, has time and treatment effect
bef.aft.treated <- filter(means,group=='TreatedGroup',after==1)$Y - filter(means,group=='TreatedGroup',after==0)$Y

#Difference-in-Difference! Take the Time + Treatment effect, and remove the Time effect
DID <- bef.aft.treated - bef.aft.untreated
DID

## [1] 1.953753

The Difference-in-Differences Solution

We can think about what’s in there and what we’re taking out
Untreated Before Treatment: Untreated Group Mean
Untreated After Treatment: Untreated Group Mean + Time
Treated Before Treatment: Treated Group Mean
Treated After Treatment: Treated Group Mean + Time + Treatment

The Difference-in-Differences Solution

Before-After Difference for Untreated:
(Untreated Group + Time) - (Untreated Group) = Time
Before-After Difference for Treated:
(Treated Group + Time + Treatment) - (Treated Group) = Time + Treatment
Difference-in-Differences:
Before-After Diff for Treated - B-A Diff for Untreated = (Time + Treatment) - (Time) = Treatment
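The same subtraction can be written out with symbols. This is just a restatement of the lines above, with notation I'm introducing for illustration: $\mu_T$ and $\mu_U$ are the treated and untreated group means, $\lambda$ is the time effect, and $\tau$ is the treatment effect.

```latex
\begin{align*}
\text{Untreated: } & (\mu_U + \lambda) - \mu_U = \lambda \\
\text{Treated: }   & (\mu_T + \lambda + \tau) - \mu_T = \lambda + \tau \\
\text{DID: }       & (\lambda + \tau) - \lambda = \tau
\end{align*}
```

Notice that $\mu_T$ and $\mu_U$ never need to be equal. Each one cancels inside its own before-after difference.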

The Difference-in-Differences Solution

We are in this way taking out what’s explained by group and by time, controlling for both
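One way to see the "controlling for both" idea is to run a regression with a group indicator, a time indicator, and their interaction. The sketch below uses base R and simulated data like our earlier examples (the seed and the 0/1 coding of treated and after are assumptions made for illustration). With only two groups and two periods, the interaction coefficient should match the DID of the four means.

```r
# Simulated data like the slides: a group difference, a time trend, and a
# treatment effect of 2 for the treated group from 2007 onward
set.seed(2)  # assumed seed, only so the sketch is reproducible
n <- 10000
year <- sample(2002:2010, n, replace = TRUE)
treated <- sample(c(0, 1), n, replace = TRUE)
after <- as.numeric(year >= 2007)
D <- after * treated
Y <- 2 * D + 0.5 * year + treated + rnorm(n)

# DID by hand from the four cell means (rows: treated 0/1, columns: after 0/1)
cell <- tapply(Y, list(treated, after), mean)
did_means <- unname((cell["1", "1"] - cell["1", "0"]) -
                    (cell["0", "1"] - cell["0", "0"]))

# DID as the interaction term: treated controls for group, after controls for time
m <- lm(Y ~ treated * after)
did_reg <- coef(m)[["treated:after"]]

c(means = did_means, regression = did_reg)
```

The regression version is handy because it also gives you a standard error for the DID estimate via summary(m).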

Graphically

Example

The classic difference-in-differences example is the Mariel Boatlift
There’s a lot of discussion these days on the impacts of immigration
Immigrants might provide additional labor market competition to people who already live here, driving down wages
Does this actually happen?

Mariel Boatlift

In 1980, Cuba very briefly lifted emigration restrictions
LOTS of people left the country very quickly, many of them going to Miami
The Miami labor force increased by 7% in a year
If immigrants were ever going to cause a problem for workers already there, seems like it would be happening here

Mariel Boatlift

David Card studied this using Difference-in-Differences, noticing that this influx of immigrants mainly affected Miami, and so other cities in the country could act as a control group
He used Atlanta, Houston, Los Angeles, and Tampa-St. Petersburg as comparisons
How did wages and unemployment of everyone other than Cubans change in Miami from 1979-80 to 81-85, and how did it change in the control cities?

Mariel Boatlift

load('mariel.RData')
#Take the log of wage and create our "after treatment" and "treated group" variables
df <- mutate(df,lwage = log(hourwage),
             after = year >= 81,
             miami = smsarank == 26)

#Then we can do our difference in difference!
means <- df %>% group_by(after,miami) %>% summarize(lwage = mean(lwage),unemp=mean(unemp))
means

## # A tibble: 4 x 4
## # Groups:   after [2]
##   after miami lwage  unemp
##   <lgl> <lgl> <dbl>  <dbl>
## 1 FALSE FALSE  1.88 0.0619
## 2 FALSE TRUE   1.74 0.0547
## 3 TRUE  FALSE  1.84 0.0794
## 4 TRUE  TRUE   1.72 0.0854

Mariel Boatlift

Did the wages of non-Cubans in Miami drop with the influx?
means$lwage[4] - means$lwage[2] = -0.019. Uh oh!
But how about in the control cities?
means$lwage[3] - means$lwage[1] = -0.046
Things were getting worse everywhere! How about the overall difference-in-difference?
-0.019 - (-0.046) = 0.027! Relative to the control cities, wages actually got BETTER for others with the influx of immigrants

Mariel Boatlift

We can do the same thing for unemployment!
Difference in Miami: means$unemp[4] - means$unemp[2] = 0.031
Difference in control cities: means$unemp[3] - means$unemp[1] = 0.018
Difference-in-differences: 0.013.
So unemployment did rise more in Miami
Similar results if we look only at those without a HS degree, with whom the Cuban immigrants would be competing most directly (wage DID 0.031, unemployment 0.019)
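If you want to check the arithmetic without loading mariel.RData, here is a small sketch that retypes the printed table of means and applies the same subtraction. The helper did_from_means() is a hypothetical name of my own. Because the printed means are rounded, the wage result comes out closer to 0.02 than the 0.027 you get from the unrounded data.

```r
# The four printed means from the slide, in the same row order as the tibble
means <- data.frame(
  after = c(FALSE, FALSE, TRUE, TRUE),
  miami = c(FALSE, TRUE, FALSE, TRUE),
  lwage = c(1.88, 1.74, 1.84, 1.72),
  unemp = c(0.0619, 0.0547, 0.0794, 0.0854)
)

# Hypothetical helper: (Miami after - Miami before) - (control after - control before)
did_from_means <- function(x) (x[4] - x[2]) - (x[3] - x[1])

did_lwage <- did_from_means(means$lwage)
did_unemp <- did_from_means(means$unemp)
c(lwage = did_lwage, unemp = did_unemp)
```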

Difference-in-Differences

It’s important in cases like this (and in all cases!) to think hard about whether we believe our causal diagram, and what that entails
Which, remember, is this:

Hidden Assumptions

One thing we’re assuming is that Time affected the Treatment and Control groups equally*
Otherwise, our attempt to control for Time by using how it changed in Control won’t work!
Is something missing from our diagram related to how either Control or Treatment might have changed from Before to After?
For example, what if Miami wages were already growing faster than Control wages before 1980?
A common way to check is to look at how wages were changing in the years up to treatment

*This is called the “parallel trends” assumption.
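One rough way to do that check is to keep only the pre-treatment years and ask whether the two groups have different time trends. The sketch below uses simulated data where trends really are parallel (the seed, variable names, and linear-trend setup are all assumptions for illustration, and this is only one of several ways people check pre-trends). An interaction near zero is consistent with parallel trends before treatment, though it can't prove the trends would have stayed parallel afterward.

```r
# Simulated PRE-treatment data only (2002-2006), with a shared time trend
set.seed(3)  # assumed seed, only so the sketch is reproducible
n <- 10000
year <- sample(2002:2006, n, replace = TRUE)
treated <- sample(c(0, 1), n, replace = TRUE)
Y <- 0.5 * year + treated + rnorm(n)
pre <- data.frame(Y, treated, t = year - 2002)

# Let each group have its own linear trend; t:treated is the gap between the slopes
m_pre <- lm(Y ~ t * treated, data = pre)
pretrend_gap <- coef(m_pre)[["t:treated"]]
pretrend_gap
```

A gap in pre-treatment slopes that is large relative to its standard error is a warning sign that the control group isn't tracking the time effect for the treated group.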

Hidden Assumptions

Also, how did we get that list of control cities?
Our intuition for using a Control Group is that they should be basically exactly the same except they didn’t get the treatment
Are LA, Houston, Atlanta, and Tampa basically the same as Miami?

Hidden Assumptions

In the case of the Mariel Boatlift, a later paper by Peri & Yasenov checks both of these things and gets similar results
It uses Synthetic Control - a form of matching - to pick Control cities that were trending similarly before 1980

Practice

The Earned Income Tax Credit was increased in 1993. This may increase the chances that single mothers (treated) return to work, but likely does not affect single women without children (control)
read.csv('http://nickchk.com/eitc.csv')
Create variables after for years 1994+, and treated if they have any children
Get average work within year and treated. Use plot(,type='l',ylim=c(.4,.6)) to plot average work against year for the treated group (blue), then use points(,type='l') to add the untreated group (red). Any concerns they’re already trending together/apart in 1991-1993?
Calculate the DID estimate of the effect of the EITC expansion on work

Practice Answers

df <- read.csv('http://nickchk.com/eitc.csv') %>%
  mutate(after = year >= 1994,
         treated = children > 0)

plotdata <- df %>% group_by(treated,year) %>%
  summarize(work = mean(work))

plot(filter(plotdata,treated==1)$year,
     filter(plotdata,treated==1)$work,col='blue',type='l',ylim=c(.4,.6))
points(filter(plotdata,treated==0)$year,
       filter(plotdata,treated==0)$work,col='red',type='l')
abline(v=1994)
# They don't appear to be trending away or towards each other before 1994. Good!

#Now DID:
did <- df %>% group_by(treated,after) %>% summarize(work = mean(work))
untreat.diff <- did$work[2]-did$work[1]
treat.diff <- did$work[4]-did$work[3]
did.estimate <- treat.diff - untreat.diff
did.estimate

## [1] 0.04687313