We’ve covered quite a few methods for isolating causal effects!
Controlling for variables to close back doors (explain X and Y with the control, remove what’s explained)
Matching on variables to close back doors (find treated and non-treated observations with similar values of the matching variables)
Using a control group to control for time (before/after difference for treated and untreated, then difference them)
Using a cutoff to construct a very good control group (treated/untreated difference near a cutoff)
Today
We’ve got ONE LAST METHOD!
Today we’ll be covering instrumental variables
The basic idea is that we have some variable - the instrumental variable - that causes X but has no other back doors!
Natural Experiments
This calls back to our idea of trying to mimic an experiment without having an experiment. In fact, let’s think about an actual randomized experiment.
We have some random assignment R that determines your X. So even though we have back doors between X and Y, we can identify X -> Y
Natural Experiments
The idea of instrumental variables is this:
What if we can find a variable that can take the place of R in the diagram despite not actually being something we randomized in an experiment?
If we can do that, we’ve clearly got a “natural experiment”
When we find a variable that can do that, we call it an “instrument” or “instrumental variable”
Let’s call it Z
Instrumental Variable
So, for Z to take the place of R in the diagram, what do we need?
Z must be related to X (typically Z -> X but not always)
There must be no open paths from Z to Y except for ones that go through X
In other words “Z is related to X, and all the effect of Z on Y goes THROUGH X”
Instrumental Variable
How?
Explain X with Z, and keep only what is explained, X'
Explain Y with Z, and keep only what is explained, Y'
[If Z is logical/binary] Divide the difference in Y' between Z values by the difference in X' between Z values
[If Z is not logical/binary] Get the correlation between X' and Y'
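For the binary case, the division step can be written out as a formula. The notation here is an assumption added for illustration and is not used elsewhere in these slides: a bar with a subscript means the average of that variable among observations with that value of Z. For a binary Z, those group averages are exactly the explained parts X' and Y'.

```latex
\hat{\beta}_{IV} = \frac{\bar{Y}_{Z=1} - \bar{Y}_{Z=0}}{\bar{X}_{Z=1} - \bar{X}_{Z=0}}
```

The numerator is how much Y moves when Z changes, and the denominator is how much X moves when Z changes. Dividing one by the other rescales the movement in Y into "per unit of X".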
Graphically
Instrumental Variables
Notice that this whole process is like the opposite of controlling for a variable
We explain X and Y with the variable, but instead of tossing out what’s explained, we ONLY KEEP what’s explained!
Instead of saying “you’re on a back door, I want to close you” we say “you have no back doors! I want my X to be just like you! I’m only keeping that part of X that’s explained by you!”
Since Z has no back doors, the part of X explained by Z has no back doors to the part of Y explained by Z
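To see the contrast with controlling, here is a minimal sketch on simulated data with a non-binary Z. Everything in it is an assumption made for illustration: the variable W (an unobserved back door), the true effect of 2, and the choice of 10 bins. It also reports a slope between the explained parts alongside the correlation, which goes one step beyond the procedure described above.

```r
set.seed(1234)
n <- 5000

# W is an unobserved back door: it affects both X and Y
W <- rnorm(n)
# Z is the instrument: it affects X and has no other path to Y
Z <- rnorm(n)
X <- Z + W + rnorm(n)
# The true effect of X on Y is 2 (chosen for this simulation)
Y <- 2*X + 3*W + rnorm(n)

# Ignoring the back door overstates the effect
naive_slope <- unname(coef(lm(Y ~ X))["X"])

# Explain X and Y with Z: average each within bins of Z
z_bins <- cut(Z, breaks = 10)
X_explained <- ave(X, z_bins)  # keep ONLY the explained part
Y_explained <- ave(Y, z_bins)

# Correlation between the explained parts gives the direction
explained_cor <- cor(X_explained, Y_explained)
# Slope between the explained parts gives the size of the effect
iv_slope <- cov(X_explained, Y_explained) / var(X_explained)

c(naive = naive_slope, cor = explained_cor, iv = iv_slope)
```

The naive slope should land well above 2 because W is left open, while the slope between the explained parts should land close to 2.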
Imperfect Assignment
Let’s look at one of the common uses of instrumental variables, which, perhaps surprisingly, is when you actually do have a randomized experiment
In normal circumstances, if we have an experiment and assign people with R, we just compare Y across values of R:
library(tidyverse)
# Everyone does exactly what they are assigned, so X = R
df <- tibble(R = sample(c(0, 1), 500, replace = TRUE)) %>%
  mutate(X = R, Y = 5*X + rnorm(500))
# The truth is a difference of 5
df %>% group_by(R) %>% summarize(Y = mean(Y))
## # A tibble: 2 x 2
##       R       Y
##   <dbl>   <dbl>
## 1     0 -0.0199
## 2     1  4.93
Imperfect Assignment
But what happens if you run a randomized experiment and assign people with R, but not everyone does what you say? Some “treated” people don’t get the treatment, and some “untreated” people do get it
When this happens, we can’t just compare Y across R
But R is still a valid instrument!
Imperfect Assignment
df <- tibble(R = sample(c(0, 1), 500, replace = TRUE)) %>%
  # We tell them whether or not to get treated
  mutate(X = R) %>%
  # But some of them don't listen! 20% do the OPPOSITE!
  mutate(X = ifelse(runif(500) > .8, 1 - R, R)) %>%
  mutate(Y = 5*X + rnorm(500))
# The truth is a difference of 5
df %>% group_by(R) %>% summarize(Y = mean(Y))
## # A tibble: 2 x 2
##       R     Y
##   <dbl> <dbl>
## 1     0 0.895
## 2     1 3.91
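Why does the gap come out near 3 instead of 5? A back-of-the-envelope check, assuming exactly 20% of each group does the opposite of its assignment (as in the simulation), so that 80% of the R = 1 group and 20% of the R = 0 group end up treated:

```latex
E[Y \mid R=1] - E[Y \mid R=0]
  = 5 \times \left( E[X \mid R=1] - E[X \mid R=0] \right)
  = 5 \times (0.8 - 0.2)
  = 3
```

Comparing Y across R only picks up the share of the effect that R actually delivers through X, which is why the simple comparison is too small.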
Imperfect Assignment
So let’s do IV (instrumental variables); R is the IV.
iv <- df %>%
  group_by(R) %>%
  summarize(Y = mean(Y), X = mean(X))
iv
## # A tibble: 2 x 3
##       R     Y     X
##   <dbl> <dbl> <dbl>
## 1     0 0.895 0.158
## 2     1 3.91  0.778
# Remember, since our instrument is binary, we want the slope
(iv$Y[2] - iv$Y[1]) / (iv$X[2] - iv$X[1])
## [1] 4.868675
# Truth is 5!
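The same slope can also be reached by literally following the "explain X with the instrument, keep only what is explained" recipe with regressions. The sketch below is an illustration, not part of the original slides: it re-simulates the imperfect-assignment data in base R (so the numbers will differ from the output above) and checks that the by-hand slope and the two-stage version agree.

```r
set.seed(42)
n <- 500

# Same setup as the simulation above, written in base R
R <- sample(c(0, 1), n, replace = TRUE)
X <- ifelse(runif(n) > .8, 1 - R, R)  # 20% do the opposite
Y <- 5*X + rnorm(n)

# Slope by hand: difference in mean Y over difference in mean X
wald <- (mean(Y[R == 1]) - mean(Y[R == 0])) /
  (mean(X[R == 1]) - mean(X[R == 0]))

# Stage 1: explain X with R and keep only what is explained
X_explained <- fitted(lm(X ~ R))
# Stage 2: regress Y on the explained part of X
two_stage <- unname(coef(lm(Y ~ X_explained))["X_explained"])

c(wald = wald, two_stage = two_stage)
```

Both numbers should match each other and sit near the true effect of 5. The standard errors reported by the second lm() are not correct for IV, so for real work a dedicated function such as ivreg() in the AER package is the safer choice.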
Another Example
Justifying that an IV has no back doors can be hard!
Usually things aren’t as clean-cut as having actual randomization
And sometimes we may have to add controls in order to justify the IV
Think hard - are there really no other paths from Z to Y?
This will often require detailed contextual knowledge of the data generating process
Pollution and Driving
If air quality is really bad, you may choose to drive instead of walk/bike/bus in order to avoid breathing it
So do particularly smoggy days lead people to drive more?
Pan He and Cheng Xu ask this question using Shanghai as an example!
Pollution and Driving
Plenty of back doors - seasons, whether factories are running, smog levels last week…
Pollution and Driving
The direction of the wind could be an IV - Shanghai faces the water, and so when the wind blows West, it brings pollution into the city
Pollution and Driving
This gives us an IV we can use!
Of course, we need to control for Season to block out the back door.
The authors do indeed find that additional smog, brought in by the wind, increases the number of people who choose to drive - making the problem worse later!!
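To see why the Season control matters for the instrument, here is a minimal sketch on simulated data. None of this is the authors' data or their estimation method: the variable names (winter, wind_west, smog, driving), the effect sizes, and the "estimate within each season, then average" approach to controlling are all assumptions made for illustration.

```r
set.seed(2024)
n <- 5000

# Everything here is simulated; none of it is the authors' data
winter <- rbinom(n, 1, 0.5)                  # Season
wind_west <- rbinom(n, 1, 0.3 + 0.4*winter)  # wind direction depends on season
unobserved <- rnorm(n)                       # other back doors (factories, etc.)
smog <- 2*wind_west + winter + unobserved + rnorm(n)
# True effect of smog on driving is 1.5; season also affects driving directly
driving <- 1.5*smog + 2*winter - 2*unobserved + rnorm(n)

# Binary-instrument slope: difference in mean y over difference in mean x
iv_slope <- function(y, x, z) {
  (mean(y[z == 1]) - mean(y[z == 0])) / (mean(x[z == 1]) - mean(x[z == 0]))
}

# Without controlling for season, the back door through season is open
no_control <- iv_slope(driving, smog, wind_west)

# Control for season by computing the slope within each season, then averaging
by_season <- sapply(c(0, 1), function(s) {
  keep <- winter == s
  iv_slope(driving[keep], smog[keep], wind_west[keep])
})
with_control <- mean(by_season)

c(no_control = no_control, with_control = with_control)
```

In this setup the uncontrolled estimate should come out too large, because wind direction also picks up the direct effect of season on driving. Holding season fixed should bring the estimate back near the simulated truth of 1.5.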
Trade and Manufacturing
Another example: did Chinese imports reduce US manufacturing employment?
Employment in the US manufacturing sector has been dropping for decades
(note - manufacturing itself isn’t dropping, we’re manufacturing more than ever, we’re just doing it without as many actual people)
Trade and Manufacturing
The timing of the drop in manufacturing jobs coincides with us importing a lot more Chinese stuff
But did the Chinese imports cause the decline or was it a coincidence? Automation is another good explanation!
Or general declining US competitiveness in the global market, vs. everyone (not just China)
Trade and Manufacturing
Autor, Dorn, & Hanson use Chinese exports to other countries (CEXoth) as an IV for Chinese exports to the US (CEXus) in order to estimate the impact of Chinese exports to the US on US manufacturing employment (mfg)
Let’s think about whether this makes sense as an IV - any back doors from CEXoth to mfg we can imagine? Or front doors that don’t go through CEXus?
Also important: do we think that the arrow from CEXoth to CEXus is actually there?
Trade and Manufacturing
D is global demand for US manufactures, L is US labor supply, and Close measures how similar the kinds of things the US manufactures are to the kinds of things China manufactures
Trade and Manufacturing
So we need to control for D in some way to close CEXoth <- U1 -> D -> mfg but other than that we have a good instrument
(they do this, and also use information like “what does Close look like on a regional level?” to improve their estimate)
Autor, Dorn, & Hanson found that Chinese exports to other countries predicted Chinese exports to the US (China was opening up and becoming more effective as a producer, making its products attractive everywhere)
Trade and Manufacturing
And when you limit mfg and CEXus to just what’s explained by CEXoth, you do see that some decline in mfg is because of Chinese imports
Practice
Does the price of cigarettes affect smoking? Get AER package and data(CigarettesSW). Examine with help().
Get JUST the cigarette taxes cigtax from taxs-tax
Draw a causal diagram using packs, price, cigtax, and some back door W. What might W be?
Adjust price and cigtax for inflation: divide them by cpi
Explain price and packs with cigtax using cut(,breaks=7) for cigtax
Get correlation between the explained parts and plot the explained parts - does price reduce packs smoked?