Lecture 26 Causal Inference Midterm Review

Nick Huntington-Klein

March 28, 2019

Causal Inference Midterm

Similar format to the homeworks we’ve been having
At least one question evaluating a research question and drawing a dagitty graph
At least one question identifying the right causal inference method to use
At least one question about the feature(s) of the methods
At least one question carrying out a method in R

Causal Inference Midterm

Covers everything up to IV (obviously, a focus on things since the Programming Midterm, but there is a little programming)
No internet (except dagitty) or slides available this time
One 3x5 index card, front and back
You’ll have the whole class period so don’t be late!

Causal Diagrams

Consider all the variables that are likely to be important in the data generating process (this includes variables you can’t observe)
For simplicity, combine them together or prune the ones least likely to be important
Consider which variables are likely to affect which other variables and draw arrows from one to the other
(Bonus: Test some implications of the model to see if you have the right one)
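
One way to do the bonus step, sketched with the dagitty R package. The toy diagram here (W, X, M, Y) is made up for illustration and is not one of the course diagrams.

```r
library(dagitty)

# Hypothetical toy diagram: W confounds X and Y, and X affects Y only through M
g <- dagitty("dag {
  W -> X
  W -> Y
  X -> M
  M -> Y
}")

# Each line printed is an independence that the diagram says should hold in the data,
# for example: M should be unrelated to W once we control for X
implied <- impliedConditionalIndependencies(g)
implied
```

If one of these relationships clearly fails in your data, that's a hint the diagram is missing an arrow or a variable.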

Causal Diagrams

Identifying X -> Y by closing back doors:

Find all the paths from X to Y on the diagram
Determine which are “front doors” (start with X ->) and which are “back doors” (start with X <-)
Determine which are already closed by colliders (X -> C <- Y)
Then, identify the effect by finding which variables you need to control for to close all back doors (careful - don’t close the front doors, or reopen paths by controlling for colliders!)
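
The steps above can be sketched in R with the dagitty package. The diagram is a hypothetical toy example (W is a confounder, C is a collider), not one from the homeworks.

```r
library(dagitty)

# Hypothetical toy diagram: W is a confounder, C is a collider
g <- dagitty("dag {
  X -> Y
  W -> X
  W -> Y
  X -> C
  Y -> C
}")

# List every path from X to Y and whether it is currently open
p <- paths(g, from = "X", to = "Y")
data.frame(path = p$paths, open = p$open)

# Which variables need to be controlled for to close the back doors?
adj <- adjustmentSets(g, exposure = "X", outcome = "Y")
adj
```

The path through C should show up as closed without us doing anything, and the only variable we need to control for is W.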

Causal Diagrams

Let’s draw (and justify) a diagram to get the effect of Building Code Restrictions BCR, which prevent housing from being built, on Rent
Consider perhaps: the Supply of housing built, characteristics of the location that lead to BCRs being passed, Demand for housing in the area, the overall economy…

Causal Diagrams Answer

One answer, with non-BCR Laws, Labor market, economy:

Causal Diagram Answer

Open front doors:

BCR -> Sup -> Rent
(note all others closed because they use Sup as a collider)

Open back doors:

BCR <- U1 -> Laws -> Rent
BCR <- loc -> Laws -> Rent
BCR <- loc -> Dem -> Rent
BCR <- loc -> Sup -> Rent

Which other paths are there that are closed by colliders?

Causal Diagram Answer

If we control for Laws, then BCR <- U1 -> Laws <- loc -> etc. opens back up!
Thankfully if we control for loc that shuts it back down
We can identify this by controlling just for loc and Laws
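
A rough check of this reasoning with dagitty. This is only a partial reconstruction: it includes just the arrows implied by the paths listed earlier, leaves out the labor market and economy nodes, and assumes U1 is unobserved. The full diagram from class may have more arrows.

```r
library(dagitty)

# Partial reconstruction (assumption): only arrows implied by the listed paths.
# U1 is marked latent because it stands for something we can't observe.
bcr <- dagitty("dag {
  U1 [latent]
  BCR -> Sup
  Sup -> Rent
  U1 -> BCR
  U1 -> Laws
  Laws -> Rent
  loc -> BCR
  loc -> Laws
  loc -> Dem
  loc -> Sup
  Dem -> Rent
}")

# Smallest set(s) of controls that close every back door
adj <- adjustmentSets(bcr, exposure = "BCR", outcome = "Rent")
adj

# Controlling for Laws alone is not enough; adding loc fixes it
laws.only <- isAdjustmentSet(bcr, "Laws", exposure = "BCR", outcome = "Rent")
loc.and.laws <- isAdjustmentSet(bcr, c("loc", "Laws"), exposure = "BCR", outcome = "Rent")
c(laws.only = laws.only, loc.and.laws = loc.and.laws)
```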

Controlling

One way to close back doors is by controlling
Control for W by finding the part of X and Y that W explains (binning W with cut() if it’s continuous) and taking it out

library(tidyverse)
library(Ecdat)
data(BudgetFood)
# Raw correlation between food share and total expenditure
cor(BudgetFood$wfood,BudgetFood$totexp)

## [1] -0.5125209

# Control for age: bin it with cut(), then subtract the within-bin means
BudgetFood <- BudgetFood %>% group_by(age.bin = cut(age,breaks=5)) %>%
  mutate(wfood.r = wfood - mean(wfood),totexp.r = totexp-mean(totexp)) %>%
  ungroup()
cor(BudgetFood$wfood.r,BudgetFood$totexp.r)

## [1] -0.4852561
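
With real data we can't tell whether controlling got us closer to the truth. Here is a sketch with simulated data (made up for illustration) where we know the true effect of X on Y is 2. It compares slopes rather than correlations so the result can be lined up against that 2.

```r
library(dplyr)

set.seed(1000)
# Simulated data (not course data): W -> X, W -> Y, true effect of X on Y is 2
sim <- tibble(W = rnorm(5000)) %>%
  mutate(X = W + rnorm(5000),
         Y = 2*X + 3*W + rnorm(5000))

# Raw relationship, contaminated by the back door X <- W -> Y
raw.slope <- cov(sim$X, sim$Y) / var(sim$X)

# Control for W: bin it, then subtract the within-bin means of X and Y
sim <- sim %>%
  group_by(W.bin = cut(W, breaks = 10)) %>%
  mutate(X.r = X - mean(X), Y.r = Y - mean(Y)) %>%
  ungroup()
controlled.slope <- cov(sim$X.r, sim$Y.r) / var(sim$X.r)

c(raw = raw.slope, controlled = controlled.slope)
```

The controlled slope should land much closer to 2 than the raw one, though not exactly on it, since W still varies a little inside each bin.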

Fixed Effects

If we have data where we observe the same people over and over, we can implement fixed effects by controlling for individual
In our rent example, this would be a control for loc
This accounts for everything that’s constant within individual. Here, geography, etc.
Doesn’t account for things that vary within individual over time, like Laws
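
A minimal sketch of fixed effects on a simulated panel (made up for illustration, not the rent data). Each individual has a constant trait `a` that we pretend we can't observe, and the true effect of X on Y is 1.

```r
library(dplyr)

set.seed(1000)
# Simulated panel (not course data): 200 individuals observed 5 times each.
# 'a' is constant within individual and affects both X and Y.
panel <- tibble(id = rep(1:200, each = 5),
                a = rep(rnorm(200), each = 5)) %>%
  mutate(X = a + rnorm(1000),
         Y = 1*X + 2*a + rnorm(1000))

# Raw relationship, with the back door through 'a' still open
raw.slope <- cov(panel$X, panel$Y) / var(panel$X)

# Fixed effects: control for individual by subtracting each person's own means
panel <- panel %>%
  group_by(id) %>%
  mutate(X.r = X - mean(X), Y.r = Y - mean(Y)) %>%
  ungroup()
fe.slope <- cov(panel$X.r, panel$Y.r) / var(panel$X.r)

c(raw = raw.slope, fixed.effects = fe.slope)
```

Notice that `a` never appears in the fixed effects calculation. Subtracting within-individual means removes anything constant within individual, observed or not.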

Instrumental Variables

Or, we can ignore those back doors altogether if we have an instrumental variable
If Z and X are related, and all open paths from Z to Y go through X, then Z can be an instrument for X
We isolate JUST the variation that comes from Z. No back doors in that variation! We have a causal effect
Can conceptually think of it as (or literally apply it to) an experiment where randomization doesn’t work perfectly

Instrumental Variables

# W is a confounder, Z is the instrument, and the true effect of X on Y is 3
df <- tibble(W = rnorm(1000),Z=sample(c(0,1),1000,replace=TRUE)) %>%
  mutate(X = rnorm(1000) + W + Z) %>%
  mutate(Y = rnorm(1000) + 3*X - 10*W)

cor(df$X,df$Y)

## [1] -0.286212

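# Average X and Y within each value of Z (row 1 is Z = 0, row 2 is Z = 1)
iv <- df %>% group_by(Z) %>%
  summarize(X = mean(X),Y=mean(Y))
# IV estimate: difference in Y across Z, divided by difference in X across Z
(iv$Y[2]-iv$Y[1])/(iv$X[2]-iv$X[1])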

## [1] 3.542106
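
A sketch of the same calculation with a larger simulated sample, plus a check that Z actually moves X. The last line uses a covariance ratio, which for a binary Z gives exactly the same number as the difference-in-means version. The variable names `first.stage`, `wald`, and `cov.ratio` are just labels chosen here.

```r
library(dplyr)

set.seed(1000)
# Same setup as the slide, with a larger sample; the true effect of X on Y is 3
df <- tibble(W = rnorm(10000), Z = sample(c(0, 1), 10000, replace = TRUE)) %>%
  mutate(X = rnorm(10000) + W + Z) %>%
  mutate(Y = rnorm(10000) + 3*X - 10*W)

# Z and X need to be related: how much does X differ across values of Z?
first.stage <- mean(df$X[df$Z == 1]) - mean(df$X[df$Z == 0])

# Difference in Y across Z groups, divided by difference in X across Z groups
wald <- (mean(df$Y[df$Z == 1]) - mean(df$Y[df$Z == 0])) / first.stage

# Equivalent way to write the same estimate when Z is binary
cov.ratio <- cov(df$Z, df$Y) / cov(df$Z, df$X)

c(first.stage = first.stage, wald = wald, cov.ratio = cov.ratio)
```

IV estimates are noisy, so even with the right setup the answer bounces around the true value of 3 from sample to sample.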

Treated and Untreated Groups

In many cases, we want to know the effect of having something or not, D, and want to compare a treated group (D=1) to an untreated one (D=0)
In each case we are trying to find apples-to-apples comparisons
Controlling works for this, but there are many other methods
How can we make our two groups comparable?

Matching

Instead of controlling, we can construct our treatment and control groups using matching
In class we’ve used Coarsened Exact Matching
This closes back doors for whatever we matched on
Works, like controlling, if we can observe and measure all the variables necessary to block the back doors
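
A sketch of coarsened exact matching on simulated data (made up for illustration), where the true effect of D on Y is 2 and W is the only back door. Weighting the bins by the number of treated observations is a choice made here, not something the slides specify.

```r
library(dplyr)

set.seed(1000)
# Simulated data (not course data): W pushes people into treatment and raises Y
sim <- tibble(W = rnorm(5000)) %>%
  mutate(D = (W + rnorm(5000)) > 0) %>%
  mutate(Y = 2*D + 3*W + rnorm(5000))

# Raw treated-vs-untreated comparison, with the back door through W open
raw.diff <- mean(sim$Y[sim$D]) - mean(sim$Y[!sim$D])

# Coarsen W into bins, then get the mean of Y in each bin-by-treatment cell
cells <- sim %>%
  mutate(W.c = cut(W, breaks = 10)) %>%
  group_by(W.c, D) %>%
  summarize(Y = mean(Y), n = n(), .groups = "drop")

treated <- cells %>% filter(D) %>% select(W.c, Y.t = Y, n.t = n)
untreated <- cells %>% filter(!D) %>% select(W.c, Y.u = Y)

# inner_join keeps only bins that contain both treated and untreated observations
matched <- inner_join(treated, untreated, by = "W.c")

# Average the within-bin differences, weighting by number of treated observations
matched.diff <- weighted.mean(matched$Y.t - matched$Y.u, matched$n.t)

c(raw = raw.diff, matched = matched.diff)
```

The matched difference should be much closer to 2 than the raw one. It won't be exact because W still varies a bit within each bin.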

Matching

What is the effect of gender on the proportion of your income spent on food?
We’ll match on size (coarsened into bins) and town
Use inner_join to match up treated (“male”) and untreated (“female”) observations

Matching

bf <- BudgetFood %>% select(wfood,size,town,sex) %>%
  mutate(size.c=cut(size,breaks=3)) %>%
  group_by(size.c,town,sex) %>%
  summarize(wfood = mean(wfood)) %>% ungroup()

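# Split into treated ("man") and untreated ("woman") cells
bf.male <- filter(bf,sex=="man") %>% rename(wfood.m = wfood) %>% select(-sex)
bf.female <- filter(bf,sex=="woman") %>% rename(wfood.f = wfood) %>% select(-sex)

# Keep only the size-by-town cells that contain both
matched <- inner_join(bf.male,bf.female,by=c("size.c","town"))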
mean(matched$wfood.m)

## [1] 0.4240166

mean(matched$wfood.f)

## [1] 0.4300931

Difference-in-Difference

Difference-in-Difference applies when you have a group that you can observe both before and after the policy
You worry that time is a confounder, but you can’t control for it
Unless you add a control group that DIDN’T get the policy

Difference-in-Difference

Get the before-after difference for both groups
Then subtract out the difference for the control

diddata <- tibble(Group=c(rep("C",2500),rep("T",2500)),
                  Time=rep(c(rep("Before",1250),rep("After",1250)),2)) %>%
  mutate(Treated = (Group == "T") & Time == "After") %>%
  mutate(Y = 2*(Group == "T") + 1.5*(Time == "After") + 3*Treated + rnorm(5000))
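did <- diddata %>% group_by(Group,Time) %>% summarize(Y = mean(Y))
# Rows come out sorted alphabetically: (C, After), (C, Before), (T, After), (T, Before)
# so each difference below is After minus Before
before.after.control <- did$Y[1] - did$Y[2]
before.after.treated <- did$Y[3] - did$Y[4]
did.effect <- before.after.treated - before.after.control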
did.effect

## [1] 3.007027
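
The same number can also come out of a regression with an interaction term. This is a swapped-in alternative, not the approach these slides use, and the 0/1 variable names `grp` and `post` are made up here. The simulated setup mirrors the one above, with a true effect of 3.

```r
set.seed(1000)
# Same kind of simulated setup as above, written with 0/1 indicators
n <- 5000
grp  <- rep(c(0, 1), each = n/2)                  # 1 = treated group
post <- rep(rep(c(0, 1), each = n/4), times = 2)  # 1 = after the policy
Y <- 2*grp + 1.5*post + 3*grp*post + rnorm(n)

# Difference-in-difference from the four group means
m <- tapply(Y, list(grp, post), mean)
did.means <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

# Same number from a regression: the coefficient on the interaction term
fit <- lm(Y ~ grp * post)
did.reg <- coef(fit)[["grp:post"]]

c(means = did.means, regression = did.reg)
```

Using named rows and columns (`m["1", "0"]`) instead of positions like `Y[1]` makes it harder to accidentally subtract in the wrong order.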

Regression Discontinuity

If we have a treatment D that is assigned based on a cutoff in a running variable, we can use regression discontinuity
Focus right around the cutoff and compare above-cutoff to below-cutoff
We’ve isolated a great set of treatment/control groups because in this area it’s basically random whether you’re above or below the cutoff

Regression Discontinuity

rdddata <- tibble(W=rnorm(10000)) %>%
  mutate(run = runif(10000)+.03*W) %>%
  mutate(treated = run >= .6) %>%
  mutate(Y = 2+.01*run+.5*treated+W+rnorm(10000))
bandwidth <- .02
rdd <- rdddata %>% filter(abs(run-.6)<=bandwidth) %>%
  mutate(above = run >= .6) %>%
  group_by(above) %>%
  summarize(Y = mean(Y))
rdd

## # A tibble: 2 x 2
##   above     Y
##   <lgl> <dbl>
## 1 FALSE  1.85
## 2 TRUE   2.54

Regression Discontinuity

Expressed well in graphs! Treatment should jump at cutoff. If it doesn’t jump all the way from 0% to 100%, use IV too, with being above the cutoff as the instrument
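
A sketch of the "use IV too" case on simulated data (made up for illustration). Crossing the cutoff only raises the chance of treatment from 20% to 80%, and the true effect of treatment on Y is 2.

```r
library(dplyr)

set.seed(1000)
# Simulated data (not course data): treatment jumps from 20% to 80% at the cutoff
fuzzy <- tibble(run = runif(100000)) %>%
  mutate(above = run >= .6) %>%
  mutate(treated = rbinom(100000, 1, ifelse(above, .8, .2))) %>%
  mutate(Y = 1 + 2*treated + rnorm(100000))

# Focus right around the cutoff and compare the two sides
bandwidth <- .05
jumps <- fuzzy %>%
  filter(abs(run - .6) <= bandwidth) %>%
  group_by(above) %>%
  summarize(treated = mean(treated), Y = mean(Y))
jumps

# IV logic with 'above' as the instrument: jump in Y divided by jump in treatment
fuzzy.effect <- (jumps$Y[2] - jumps$Y[1]) / (jumps$treated[2] - jumps$treated[1])
fuzzy.effect
```

The raw jump in Y understates the effect because only some people change treatment at the cutoff. Dividing by the jump in treatment scales it back up.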

Regression Discontinuity

Variables other than Y and treatment shouldn’t jump at cutoff - they should be balanced
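
A sketch of a balance check, reusing the simulated setup from the earlier regression discontinuity slide. W stands in for any variable other than Y and treatment.

```r
library(dplyr)

set.seed(1000)
# Same simulated setup as the earlier RD slide
rdddata <- tibble(W = rnorm(10000)) %>%
  mutate(run = runif(10000) + .03*W) %>%
  mutate(treated = run >= .6) %>%
  mutate(Y = 2 + .01*run + .5*treated + W + rnorm(10000))

# Compare averages of W and Y on either side of the cutoff, within the bandwidth
bandwidth <- .02
balance <- rdddata %>%
  filter(abs(run - .6) <= bandwidth) %>%
  group_by(treated) %>%
  summarize(W = mean(W), Y = mean(Y), n = n())
balance

# W should be about the same on both sides; Y should not
W.jump <- balance$W[2] - balance$W[1]
Y.jump <- balance$Y[2] - balance$Y[1]
c(W.jump = W.jump, Y.jump = Y.jump)
```

With only a couple hundred observations on each side, the W difference won't be exactly zero. The question is whether it looks like noise or like a real jump.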

Regression Discontinuity

Technically we’re looking for how much Y jumps at the cutoff

That’s it!

In a very condensed way, that’s the material we covered!
I recommend looking back over slides, notes, homeworks
Homeworks will be most similar to the questions on the midterm