Lecture 26 Causal Inference Midterm Review

Nick Huntington-Klein

March 28, 2019

Causal Inference Midterm

  • Similar format to the homeworks we’ve been having
  • At least one question evaluating a research question and drawing a dagitty graph
  • At least one question identifying the right causal inference method to use
  • At least one question about the feature(s) of the methods
  • At least one question carrying out a method in R

Causal Inference Midterm

  • Covers everything up through IV (the focus is on material since the Programming Midterm, but there is a little programming)
  • No internet (except dagitty) or slides available this time
  • One 3x5 index card, front and back
  • You’ll have the whole class period, so don’t be late!

Causal Diagrams

  1. Consider all the variables that are likely to be important in the data generating process (this includes variables you can’t observe)
  2. For simplicity, combine variables or prune the ones least likely to be important
  3. Consider which variables are likely to affect which other variables and draw arrows from one to the other (a dagitty sketch follows this list)
  4. (Bonus: Test some implications of the model to see if you have the right one)
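
We can draw diagrams like this directly in R with the dagitty package (the same engine as dagitty.net). A minimal sketch with placeholder variables (X, Y, and a confounder W, not from any particular example):

library(dagitty)
# One statement per arrow: W affects both X and Y, giving X -> Y a back door
dag <- dagitty("dag {
  X -> Y
  W -> X
  W -> Y
}")
plot(graphLayout(dag))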

Causal Diagrams

Identifying X -> Y by closing back doors:

  1. Find all the paths from X to Y on the diagram
  2. Determine which are “front doors” (start with X ->) and which are “back doors” (start with X <-)
  3. Determine which are already closed by colliders (X -> C <- Y)
  4. Then, identify the effect by finding which variables you need to control for to close all back doors (careful - don’t close the front doors, or open back up paths with colliders!). dagitty can check this for us, as sketched below
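
dagitty can do this path-hunting for us. A minimal sketch with placeholder variables, assuming one front door, one back door through W, and one path closed by a collider C:

library(dagitty)
dag <- dagitty("dag {
  X -> Y
  X <- W -> Y
  X -> C <- Y
}")
paths(dag,"X","Y")  # lists every path and whether it is open
adjustmentSets(dag,exposure="X",outcome="Y")  # should report { W }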

Causal Diagrams

  • Let’s draw (and justify) a diagram to get the effect of Building Code Restrictions (BCR), which prevent housing from being built, on Rent
  • Consider perhaps: the Supply of housing built, characteristics of the location that lead to BCRs being passed, Demand for housing in the area, the overall economy…

Causal Diagram Answer

One answer, with non-BCR Laws, the labor market, and the economy; the diagram (omitted here) uses nodes BCR, Sup, Dem, loc, Laws, U1, and Rent.

Causal Diagram Answer

Open front doors:

  • BCR -> Sup -> Rent
  • (note all others closed because they use Sup as a collider)

Open back doors:

  • BCR <- U1 -> Laws -> Rent
  • BCR <- loc -> Laws -> Rent
  • BCR <- loc -> Dem -> Rent
  • BCR <- loc -> Sup -> Rent

Which other paths are there, closed by colliders?

Causal Diagram Answer

  • If we control for Laws, then BCR <- U1 -> Laws <- loc -> etc. opens back up!
  • Thankfully if we control for loc that shuts it back down
  • We can identify this by controlling just for loc and Laws (checked with dagitty below)
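
We can verify that with dagitty. Here is one hypothetical encoding, built only from the paths listed above (the actual lecture diagram may have more arrows); U1 is marked latent because we can’t observe it:

library(dagitty)
rentdag <- dagitty("dag {
  U1 [latent]
  BCR -> Sup -> Rent
  U1 -> BCR
  U1 -> Laws -> Rent
  loc -> BCR
  loc -> Laws
  loc -> Dem -> Rent
  loc -> Sup
}")
# For this encoding, the minimal adjustment set should be { Laws, loc }
adjustmentSets(rentdag,exposure="BCR",outcome="Rent")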

Controlling

  • One way to close back doors is by controlling
  • Control for W by seeing what W explains (sometimes binning with cut()) and taking that part out
library(tidyverse)
library(Ecdat)
data(BudgetFood)
# Raw relationship between food share and total spending
cor(BudgetFood$wfood,BudgetFood$totexp)
## [1] -0.5125209
# Control for age: remove the within-age-bin means from both variables
BudgetFood <- BudgetFood %>% group_by(cut(age,breaks=5)) %>%
  mutate(wfood.r = wfood - mean(wfood),totexp.r = totexp-mean(totexp)) %>%
  ungroup()
cor(BudgetFood$wfood.r,BudgetFood$totexp.r)
## [1] -0.4852561

Fixed Effects

  • If we have data where we observe the same people over and over, we can implement fixed effects by controlling for individual (a sketch follows this list)
  • In our rent example, this would be a control for loc
  • This accounts for everything that’s constant within individual. Here, geography, etc.
  • Doesn’t account for things that vary within individual over time, like Laws
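
A minimal sketch of fixed effects by demeaning, using a made-up panel (the names id, year, and quality are hypothetical). Demeaning within id removes anything constant within individual, like quality here:

library(tidyverse)
panel <- tibble(id=rep(1:100,each=5),          # 100 people, 5 years each
                year=rep(2011:2015,100),
                quality=rep(rnorm(100),each=5)) %>%  # constant within id
  mutate(X=rnorm(500)+quality,
         Y=rnorm(500)+2*X-3*quality)
panel <- panel %>% group_by(id) %>%
  mutate(X.r=X-mean(X),Y.r=Y-mean(Y)) %>% ungroup()
cor(panel$X.r,panel$Y.r)  # the quality back door is closed; raw cor(X,Y) is not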

Instrumental Variables

  • Or, we can ignore those back doors altogether if we have an instrumental variable
  • If Z and X are related, and all open paths from Z to Y go through X, then Z can be an instrument for X
  • We isolate JUST the variation that comes from Z. No back doors in that variation! We have a causal effect
  • Can conceptually think of it as (or literally apply it to) an experiment where randomization doesn’t work perfectly

Instrumental Variables

# Simulate an instrument Z (like random assignment) and a confounder W
df <- tibble(W = rnorm(1000),Z=sample(c(0,1),1000,replace=T)) %>%
  mutate(X = rnorm(1000) + W + Z) %>%
  mutate(Y = rnorm(1000) + 3*X - 10*W)

# The raw correlation is badly confounded by W
cor(df$X,df$Y)
## [1] -0.286212
# Compare means by Z, then scale the jump in Y by the jump in X (the Wald estimator)
iv <- df %>% group_by(Z) %>%
  summarize(X = mean(X),Y=mean(Y))
(iv$Y[2]-iv$Y[1])/(iv$X[2]-iv$X[1])
## [1] 3.542106

Treated and Untreated Groups

  • In many cases, we want to know the effect of having something or not, D, and want to compare a treated group (D=1) to an untreated one (D=0)
  • In each case we are trying to find apples-to-apples comparisons
  • Controlling works for this, but there are many other methods
  • How can we make our two groups comparable?

Matching

  • Instead of controlling, we can construct our treatment and control groups using matching
  • In class we’ve used Coarsened Exact Matching
  • This closes back doors for whatever we matched on
  • Works, like controlling, if we can observe and measure all the variables necessary to block the back doors

Matching

  • What is the effect of gender on the proportion of your income spent on food?
  • We’ll match on everything else in the data
  • Use inner_join to match up treated (“man”) and untreated (“woman”) observations

Matching

# Coarsen household size into 3 bins, then average wfood within each matching cell
bf <- BudgetFood %>% select(wfood,size,town,sex) %>%
  mutate(size.c=cut(size,breaks=3)) %>%
  group_by(size.c,town,sex) %>%
  summarize(wfood = mean(wfood)) %>% ungroup()

# Split into treated (man) and untreated (woman) cells
bf.male <- filter(bf,sex=="man") %>% rename(wfood.m = wfood) %>% select(-sex)
bf.female <- filter(bf,sex=="woman") %>% rename(wfood.f = wfood) %>% select(-sex)

# Keep only cells that appear in both groups, matching on size.c and town
matched <- inner_join(bf.male,bf.female,by=c("size.c","town"))
mean(matched$wfood.m)
## [1] 0.4240166
mean(matched$wfood.f)
## [1] 0.4300931

Difference-in-Difference

  • Difference-in-Difference applies when you have a group that you can observe both before and after the policy
  • You worry that time is a confounder, but you can’t control for it…
  • …unless you add a control group that DIDN’T get the policy

Difference-in-Difference

  • Get the before-after difference for both groups
  • Then subtract out the difference for the control
# Simulate treated (T) and control (C) groups before and after a policy
diddata <- tibble(Group=c(rep("C",2500),rep("T",2500)),
                  Time=rep(c(rep("Before",1250),rep("After",1250)),2)) %>%
  mutate(Treated = (Group == "T") & Time == "After") %>%
  mutate(Y = 2*(Group == "T") + 1.5*(Time == "After") + 3*Treated + rnorm(5000))
# Rows come out ordered (C,After), (C,Before), (T,After), (T,Before)
did <- diddata %>% group_by(Group,Time) %>% summarize(Y = mean(Y))
before.after.control <- did$Y[1] - did$Y[2]
before.after.treated <- did$Y[3] - did$Y[4]
# Difference-in-difference: the treated group's change minus the control's change
did.effect <- before.after.treated - before.after.control
did.effect
## [1] 3.007027

Regression Discontinuity

  • If we have a treatment D that is assigned based on a cutoff in a running variable, we can use regression discontinuity
  • Focus right around the cutoff and compare above-cutoff to below-cutoff
  • We’ve isolated a great set of treatment/control groups because in this area it’s basically random whether you’re above or below the cutoff

Regression Discontinuity

# Simulate a running variable nudged a little by W; treatment starts at run >= .6
rdddata <- tibble(W=rnorm(10000)) %>%
  mutate(run = runif(10000)+.03*W) %>%
  mutate(treated = run >= .6) %>%
  mutate(Y = 2+.01*run+.5*treated+W+rnorm(10000))
# Compare mean Y just below vs. just above the cutoff
bandwidth <- .02
rdd <- rdddata %>% filter(abs(run-.6)<=bandwidth) %>%
  mutate(above = run >= .6) %>%
  group_by(above) %>%
  summarize(Y = mean(Y))
rdd
## # A tibble: 2 x 2
##   above     Y
##   <lgl> <dbl>
## 1 FALSE  1.85
## 2 TRUE   2.54

Regression Discontinuity

  • Expressed well in graphs! Treatment should jump at the cutoff. If it doesn’t jump perfectly from 0% to 100%, use IV too (sketched below)
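
A sketch of that combined (“fuzzy”) case, with made-up data where crossing the cutoff only raises the chance of treatment from 20% to 80%:

fuzzydata <- tibble(run=runif(10000)) %>%
  mutate(treated=runif(10000) < .2+.6*(run>=.6)) %>%
  mutate(Y=1+2*treated+rnorm(10000))
fuzzy <- fuzzydata %>% filter(abs(run-.6)<=.02) %>%
  mutate(above=run>=.6) %>%
  group_by(above) %>%
  summarize(Y=mean(Y),treated=mean(treated))
# IV at the cutoff: the jump in Y divided by the jump in the treatment rate
(fuzzy$Y[2]-fuzzy$Y[1])/(fuzzy$treated[2]-fuzzy$treated[1])  # near the true 2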

Regression Discontinuity

  • Variables other than Y and treatment shouldn’t jump at the cutoff - they should be balanced (checked below)
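
With the simulated rdddata and bandwidth from before, a quick balance check on W:

rdddata %>% filter(abs(run-.6)<=bandwidth) %>%
  mutate(above=run>=.6) %>%
  group_by(above) %>%
  summarize(W=mean(W))  # the two means should be close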

Regression Discontinuity

  • Technically we’re looking for how much the average outcome, given the running variable, jumps at the cutoff

That’s it!

  • In a very condensed way, that’s the material we covered!
  • I recommend looking back over slides, notes, homeworks
  • Homeworks will be most similar to the questions on the midterm