Identifying X -> Y by closing back doors:
X to Y on the diagramX ->) and which are “back doors” (start with X <-)X -> C <- Y)BCR, which prevent housing from being built, on RentSupply of housing built, characteristics of the location that lead to BCRs being passed, Demand for housing in the area, the overall economy…One answer, with non-BCR Laws, Labor market, economy:
Open front doors:
BCR -> Sup -> RentOpen back doors:
BCR <- U1 -> Laws -> RentBCR <- loc -> Laws -> RentBCR <- loc -> Dem -> RentBCR <- loc -> Sup -> RentWhich others paths are there, closed by colliders?
loc that shuts it back downloc and LawsW by seeing what W explains (sometimes using cut()) and taking it outlibrary(Ecdat)
data(BudgetFood)
cor(BudgetFood$wfood,BudgetFood$totexp)## [1] -0.5125209
BudgetFood <- BudgetFood %>% group_by(cut(age,breaks=5)) %>%
mutate(wfood.r = wfood - mean(wfood),totexp.r = totexp-mean(totexp))
cor(BudgetFood$wfood.r,BudgetFood$totexp.r)## [1] -0.4852561
locLawsZ and X are related, and all open paths from Z to Y go through X, then Z can be an instrument for XZ. No back doors in that variation! We have a causal effectdf <- tibble(W = rnorm(1000),Z=sample(c(0,1),1000,replace=T)) %>%
mutate(X = rnorm(1000) + W + Z) %>%
mutate(Y = rnorm(1000) + 3*X - 10*W)
cor(df$X,df$Y)## [1] -0.286212
iv <- df %>% group_by(Z) %>%
summarize(X = mean(X),Y=mean(Y))
(iv$Y[2]-iv$Y[1])/(iv$X[2]-iv$X[1])## [1] 3.542106
D, and want to compare a treated group (D=1) to an untreated one (D=0)inner_join to match up treated (“male”) and untreated (“female”) observationsbf <- BudgetFood %>% select(wfood,size,town,sex) %>%
mutate(size.c=cut(size,breaks=3)) %>%
group_by(size.c,town,sex) %>%
summarize(wfood = mean(wfood)) %>% ungroup()
bf.male <- filter(bf,sex=="man") %>% rename(wfood.m = wfood) %>% select(-sex)
bf.female <- filter(bf,sex=="woman") %>% rename(wfood.f = wfood) %>% select(-sex)
matched <- inner_join(bf.male,bf.female)
mean(matched$wfood.m)## [1] 0.4240166
mean(matched$wfood.f)## [1] 0.4300931
time is a confounder, but you can’t control for itdiddata <- tibble(Group=c(rep("C",2500),rep("T",2500)),
Time=rep(c(rep("Before",1250),rep("After",1250)),2)) %>%
mutate(Treated = (Group == "T") & Time == "After") %>%
mutate(Y = 2*(Group == "T") + 1.5*(Time == "After") + 3*Treated + rnorm(5000))
did <- diddata %>% group_by(Group,Time) %>% summarize(Y = mean(Y))
before.after.control <- did$Y[1] - did$Y[2]
before.after.treated <- did$Y[3] - did$Y[4]
did.effect <- before.after.treated - before.after.control
did.effect## [1] 3.007027
D that is assigned based on a cutoff in a running variable, we can use regression discontinuityrdddata <- tibble(W=rnorm(10000)) %>%
mutate(run = runif(10000)+.03*W) %>%
mutate(treated = run >= .6) %>%
mutate(Y = 2+.01*run+.5*treated+W+rnorm(10000))
bandwidth <- .02
rdd <- rdddata %>% filter(abs(run-.6)<=bandwidth) %>%
mutate(above = run >= .6) %>%
group_by(above) %>%
summarize(Y = mean(Y))
rdd## # A tibble: 2 x 2
## above Y
## <lgl> <dbl>
## 1 FALSE 1.85
## 2 TRUE 2.54
Y and treatment shouldn’t jump at cutoff - they should be balanced