Identifying X -> Y
by closing back doors:
X
to Y
on the diagramX ->
) and which are “back doors” (start with X <-
)X -> C <- Y
)BCR
, which prevent housing from being built, on Rent
Sup
ply of housing built, characteristics of the loc
ation that lead to BCR
s being passed, Dem
and for housing in the area, the overall economy…One answer, with non-BCR Laws
, Labor
market, econ
omy:
Open front doors:
BCR -> Sup -> Rent
Open back doors:
BCR <- U1 -> Laws -> Rent
BCR <- loc -> Laws -> Rent
BCR <- loc -> Dem -> Rent
BCR <- loc -> Sup -> Rent
Which others paths are there, closed by colliders?
loc
that shuts it back downloc
and Laws
W
by seeing what W
explains (sometimes using cut()
) and taking it outlibrary(Ecdat)
data(BudgetFood)
cor(BudgetFood$wfood,BudgetFood$totexp)
## [1] -0.5125209
BudgetFood <- BudgetFood %>% group_by(cut(age,breaks=5)) %>%
mutate(wfood.r = wfood - mean(wfood),totexp.r = totexp-mean(totexp))
cor(BudgetFood$wfood.r,BudgetFood$totexp.r)
## [1] -0.4852561
loc
Laws
Z
and X
are related, and all open paths from Z
to Y
go through X
, then Z
can be an instrument for X
Z
. No back doors in that variation! We have a causal effectdf <- tibble(W = rnorm(1000),Z=sample(c(0,1),1000,replace=T)) %>%
mutate(X = rnorm(1000) + W + Z) %>%
mutate(Y = rnorm(1000) + 3*X - 10*W)
cor(df$X,df$Y)
## [1] -0.286212
iv <- df %>% group_by(Z) %>%
summarize(X = mean(X),Y=mean(Y))
(iv$Y[2]-iv$Y[1])/(iv$X[2]-iv$X[1])
## [1] 3.542106
D
, and want to compare a treated group (D=1
) to an untreated one (D=0
)inner_join
to match up treated (“male”) and untreated (“female”) observationsbf <- BudgetFood %>% select(wfood,size,town,sex) %>%
mutate(size.c=cut(size,breaks=3)) %>%
group_by(size.c,town,sex) %>%
summarize(wfood = mean(wfood)) %>% ungroup()
bf.male <- filter(bf,sex=="man") %>% rename(wfood.m = wfood) %>% select(-sex)
bf.female <- filter(bf,sex=="woman") %>% rename(wfood.f = wfood) %>% select(-sex)
matched <- inner_join(bf.male,bf.female)
mean(matched$wfood.m)
## [1] 0.4240166
mean(matched$wfood.f)
## [1] 0.4300931
time
is a confounder, but you can’t control for itdiddata <- tibble(Group=c(rep("C",2500),rep("T",2500)),
Time=rep(c(rep("Before",1250),rep("After",1250)),2)) %>%
mutate(Treated = (Group == "T") & Time == "After") %>%
mutate(Y = 2*(Group == "T") + 1.5*(Time == "After") + 3*Treated + rnorm(5000))
did <- diddata %>% group_by(Group,Time) %>% summarize(Y = mean(Y))
before.after.control <- did$Y[1] - did$Y[2]
before.after.treated <- did$Y[3] - did$Y[4]
did.effect <- before.after.treated - before.after.control
did.effect
## [1] 3.007027
D
that is assigned based on a cutoff in a running variable, we can use regression discontinuityrdddata <- tibble(W=rnorm(10000)) %>%
mutate(run = runif(10000)+.03*W) %>%
mutate(treated = run >= .6) %>%
mutate(Y = 2+.01*run+.5*treated+W+rnorm(10000))
bandwidth <- .02
rdd <- rdddata %>% filter(abs(run-.6)<=bandwidth) %>%
mutate(above = run >= .6) %>%
group_by(above) %>%
summarize(Y = mean(Y))
rdd
## # A tibble: 2 x 2
## above Y
## <lgl> <dbl>
## 1 FALSE 1.85
## 2 TRUE 2.54
Y
and treatment shouldn’t jump at cutoff - they should be balanced