Lecture 18 Closing Back Doors: Controlling

Nick Huntington-Klein

March 6, 2019

Recap

  • We discussed how to draw a causal diagram
  • How to identify the front and back door paths
  • And how we can close those back door paths by controlling/adjusting in order to identify the front-door paths we want!
  • And so we get our causal effect

Today

  • Today we’re going to go a little deeper into what it means to actually control/adjust for things
  • And we’re also going to talk about times when controlling/adjusting makes things WORSE - collider bias!
  • I’m going to just start saying “controlling”, by the way - “adjusting” is a little more accurate, but “controlling” is more common

Controlling

  • Up to now, here’s how we’ve been getting the relationship between X and Y while controlling for W:
  1. See what part of X is explained by W, and subtract it out. Call the result the residual part of X.
  2. See what part of Y is explained by W, and subtract it out. Call the result the residual part of Y.
  3. Get the relationship between the residual part of X and the residual part of Y.
  • With the last step including things like getting the correlation, plotting the relationship, calculating the variance explained, or comparing mean Y across values of X

In code

library(tidyverse)

# w causes both x and y, so x <- w -> y is an open back door
df <- tibble(w = rnorm(100)) %>%
  mutate(x = 2*w + rnorm(100)) %>%
  mutate(y = 1*x + 4*w + rnorm(100))
cor(df$x,df$y)
## [1] 0.9479742
# Residualize: within five bins of w, subtract each bin's mean of x and y
df <- df %>% group_by(cut(w,breaks=5)) %>%
  mutate(x.resid = x - mean(x),
         y.resid = y - mean(y)) %>%
  ungroup()
cor(df$x.resid,df$y.resid)
## [1] 0.7367752

In Diagrams

  • The relationship between X and Y reflects both X->Y and X<-W->Y
  • We remove the part of X and Y that W explains to get rid of X<-W and W->Y, blocking X<-W->Y and leaving X->Y
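  • One way to check this path logic in R (not something the slides use) is the dagitty package, which lists the control sets that close every back door:

library(dagitty)

# The diagram from above: X -> Y is the front door, X <- W -> Y the back door
g <- dagitty("dag {
  W -> X
  W -> Y
  X -> Y
}")
# Which controls identify the effect of X on Y? This should report { W }
adjustmentSets(g, exposure = "X", outcome = "Y")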

More than One Variable

  • It’s quite possible to control for more than one variable at a time
  • Although we won’t be doing it much in this class
  • A common way to do this is called multiple regression
  • You can do it with our method too, but it gets tedious pretty quickly

More than One Variable

# Now there are two back doors: one through w and one through v
df <- tibble(w = rnorm(100),v=rnorm(100)) %>%
  mutate(x = 2*w + 3*v + rnorm(100)) %>%
  mutate(y = 1*x + 4*w + 1.5*v + rnorm(100))
cor(df$x,df$y)
## [1] 0.9340934
# Residualize on w first, then residualize those residuals on v
df <- df %>% group_by(cut(w,breaks=5)) %>%
  mutate(x.resid = x - mean(x),
         y.resid = y - mean(y)) %>%
  group_by(cut(v,breaks=5)) %>%
  mutate(x.resid2 = x.resid - mean(x.resid),
         y.resid2 = y.resid - mean(y.resid)) %>%
  ungroup()
cor(df$x.resid2,df$y.resid2)
## [1] 0.7419072
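  • For comparison, the multiple regression mentioned earlier does both controls in one step; with the df we just built, the coefficient on x should land near its true value of 1 (no seed is set, so exact numbers will vary):

# Multiple regression controls for w and v simultaneously
lm(y ~ x + w + v, data = df)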

Graphically

(figure from the original slides omitted)

Intuitively

  • So what does this actually mean? Why do we do it this way?
  • As mentioned before, the goal here is to remove X<-W and W->Y so as to close the back door
  • But the way we actually do this is by removing differences that are predicted by W
  • In other words, we are comparing people as though they had the same value of W

Intuitively

  • That’s why you hear some people refer to controlling as “holding W constant” - we literally remove the variation in W, leaving it “constant”
  • Another way of thinking of it is that you’re looking for variation in X and Y within values of W (shown as an animation in the original slides)
  • Comparing apples to apples

Intuitively

  • Thinking about it this way also makes it clear that there are other ways to control for things besides the method we’ve outlined
  • Anything that ensures that we’re looking at observations with the same (or at least very very similar) values of W is in effect controlling for W
  • A common way this happens is by selecting a sample
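  • Here’s a quick sketch of that idea (not from the slides): redo our earlier data and compare the raw correlation to the correlation in a sample restricted to a narrow slice of w (the .25 width is arbitrary):

# Same data-generating process as before, with more observations
df2 <- tibble(w = rnorm(10000)) %>%
  mutate(x = 2*w + rnorm(10000),
         y = 1*x + 4*w + rnorm(10000))
# Back door open: the raw correlation is inflated
cor(df2$x, df2$y)
# Keeping only near-identical values of w is, in effect, controlling for w
with(filter(df2, abs(w) < .25), cor(x, y))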

An Example

  • We’ll borrow an example from the Wooldridge econometrics textbook (data available in the wooldridge package)
  • LaLonde (1986) is a study of whether a job training program improves earnings in 1978 (re78)
  • Specifically, it has data on an experiment of assigning people to a job training program (data jtrain2)
  • And also data on people who chose to participate in that program, or didn’t (data jtrain3)
  • The goal of causal inference - do something to jtrain3 so it gives us the “correct” result from jtrain2

LaLonde

library(wooldridge)
#EXPERIMENT
data(jtrain2)
jtrain2 %>% group_by(train) %>% summarize(wage = mean(re78))
## # A tibble: 2 x 2
##   train  wage
##   <int> <dbl>
## 1     0  4.55
## 2     1  6.35
#BY CHOICE
data(jtrain3)
jtrain3 %>% group_by(train) %>% summarize(wage = mean(re78))
## # A tibble: 2 x 2
##   train  wage
##   <int> <dbl>
## 1     0 21.6 
## 2     1  6.35

Hmm…

  • What back doors might the jtrain3 analysis be facing?
  • People who need training want to get it but are likely to get lower wages anyway!

Apples to Apples

  • The two data sets are looking at very different groups of people!
library(stargazer)
stargazer(select(jtrain2,re75,re78),type='text')
stargazer(select(jtrain3,re75,re78),type='text')
## 
## ===========================================================
## Statistic  N  Mean  St. Dev.  Min  Pctl(25) Pctl(75)  Max  
## -----------------------------------------------------------
## re75      445 1.377  3.151     0      0       1.2      25  
## re78      445 5.301  6.631   0.000  0.000    8.125   60.308
## -----------------------------------------------------------
## 
## ===============================================================
## Statistic   N    Mean  St. Dev.  Min  Pctl(25) Pctl(75)   Max  
## ---------------------------------------------------------------
## re75      2,675 17.851  13.878    0     7.6      25.6     157  
## re78      2,675 20.502  15.633  0.000  9.243    28.816  121.174
## ---------------------------------------------------------------

Controlling

  • We can’t measure “needs training” directly, but we can sort of control for it by limiting ourselves solely to the kind of people who need it - those who had low earnings in 1975!
## # A tibble: 2 x 2
##   train  wage
##   <int> <dbl>
## 1     0  4.55
## 2     1  6.35
## # A tibble: 2 x 2
##   train  wage
##   <int> <dbl>
## 1     0  5.62
## 2     1  6.00
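  • The code behind that second table is hidden on the slide; here’s a guess at it - the need.tr definition and the 1.2 cutoff (the 75th percentile of re75 in the experimental data) are assumptions, not the slide’s actual code:

# Flag people with low 1975 earnings as "needing training" (cutoff assumed)
jtrain3 <- jtrain3 %>% mutate(need.tr = re75 <= 1.2)
#EXPERIMENT
jtrain2 %>% group_by(train) %>% summarize(wage = mean(re78))
#BY CHOICE, restricted to low 1975 earners
jtrain3 %>% filter(need.tr) %>%
  group_by(train) %>% summarize(wage = mean(re78))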

Controlling

  • Not exactly the same (not surprising - we were pretty arbitrary in how we controlled for need.tr, and we never closed train <- U -> wage, oh and we left out plenty of other back doors: race, age, etc.) but an improvement
  • This goes to show that choosing a sample is a form of controlling
  • ANYTHING that ensures you’re looking at observations with similar values of W is a form of controlling for W

Bad Controls

  • So far so good - we have the concept of what it means to control and some ways we can do it, so we can get apples-to-apples comparisons
  • But what should we control for?
  • Everything, right? We want to make sure our comparison is as apple-y as possible!
  • Well, no, not actually

Bad Controls

  • Some controls can take you away from showing you the front door
  • We already discussed how it’s not a good idea to block a front-door path.
  • An increase in the price of cigarettes might improve your health, but not if we control for the number of cigarettes you smoke!

Bad Controls

  • There is another kind of bad control - a collider
  • Basically, if you’re listing out paths, and you see a path where the arrows collide by both pointing at the same variable, that path is already blocked
  • Like this: X <- W -> C <- Z -> Y
  • Note the -> C <-. Those arrows are colliding!
  • If we control for the collider C, that path opens back up!
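  • To see it concretely, here’s that path in dagitty (the X -> Y front door is added for concreteness; it isn’t part of the path above):

library(dagitty)
g2 <- dagitty("dag {
  W -> X
  W -> C
  Z -> C
  Z -> Y
  X -> Y
}")
# With no controls, X <- W -> C <- Z -> Y shows up as closed
paths(g2, from = "X", to = "Y")
# Condition on the collider C and that path opens back up
paths(g2, from = "X", to = "Y", Z = "C")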

Colliders

  • One kind of diagram (of many) where this might pop up: one where x and y are linked by the path x <- a -> m <- b -> y

Colliders

  • How could this be?
  • Because even if two variables cause the same thing (a -> m, b -> m), that doesn’t make them related. Your parents both caused your genetic makeup, but that doesn’t make their genetics related. Knowing dad’s eye color tells you nothing about mom’s.
  • But within given values of the collider, they ARE related. If you’re brown-eyed, then observing that your dad has blue eyes tells you that your mom is brown-eyed

Colliders

  • So here, x <- a -> m <- b -> y is pre-blocked, no problem. a and b are unrelated, so no back door issue!
  • Control for m and now a and b are related, back door path open.

Example

  • You want to know if programming skills reduce your social skills
  • So you go to a tech company and test all their employees on programming and social skills
  • Let’s imagine that the truth is that programming skills and social skills are unrelated
  • But you find a negative relationship! What gives?

Example

  • Oops! By surveying only the tech company, you controlled for “works in a tech company”
  • To get hired there, you need programming skills, social skills, or both - so “works in a tech company” is a collider!

Example

set.seed(14233)
# Programming and social skills are unrelated, but hiring selects on both
survey <- tibble(prog=rnorm(1000),social=rnorm(1000)) %>%
  mutate(hired = (prog + social > .25))
#Truth
cor(survey$prog,survey$social)
## [1] 0.03710333
#Controlling by just surveying those hired
cor(filter(survey,hired==1)$prog,filter(survey,hired==1)$social)
## [1] -0.4789209
#Surveying everyone and controlling with our normal method
survey <- survey %>%
  group_by(hired) %>%
  mutate(p.resid = prog - mean(prog),
         s.resid = social - mean(social)) %>%
  ungroup()
cor(survey$p.resid,survey$s.resid)
## [1] -0.4268598

Graphically

(figure from the original slides omitted)

Colliders

  • This doesn’t just create correlations from nothing, it can also distort causal effects that ARE there
  • For example, did you know that height is UNrelated to basketball skill… among NBA players?
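  • A quick simulation of that claim (all numbers invented): height genuinely raises skill, but selecting on making the league - which rewards both - flattens or even flips the relationship:

nba.sim <- tibble(height = rnorm(10000)) %>%
  mutate(skill = .5*height + rnorm(10000),
         in.nba = height + skill > 2.5)
# The true positive relationship between height and skill...
cor(nba.sim$height, nba.sim$skill)
# ...weakens or reverses among those who made the NBA
with(filter(nba.sim, in.nba), cor(height, skill))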

Colliders

  • Sometimes, things can get real tricky
  • In some cases, the same variable NEEDS to be controlled for to close a back door path, but it’s a collider on ANOTHER back door path!
  • In those cases you just can’t identify the effect, at least not easily
  • This pops up in estimates of the gender wage gap - example from Cunningham’s Mixtape: should you control for occupation when looking at gender discrimination in the labor market?

Colliders in the Gender Wage Gap

  • We are interested in gender -> discrim -> wage; our treatment is gender -> discrim, the discrimination caused by your gender

Colliders in the Gender Wage Gap

  • Front doors:
      gender -> discrim -> wage
      gender -> discrim -> occup -> wage
  • Open back door:
      discrim <- gender -> occup -> wage
  • Closed back doors (blocked by the collider at occup):
      discrim <- gender -> occup <- abil -> wage
      gender -> discrim -> occup <- abil -> wage

Colliders in the Gender Wage Gap

  • No occup control? We leave the back door discrim <- gender -> occup -> wage open - nondiscriminatory reasons to choose different occupations by gender
  • Control for occup? Open both back doors, create a correlation between abil and discrim where there wasn’t one
  • And also close a FRONT door, gender -> discrim -> occup -> wage: discriminatory reasons for gender diffs in occup
  • We actually can’t identify the effect we want in this diagram by controlling. It happens!
  • Suggests this question goes beyond just controlling for stuff. Real research on this topic gets clever.
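  • As a check (not on the slides), we can hand this diagram to dagitty and list the discrim-to-wage paths with and without controlling for occup; the paths through abil flip from closed to open:

library(dagitty)
wagedag <- dagitty("dag {
  gender -> discrim
  gender -> occup
  discrim -> occup
  discrim -> wage
  occup -> wage
  abil -> occup
  abil -> wage
}")
# No controls: the paths through abil are blocked at the occup collider
paths(wagedag, from = "discrim", to = "wage")
# Controlling for occup closes the occup front door but opens the abil paths
paths(wagedag, from = "discrim", to = "wage", Z = "occup")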

Next Time

  • Get ready! Next time we’ll begin our trek down the list of common causal inference methods as they actually get used!
  • Many of them apply controlling for stuff in interesting ways
  • Others use methods other than controlling!
  • This is what economists and many data scientists actually do with their time
  • We will begin with “fixed effects”