Lecture 14 Causal Diagrams

Nick Huntington-Klein

February 25, 2019

Recap

  • Last time we talked about causality
  • The idea that if we could reach in and manipulate X, and as a result Y changes too, then X causes Y
  • We also talked about how we can identify causality in data
  • Part of that will necessarily require us to have a model

Models

  • We have to have a model to get at causality
  • A model is our way of understanding the world. It’s our idea of what we think the data-generating process is
  • Models can be informal or formal - “The sun rises every day because the earth spins” vs. super-complex astronomical models of the galaxy with thousands of equations
  • All models are wrong. Even quantum mechanics. But as long as models are right enough to be useful, we’re good to go!

Models

  • Once we do have a model, though, that model will tell us exactly how we can find a causal effect
  • (if it’s possible; it’s not always possible)
  • Sort of like how, last time, we knew how X was assigned, and using that information we were able to get a good estimate of the true treatment effect

Example

  • Let’s work through a familiar example from before, where we know the data generating process
library(tidyverse)
# Is your company in tech? Let's say 30% of firms are
df <- tibble(tech = sample(c(0,1),500,replace=TRUE,prob=c(.7,.3))) %>%
  # Tech firms on average spend $3mil more defending IP lawsuits
  mutate(IP.spend = 3*tech+runif(500,min=0,max=4)) %>%
  # Tech firms also have higher profits. But IP lawsuits lower profits
  mutate(log.profit = 2*tech - .3*IP.spend + rnorm(500,mean=2))
# Now let's check for how profit and IP.spend are correlated!
cor(df$log.profit,df$IP.spend)
## [1] 0.1609575
  • Uh-oh! The true relationship is negative, but the data says it's positive!!

Example

  • Now we can ask: what do we know about this situation?
  • How do we suspect the data was generated? (ignoring for a moment that we know already)
    • We know that being a tech company leads you to have to spend more money on IP lawsuits
    • We know that being a tech company leads you to have higher profits
    • We know that IP lawsuits lower your profits

Example

  • From this, we realize that part of what we get when we calculate cor(df$log.profit,df$IP.spend) is the influence of being a tech company
  • Meaning that if we remove that influence, what’s left over should be the actual, negative, effect of IP lawsuits
  • Now, we can get to this intuitively, but it would be much more useful if we had a more formal model that could tell us what to do in lots of situations

Causal Diagrams

  • Enter the causal diagram!
  • A causal diagram (aka a Directed Acyclic Graph) is a way of writing down your model that lets you figure out what you need to do to find your causal effect of interest
  • All you need to do to make a causal diagram is write down all the important features of the data generating process, and also write down what you think causes what!

Example

  • We know that being a tech company leads you to have to spend more money on IP lawsuits
  • We know that being a tech company leads you to have higher profits
  • We know that IP lawsuits lower your profits
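
  • One hedged way to jot these three statements down in R is as a plain edge list, one row per arrow. This is only a base-R sketch: the object names dag_edges, arrows, and causes_of_profit are made up for illustration and are not from the lecture.

```r
# Each row is one arrow in the causal diagram: `from` causes `to`.
# The object name dag_edges is hypothetical, chosen for this sketch.
dag_edges <- data.frame(
  from = c("tech", "tech", "IP.spend"),
  to   = c("IP.spend", "profit", "profit"),
  stringsAsFactors = FALSE
)

# Write each arrow in "cause -> effect" form
arrows <- paste(dag_edges$from, "->", dag_edges$to)
print(arrows)

# Everything with an arrow pointing directly into profit
causes_of_profit <- dag_edges$from[dag_edges$to == "profit"]
print(causes_of_profit)
```

  • Note that, just like the drawn diagram, the edge list records which way each arrow points but not whether the effect is positive or negative.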

Voilà

  • We have encoded everything we know about this particular little world in our diagram
  • (well, not everything, the diagram doesn’t say whether we think these effects are positive or negative)
  • Not only can we see our assumptions, but we can see how they fit together
  • For example, if we were looking for the impact of tech on profit, we’d know that it happens directly, AND happens because tech affects IP.spend, which then affects profit.
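
  • A rough check of the "direct AND through IP.spend" idea, reusing the coefficients from the simulated data earlier (3 for tech on IP.spend, 2 for tech on log.profit, -.3 for IP.spend on log.profit). The seed, the larger sample size, and the object names are assumptions made for this sketch. Comparing group averages works here only because, in this simulated world, nothing causes tech.

```r
set.seed(1000)   # arbitrary seed, only so the sketch is reproducible
n <- 100000      # much larger than the 500 firms on the slide, to cut noise

# Same data generating process as the earlier example, in base R
tech <- sample(c(0, 1), n, replace = TRUE, prob = c(.7, .3))
IP.spend <- 3 * tech + runif(n, min = 0, max = 4)
log.profit <- 2 * tech - .3 * IP.spend + rnorm(n, mean = 2)

# Direct path: tech -> profit
direct <- 2
# Indirect path: tech -> IP.spend -> profit
indirect <- 3 * -.3
total_by_hand <- direct + indirect

# Difference in average log.profit between tech and non-tech firms
total_from_data <- mean(log.profit[tech == 1]) - mean(log.profit[tech == 0])

print(total_by_hand)
print(total_from_data)
```

  • The two numbers should land close to each other: the total effect of tech is the direct arrow plus the product of the two arrows on the indirect path.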

Identification

  • And if we want to isolate the effect of IP.spend on profit, we can figure that out too
  • We call this process - isolating just the causal effect we’re interested in - “identification”
  • We’re identifying just one of those arrows, the one IP.spend -> profit, and seeing what the effect is on that arrow!

Identification

  • Based on this graph, we can see that part of the correlation between IP.spend and profit can be explained by how tech links the two.

Identification

  • Tech explains part of the correlation, but we want to identify the part that ISN’T explained by tech (the causal part). So we use only what tech doesn’t explain!
    • Use tech to explain profit, and take the residual
    • Use tech to explain IP.spend, and take the residual
    • The relationship between the first residual and the second residual is causal!
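
  • The three steps above can be sketched with lm() and resid() in base R. This is one possible way to take residuals, not necessarily how the lecture does it; the seed, sample size, and object names are assumptions for this sketch. Because tech is binary, regressing on tech is the same thing as subtracting the mean within each tech group.

```r
set.seed(1000)   # arbitrary seed, only so the sketch is reproducible
n <- 10000       # larger than the slide's 500 firms, to cut noise

# Same data generating process as the earlier example, in base R
tech <- sample(c(0, 1), n, replace = TRUE, prob = c(.7, .3))
IP.spend <- 3 * tech + runif(n, min = 0, max = 4)
log.profit <- 2 * tech - .3 * IP.spend + rnorm(n, mean = 2)

# Step 1: use tech to explain profit, and keep what is left over
profit_resid <- resid(lm(log.profit ~ tech))
# Step 2: use tech to explain IP.spend, and keep what is left over
IP_resid <- resid(lm(IP.spend ~ tech))

# Step 3: relate the two residuals to each other
raw_cor <- cor(log.profit, IP.spend)          # before controlling
resid_cor <- cor(profit_resid, IP_resid)      # after controlling
resid_slope <- unname(coef(lm(profit_resid ~ IP_resid))[2])

print(raw_cor)
print(resid_cor)
print(resid_slope)
```

  • The raw correlation should come out positive and the residual correlation negative, and the residual-on-residual slope should sit near the -.3 that was built into the data.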

Controlling

  • This process is called “adjusting” or “controlling”. We are “controlling for tech” and taking out the part of the relationship that is explained by it
  • In doing so, we’re looking at the relationship between IP.spend and profit just comparing firms that have the same level of tech.
  • This is our “apples to apples” comparison that gives us an experiment-like result

Controlling

df <- df %>% group_by(tech) %>%
  mutate(log.profit.resid = log.profit - mean(log.profit),
         IP.spend.resid = IP.spend - mean(IP.spend)) %>% ungroup()
cor(df$log.profit.resid,df$IP.spend.resid)
## [1] -0.3018621
  • Negative! Hooray

Controlling

  • Imagine we’re looking at that relationship within color
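
  • Another hedged way to see "within color": compute the correlation separately for tech and non-tech firms. The seed, sample size, and the names df_sketch and within_cor are assumptions for this base-R sketch.

```r
set.seed(1000)   # arbitrary seed, only so the sketch is reproducible
n <- 10000

# Same data generating process as the earlier example, in base R
tech <- sample(c(0, 1), n, replace = TRUE, prob = c(.7, .3))
IP.spend <- 3 * tech + runif(n, min = 0, max = 4)
log.profit <- 2 * tech - .3 * IP.spend + rnorm(n, mean = 2)
df_sketch <- data.frame(tech, IP.spend, log.profit)

# Split the firms by tech (one "color" each) and correlate within each group
within_cor <- sapply(split(df_sketch, df_sketch$tech),
                     function(d) cor(d$log.profit, d$IP.spend))
print(within_cor)
```

  • Inside each group, tech is held constant, so the relationship that remains should be negative in both groups.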

LITERALLY

Recap

  • By controlling for tech (“holding it constant”) we got rid of the part of the IP.spend/profit relationship that was explained by tech, and so managed to identify the IP.spend -> profit arrow, the causal effect we’re interested in!
  • We correctly found that it was negative
  • Remember, we made it truly negative when we created the data, all those slides ago

Causal Diagrams

  • And it was the diagram that told us to control for tech
  • It’s going to turn out that diagrams can tell us how to identify things in much more complex circumstances - we’ll get to that soon
  • But you might have noticed that it was pretty obvious what to do just by looking at the graph

Causal Diagrams

  • Can’t we just look at the data to see what we need to control for?
  • After all, that would free us from having to make all those assumptions and figure out our model
  • No!!!
  • Why? Because for a given set of data that we see, there are many different data generating processes that could have made it
  • Each requiring different kinds of adjustments in order to get it right
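
  • A small illustration of that point, using made-up variables x, y, z rather than the lecture's data. World A and World B are hypothetical data generating processes invented for this sketch. In World A, z causes both x and y and x has no effect on y. In World B, x causes z, which causes y. The numbers are chosen so that both worlds produce the same variances and covariances.

```r
set.seed(1000)   # arbitrary seed, only so the sketch is reproducible
n <- 200000

# World A: z causes both x and y; x has NO effect on y
zA <- rnorm(n)
xA <- zA + rnorm(n)
yA <- zA + rnorm(n)

# World B: x causes z, and z causes y (so x DOES affect y, through z)
xB <- rnorm(n, sd = sqrt(2))
zB <- .5 * xB + rnorm(n, sd = sqrt(.5))
yB <- zB + rnorm(n)

# Compare what the data look like in each world
cov_A <- cov(cbind(x = xA, y = yA, z = zA))
cov_B <- cov(cbind(x = xB, y = yB, z = zB))
print(round(cov_A, 2))
print(round(cov_B, 2))

# Largest difference between the two covariance matrices
max_gap <- max(abs(cov_A - cov_B))
print(max_gap)
```

  • The two covariance matrices should match up to sampling noise, even though the effect of x on y is zero in World A and nonzero in World B. Nothing in the data alone separates the two; only the model does.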

Causal Diagrams

  • We observe that profit (y), IP.spend (x), and tech (z) are all related… but which diagram is it?

Causal Diagrams

  • With only the data to work with we have literally no way of knowing which of those is true
  • Maybe IP.spend causes companies to be tech companies (in 2, 3, 6)
  • We know that’s silly because we have an idea of what the model is
  • But that’s what lets us know it’s wrong - the model. With just the data we have no clue.

Causal Diagrams

  • Next time we’ll set about actually making one of these diagrams
  • And soon we’ll be looking for what the diagrams tell us about how to identify an effect!

Practice

  • Load in data(swiss) and use help to look at it
  • Get the correlation between Fertility and Education
  • Think about what direction the arrows might go on a diagram with Fertility, Education, and Agriculture
  • Get the correlation between Fertility and Education controlling for Agriculture (use cut with breaks=3)

Practice Answers

library(tidyverse)
data(swiss)
help(swiss)
cor(swiss$Fertility,swiss$Education)
## [1] -0.6637889

swiss <- swiss %>%
  group_by(cut(Agriculture,breaks=3)) %>%
  mutate(Fert.resid = Fertility - mean(Fertility),
         Ed.resid = Education - mean(Education)) %>%
  ungroup()

cor(swiss$Fert.resid,swiss$Ed.resid)
## [1] -0.5560316