- Last time we talked about causality
- The idea that if we could reach in and manipulate `X`, and as a result `Y` changes too, then `X` *causes* `Y`
- We also talked about how we can identify causality in data
- Part of that will necessarily require us to have a model

- We *have to have a model* to get at causality
  - A model is our way of *understanding the world*. It’s our idea of what we think the data-generating process is
  - Models can be informal or formal
  - “The sun rises every day because the earth spins” vs. super-complex astronomical models of the galaxy with thousands of equations
- All models are wrong. Even quantum mechanics. But as long as models are right enough to be useful, we’re good to go!

- Once we *do* have a model, though, that model will tell us *exactly* how we can find a causal effect
  - (if it’s possible; it’s not always possible)
- Sort of like how, last time, we knew how `X` was assigned, and using that information we were able to get a good estimate of the true treatment effect

- Let’s work through a familiar example from before, where we know the data generating process

```
library(tidyverse)

# Is your company in tech? Let's say 30% of firms are
df <- tibble(tech = sample(c(0,1), 500, replace = TRUE, prob = c(.7,.3))) %>%
  # Tech firms on average spend $3mil more defending IP lawsuits
  mutate(IP.spend = 3*tech + runif(500, min = 0, max = 4)) %>%
  # Tech firms also have higher profits. But IP lawsuits lower profits
  mutate(log.profit = 2*tech - .3*IP.spend + rnorm(500, mean = 2))
# Now let's check for how profit and IP.spend are correlated!
cor(df$log.profit, df$IP.spend)
```

`## [1] 0.1609575`

- Uh-oh! The true effect of `IP.spend` on `log.profit` is negative (the `-.3` in the code), but the correlation in the data is positive!
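
To connect this back to the "reach in and manipulate `X`" idea, here is a minimal sketch of what manipulating `IP.spend` would look like in a simulation. It reuses the data generating process from the code above; the helper `make_profit`, the seed, and the sample size of 5000 are assumptions made for illustration, not part of the original example.

```r
set.seed(2000)  # arbitrary seed, only so the run is reproducible
n <- 5000       # assumption: larger than the 500 used above

tech <- sample(c(0, 1), n, replace = TRUE, prob = c(.7, .3))
noise <- rnorm(n, mean = 2)
IP.spend <- 3 * tech + runif(n, min = 0, max = 4)

# Hypothetical helper: the profit equation from the data generating process,
# holding each firm's tech status and noise fixed
make_profit <- function(ip) 2 * tech - .3 * ip + noise

profit_observed <- make_profit(IP.spend)
# "Reach in" and push every firm's IP.spend up by 1, changing nothing else
profit_manipulated <- make_profit(IP.spend + 1)

effect <- mean(profit_manipulated - profit_observed)
print(effect)
```

Because everything except `IP.spend` is held fixed, the change in profit is the `-.3` written into the data generating process, even though the raw correlation points the other way.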

- Now we can ask: *what do we know* about this situation?
  - How do we suspect the data was generated? (ignoring for a moment that we know already)
- We know that being a tech company leads you to have to spend more money on IP lawsuits
- We know that being a tech company leads you to have higher profits
- We know that IP lawsuits lower your profits

- From this, we realize that part of what we get when we calculate `cor(df$log.profit,df$IP.spend)` is the influence of being a tech company
  - Meaning that if we remove that influence, what’s left over should be the actual, negative, effect of IP lawsuits
- Now, we can get to this intuitively, but it would be much more useful if we had a more formal model that could tell us what to do in *lots* of situations
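
One possible way to "remove that influence" is sketched below: subtract the tech and non-tech group means from both variables, then correlate what is left over. This is only one of several approaches, written in base R so it runs without extra packages. The seed, the sample size of 5000, and the names `IP.resid` and `profit.resid` are assumptions for illustration.

```r
set.seed(1000)  # arbitrary seed, only so the run is reproducible
n <- 5000       # assumption: larger than the 500 used above, so the signs are stable

# Same data generating process as in the example above
tech <- sample(c(0, 1), n, replace = TRUE, prob = c(.7, .3))
IP.spend <- 3 * tech + runif(n, min = 0, max = 4)
log.profit <- 2 * tech - .3 * IP.spend + rnorm(n, mean = 2)

# Raw correlation: mixes the effect of IP.spend with the influence of tech
raw_cor <- cor(log.profit, IP.spend)

# Remove the part of each variable explained by tech:
# ave() returns each observation's group mean, so subtracting it
# leaves only the within-group variation
IP.resid <- IP.spend - ave(IP.spend, tech)
profit.resid <- log.profit - ave(log.profit, tech)
adj_cor <- cor(profit.resid, IP.resid)

print(c(raw = raw_cor, adjusted = adj_cor))
```

With the tech influence taken out, the leftover correlation should come out negative, matching the sign of the true effect.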

- Enter the causal diagram!
- A causal diagram (aka a Directed Acyclic Graph) is a way of writing down your *model* that lets you figure out what you need to do to find your causal effect of interest
  - All you need to do to make a causal diagram is write down all the important features of the data generating process, and also write down what you think causes what!

- We know that being a tech company leads you to have to spend more money on IP lawsuits
- We know that being a tech company leads you to have higher profits
- We know that IP lawsuits lower your profits
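
The three statements above are the arrows of the diagram: `tech -> IP.spend`, `tech -> log.profit`, and `IP.spend -> log.profit`. As a rough sketch, the diagram can be written down as a plain edge list and searched for variables that cause both the treatment and the outcome. The `edges` data frame and the variable names below are assumptions for illustration, and the check only looks at direct common causes, which is enough for a diagram this small.

```r
# One row per arrow, "from" causes "to"
edges <- data.frame(
  from = c("tech", "tech", "IP.spend"),
  to   = c("IP.spend", "log.profit", "log.profit"),
  stringsAsFactors = FALSE
)

treatment <- "IP.spend"
outcome <- "log.profit"

# Variables with an arrow pointing into the treatment, and into the outcome
causes_of_treatment <- edges$from[edges$to == treatment]
causes_of_outcome <- edges$from[edges$to == outcome]

# Common causes point into both, so they can contaminate the raw correlation
common_causes <- intersect(causes_of_treatment, causes_of_outcome)
print(common_causes)
```

Here the only common cause is `tech`, which is the influence we reasoned we needed to remove.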
