Lecture 18 Treatment Effects

Nick Huntington-Klein

2021-02-13

Recap

We’ve gone over all sorts of ways to estimate a causal effect
And how to tell when one is identified
But… uh… what did we just estimate exactly?
What even is the causal effect?

Treatment Effects

For any given treatment, there are likely to be many treatment effects
Different individuals will respond to different degrees (or even directions!)
This is called heterogeneous treatment effects

Treatment Effects

When we identify a treatment effect, what we’re estimating is some mixture of all those individual treatment effects
But what kind of mixture? Is it an average of all of them? An average of some of them? A weighted average? Not an average at all?
What we get depends on the research design itself as well as the estimator we use to perform that design

Individual Treatment Effects

While we can’t always estimate it directly, the true regression model becomes something like

\[ Y = \beta_0 + \beta_iX + \varepsilon \]

\(\beta_i\) follows its own distribution across individuals
(and remember, this is theoretical - we’d still have those individual \(\beta_i\)s even with one observation per individual and no way to estimate them separately)

Summarizing Effects

There are methods that try to give us the whole distribution of effects (and we’ll talk about some of them next time)
But often we only get a single effect, \(\hat{\beta}_1\).
This \(\hat{\beta}_1\) is some summary statistic of the \(\beta_i\) distribution. But what summary statistic?

Summarizing Effects

Average treatment effect: the mean of \(\beta_i\)
Conditional average treatment effect (CATE): the mean of \(\beta_i\) conditional on some value (say, “just for men”, i.e. conditional on being a man)
Weighted average treatment effect (WTE): the weighted mean of \(\beta_i\), with weights \(w_i\)

The latter two come in many flavors
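
To make the three summaries concrete, here is a minimal sketch in base R. The six effect values, the `is_man` indicator, and the weights `w_i` are all made up for illustration; in real data we would never observe `beta_i` directly.

```r
# Hypothetical individual treatment effects for six people (illustrative only)
beta_i <- c(2, 4, 6, 8, 10, 12)
is_man <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)  # hypothetical conditioning variable
w_i    <- c(1, 1, 1, 3, 3, 3)                       # hypothetical weights

ate  <- mean(beta_i)                # average treatment effect
cate <- mean(beta_i[is_man])        # conditional average: just for men
wate <- weighted.mean(beta_i, w_i)  # weighted average with weights w_i

print(c(ATE = ate, CATE = cate, Weighted = wate))
```

The same six people produce three different "effects" (7, 4, and 8.5 here), which is why it matters which summary an estimator delivers.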

Common Conditional Average Treatment Effects

The ATE among some demographic group
The ATE among some specific group (conditional average treatment effect)
The ATE just among people who were actually treated (ATT)
The ATE just among people who were NOT actually treated (ATUT)
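
In symbols, and assuming a binary treatment \(X_i \in \{0, 1\}\) (an assumption for this sketch; the regression model earlier allows any \(X\)), the last two are

\[ ATT = E[\beta_i \mid X_i = 1], \qquad ATUT = E[\beta_i \mid X_i = 0] \]

and the overall average is a mix of the two, weighted by how common each group is:

\[ ATE = P(X_i = 1)\times ATT + P(X_i = 0)\times ATUT \]

So ATT and ATE only coincide when treated and untreated people have the same average effect.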

Common Weighted Average Treatment Effects

The ATE weighted by how responsive you are to an instrument/treatment assignment (local average treatment effect)
The ATE weighted by how much variation in treatment you have after all back doors are closed (variance-weighted)
The ATE weighted by how commonly-represented your mix of control variables is (distribution-weighted)

Are They Good?

Which average you’d want depends on what you’d want to do with it
Want to know how effective a treatment was when it was applied? Average Treatment on the Treated
Want to know how effective a treatment would be if applied to everyone/at random? Average Treatment Effect
Want to know how effective a treatment would be if applied just a little more broadly? Marginal Treatment Effect (literally, the effect for the next person who would be treated), or, sometimes, Local Average Treatment Effect

Are They Good?

Different treatment effect averages aren’t wrong but we need to pay attention to which one we’re getting, or else we may apply the result incorrectly
We don’t want that!
A result could end up representing a different group than you’re really interested in
There are technical ways of figuring out what average you get, and also intuitive ways

Heterogeneous Effects in Action

Let’s simulate some data and see what different methods give us.
We’ll start with some basic data where the effect is already identified
And see what we get!

Heterogeneous Effects in Action

The effect varies according to a normal distribution, which has mean 5 for group A and mean 7 for group B (mean = 6 overall)
No back doors, this is basically random assignment / an experimental setting

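library(tidyverse)

# Each individual gets their own treatment effect beta1, drawn around their group's mean
tb <- tibble(group = sample(c('A','B'), 5000, replace = TRUE),
             W = rnorm(5000, mean = 0, sd = sqrt(8))) %>%
  mutate(beta1 = case_when(
    group == 'A' ~ rnorm(5000, mean = 5, sd = 2),
    group == 'B' ~ rnorm(5000, mean = 7, sd = 2))) %>%
  # X is assigned independently of everything else, so there are no back doors
  mutate(X = rnorm(5000)) %>%
  mutate(Y = beta1*X + rnorm(5000))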

Heterogeneous Effects in Action

We’re already identified, no adjustment necessary, so let’s just regress \(Y\) on \(X\)

	Model 1
(Intercept)	0.025
	(0.034)
X	5.975***
	(0.035)
* p < 0.1, ** p < 0.05, *** p < 0.01

We get 5.975, pretty close to the true average treatment effect of 6!
(note the standard error is nothing like the standard deviation of the treatment effect - those are measuring two very different things)

Variance Weighting

The more the treatment moves around, the easier it is to see whether it’s doing anything
So treatment effects from individuals/groups with more variance in treatment get weighted more heavily
Importantly, this is variance in treatment after controls are applied
Variance weighting pops up with most research designs that rely on controlling for stuff via regression
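
Here is a sketch of where those weights come from, under simplifying assumptions that are mine rather than the slide's: no controls, treatment centered at zero within each group \(g\), and \(\beta_i\) unrelated to \(X\) within a group. The notation \(p_g\) (share of the sample in group \(g\)), \(\sigma_g^2\) (variance of treatment in group \(g\)), and \(\beta_g\) (mean effect in group \(g\)) is introduced here for illustration. The regression slope then converges to

\[ \hat{\beta}_1 \to \frac{Cov(X, Y)}{Var(X)} = \frac{E[\beta_i X^2]}{E[X^2]} = \frac{\sum_g p_g \sigma_g^2 \beta_g}{\sum_g p_g \sigma_g^2} \]

Each group's effect counts in proportion to its share times its treatment variance. With controls, \(\sigma_g^2\) is roughly the treatment variance left over after the controls are partialled out.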

Variance Weighting

The effect varies according to a normal distribution, with mean 5 for group A and mean 7 for group B (mean 6 overall)
Treatment \(X\) has standard deviation \(3\) in group A and \(5\) in group B. But if not for \(W\), then the sd in group A would only be \(1\).

tb <- tibble(group = sample(c('A','B'), 5000, replace = TRUE),
             W = rnorm(5000, mean = 0, sd = sqrt(8))) %>%
  mutate(beta1 = case_when(
    group == 'A' ~ rnorm(5000, mean = 5, sd = 2),
    group == 'B' ~ rnorm(5000, mean = 7, sd = 2))) %>%
  mutate(X = case_when(
    group == 'A' ~ W + rnorm(5000, mean = 0, sd = 1), # SD = sqrt(sqrt(8)^2 + 1^2) = 3
    group == 'B' ~ rnorm(5000, mean = 0, sd = 5))) %>%
  mutate(Y = beta1*X + rnorm(5000))

Heterogeneous Effects in Action

We are already identified, so let’s see what we get from a basic linear regression

	Model 1	Model 2
(Intercept)	0.160	0.114
	(0.129)	(0.131)
X	6.429***	6.635***
	(0.031)	(0.032)
W		-0.843***
		(0.047)
X × W		0.006
		(0.009)
* p < 0.1, ** p < 0.05, *** p < 0.01

So What is That?

Where’s 6.429 come from? Hmm… \((3^2\times5 + 5^2\times7)/(3^2+5^2) = 6.471\)…
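And 6.635? If controlling for \(W\) cut group A’s treatment sd all the way down to \(1\), we’d expect \((1^2\times5 + 5^2\times7)/(1^2+5^2) = 6.92\); the estimate moves in that direction, toward group B, without going all the way…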
(also, why do I need an interaction between \(X\) and \(W\) to get these exact numbers?)

Design Isn’t Destiny

The specific average treatment effect you get depends on the estimator. The design suggests one, but it’s not inherent to the design
For example, what if we just do weighted least squares, weighting by inverse treatment variance?

tb <- tb %>%
  group_by(group) %>%
  mutate(Xvar = var(X),
         Xcontrolvar = var(resid(lm(X~W))))
m3 <- lm(Y~X, data = tb, weights = 1/Xvar)
m4 <- lm(Y~X*W, data = tb, weights = 1/Xcontrolvar)
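
To see why inverse-variance weights should bring back the 6, here is a small base R sketch that works with the design values of the simulation rather than with estimates. The variable names are hypothetical, and the "share times variance" weighting is the no-controls approximation, not an exact description of every regression.

```r
# Group-level quantities taken from the simulation's design (not estimated from data)
beta   <- c(A = 5, B = 7)       # mean treatment effect in each group
sigma2 <- c(A = 3^2, B = 5^2)   # variance of treatment X in each group
share  <- c(A = 0.5, B = 0.5)   # groups are sampled with equal probability

# Ordinary regression: each group's effect is weighted by share * variance
ols_target <- weighted.mean(beta, share * sigma2)

# Weighted least squares with weights 1/variance: the variances cancel out,
# leaving only the group shares as weights
wls_target <- weighted.mean(beta, share * sigma2 / sigma2)

print(c(OLS = ols_target, WLS = wls_target))
```

The first number is the variance-weighted average (about 6.47) and the second is the plain average of 5 and 7.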

Design Isn’t Destiny

The 6 returns!

	Model 1	Model 2
(Intercept)	0.125	0.052
	(0.116)	(0.109)
X	5.966***	5.824***
	(0.032)	(0.062)
W		-0.825***
		(0.068)
X × W		0.005
		(0.008)
Num.Obs.	5000	5000
R2	0.875	0.862
R2 Adj.	0.875	0.862
AIC	35852.7	37654.9
BIC	35872.2	37687.4
Log.Lik.	-17923.333	-18822.427
F	35046.844	10409.194
* p < 0.1, ** p < 0.05, *** p < 0.01

What We Get

Let’s go through our standard methods and think about the treatment effects they give us
First one’s easy: fixed effects gives us an effect weighted by treatment variance within-individual
We can get back to an ATE by weighting by inverse treatment variance
NEXT

Difference-in-Differences

Difference-in-differences separates treated and untreated groups, and basically ensures that no treatments occur in the untreated group ever
The only treatment effects we can possibly see in the estimate come from the treated groups
We have Average Treatment on the Treated

Difference-in-Differences

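library(tidyverse)
library(fixest)

tb <- tibble(group = sample(c('Treated','Untreated'),1000, replace = TRUE),
             time = sample(1:20, 1000, replace = TRUE)) %>%
  mutate(beta1 = case_when(
    group == 'Treated' ~ 5,
    group == 'Untreated' ~ 7   # never shows up in Y: this group is never treated
  )) %>%
  # Treatment switches on for the treated group after period 10
  mutate(Treatment = (group=='Treated')*(time>10)) %>%
  mutate(Y = 3 + time + 3*(group == 'Treated') + beta1*Treatment + rnorm(1000))
# Two-way fixed effects: group and time fixed effects absorb the level differences
m <- feols(Y ~ Treatment | group + time, data = tb)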

Difference-in-Differences

	Model 1
Treatment	4.979***
	(0.005)
Num.Obs.	1000
R2	0.983
Std. errors	Clustered (group)
* p < 0.1, ** p < 0.05, *** p < 0.01
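
A noise-free version of the same setup shows why the untreated group's effect of 7 cannot appear in the estimate. This is a sketch in base R on a balanced grid (one observation per group and period, with the random noise dropped), which is a simplification of the simulation above; the helper `cell()` is a hypothetical name.

```r
# Balanced, noise-free grid: every group observed once in each of 20 periods
grid <- expand.grid(group = c("Treated", "Untreated"), time = 1:20,
                    stringsAsFactors = FALSE)
grid$beta1     <- ifelse(grid$group == "Treated", 5, 7)
grid$Treatment <- (grid$group == "Treated") * (grid$time > 10)
grid$Y <- 3 + grid$time + 3 * (grid$group == "Treated") + grid$beta1 * grid$Treatment

# Mean outcome for one group in the pre (post = FALSE) or post (post = TRUE) period
cell <- function(g, post) {
  mean(grid$Y[grid$group == g & (grid$time > 10) == post])
}

# Difference-in-differences: change for the treated minus change for the untreated
did <- (cell("Treated", TRUE) - cell("Treated", FALSE)) -
       (cell("Untreated", TRUE) - cell("Untreated", FALSE))
print(did)
```

The untreated group's `beta1` is multiplied by a `Treatment` that is always zero, so the difference-in-differences recovers only the treated group's effect.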

Difference-in-Differences

Don’t forget the importance of the estimator!
The whole reason that two-way fixed effects for varying treatment timing doesn’t work is that it gives a weird average
Where some of the effects get negative weights!

Case Methods

Skipping ahead, synthetic control and event studies also give ATT for basically the same reason
If there’s no treatment at all among a group, it’s pretty hard to include their effect at all!
Although we might more accurately say that these methods just give us the treatment effect for the single treated group, rather than any kind of “average”
Although they do (often) average over what the effect is in the post-treatment periods

Regression Discontinuity

Regression discontinuity is a design where we isolate variation driven by the jump over a cutoff
So the variation in treatment we’re allowing is only about that jump
We get the local average treatment effect - our effect is only representative of people near the cutoff who are pushed to get treatment by the cutoff
This is true for both sharp and fuzzy designs; in the fuzzy case it also depends on how much the cutoff increased your chances of treatment

Regression Discontinuity

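library(tidyverse)
library(rdrobust)

# The effect is 1 within 0.2 of the cutoff and 5 farther away
tb <- tibble(Run = runif(1000)) %>%
  mutate(beta1 = case_when(
    abs(Run-.5) < .2 ~ 1,
    abs(Run-.5) >= .2 ~ 5
  )) %>%
  # Sharp design: treatment switches on when Run crosses 0.5
  mutate(Y = Run + beta1*(Run>.5) + rnorm(1000))

m <- rdrobust(tb$Y, tb$Run, c = .5)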

Regression Discontinuity

summary(m)

## Call: rdrobust
## 
## Number of Obs.                 1000
## BW type                       mserd
## Kernel                   Triangular
## VCE method                       NN
## 
## Number of Obs.                 488         512
## Eff. Number of Obs.             91         117
## Order est. (p)                   1           1
## Order bias  (q)                  2           2
## BW est. (h)                  0.108       0.108
## BW bias (b)                  0.164       0.164
## rho (h/b)                    0.657       0.657
## Unique Obs.                    488         512
## 
## =============================================================================
##         Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
## =============================================================================
##   Conventional     0.894     0.293     3.047     0.002     [0.319 , 1.469]     
##         Robust         -         -     2.507     0.012     [0.193 , 1.576]     
## =============================================================================
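
For comparison, here is the average effect over everyone in this simulation, worked out from the code rather than estimated. Since `Run` is uniform on \((0, 1)\), the share of observations within \(0.2\) of the cutoff is \(0.4\), so

\[ ATE = 0.4\times 1 + 0.6\times 5 = 3.4 \]

The conventional estimate of \(0.894\) is close to \(1\), the effect for observations near the cutoff, and nowhere near \(3.4\). That gap is what "local" means here.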

Instrumental Variables

Instrumental variables is all about isolating the variation in treatment that is driven by an exogenous source
So… by the same logic as RDD we also get a local average treatment effect!

Instrumental Variables

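library(tidyverse)
library(fixest)

tb <- tibble(Z = rnorm(1000), W = rnorm(1000),
             group = sample(c('A','B','C'), 1000, replace = TRUE)) %>%
  # gamma1 is how strongly the instrument Z moves treatment X in each group
  mutate(gamma1 = case_when(
    group == 'A' ~ 0,
    group == 'B' ~ 1,
    group == 'C' ~ 3
  )) %>%
  # W affects both X and Y, so it is a back door we are not controlling for
  mutate(X = gamma1*Z + W + rnorm(1000)) %>%
  # beta1 is the treatment effect in each group
  mutate(beta1 = case_when(
    group == 'A' ~ 10,
    group == 'B' ~ 5,
    group == 'C' ~ 1
  )) %>%
  mutate(Y = beta1*X + W + rnorm(1000))
# Instrument X with Z
m <- feols(Y ~ 1 | X ~ Z, data = tb)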

Instrumental Variables

\((0\times10 + 1\times 5 + 3\times1)/(0+1+3) = 2\)

	Model 1
(Intercept)	0.194
	(0.244)
fit_X	2.010***
	(0.176)
Num.Obs.	1000
R2	0.411
Std. errors	Standard
* p < 0.1, ** p < 0.05, *** p < 0.01
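
The calculation above is a special case of a more general expression. As a sketch, assuming \(Z\) is independent of \(W\), of the error terms, and of group membership (all true in this simulation), and writing \(p_g\) for the share of group \(g\) (notation introduced here), the IV estimate converges to

\[ \hat{\beta}_{IV} \to \frac{Cov(Y, Z)}{Cov(X, Z)} = \frac{E[\gamma_i\beta_i]}{E[\gamma_i]} = \frac{\sum_g p_g\gamma_g\beta_g}{\sum_g p_g\gamma_g} \]

where \(\gamma_g\) is `gamma1` and \(\beta_g\) is `beta1` for group \(g\). With three equally likely groups, the \(p_g\) cancel and this reduces to \((0\times10 + 1\times5 + 3\times1)/(0+1+3) = 2\).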

Instrumental Variables

People unaffected by the instrument don’t have their treatment effect counted at all, even if they’re treated!
This is why we need to assume monotonicity - if all the effects of \(Z\) on \(X\) are in the same direction (or are \(0\)) then none of the weights are negative (if the effects are all negative, the negative signs cancel out in the ratio)
But if some go in the other direction, we have negative weights and no longer have a meaningful average
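
A small base R sketch of what goes wrong. The first-stage effects in `gamma` are hypothetical values chosen so that one group responds to the instrument in the opposite direction; the `beta` values are the ones from the IV simulation, and equal group shares are assumed.

```r
# Hypothetical first-stage effects of Z on X: group A now responds negatively
gamma <- c(A = -2, B = 1, C = 3)
beta  <- c(A = 10, B = 5, C = 1)   # treatment effects from the IV simulation

w <- gamma / sum(gamma)            # implied weights on each group's effect
iv_target <- sum(w * beta)         # what the IV estimate would converge to

print(w)
print(iv_target)
```

The weights still sum to 1, but group A's is negative, and the resulting "average" is negative even though every group's effect is between 1 and 10.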

Next Time

This is the end of material that may end up on the exam
Next time we’ll do some review
And also talk about some cool stuff that we won’t be testing on but you may want to explore - methods that let you estimate a whole distribution of treatment effects!