Lecture 22 Regression Discontinuity

Nick Huntington-Klein

March 17, 2019

Recap

  • We’ve been going over ways in which we can use control groups to isolate causal effects
  • We can select similar control groups using matching or controlling (what economists call “selection on observables”)
  • We can use a treated group at a different time as its own control group with fixed effects
  • When a treatment is applied at a particular time, we can select a reasonable control to account for the effects of time using difference-in-difference (a “natural experiment”)

Today

  • We’re going to go over one other way in which we can find and isolate a very convincing control group
  • Like DID, it’s also an example of a natural experiment
  • Regression discontinuity

Regression Discontinuity

  • For regression discontinuity to work, we need the Treatment to be assigned based on a cutoff of what’s called a “running variable”
  • For example, imagine we want to know the effects of being in a Gifted and Talented (GATE) program on your adult earnings
  • Being admitted to the program is based on your test score (running variable)
  • If you score above 75, you’re in the program. 75 or below, you’re out!

Regression Discontinuity

  • Notice that the y-axis here is In GATE, not the outcome

Regression Discontinuity

  • Here’s how it looks when we look at the actual outcome

Regression Discontinuity

  • Now, we have a bit of a problem!
  • If we look at the relationship between treatment and earnings, we’ll be picking up the fact that higher test scores make you likely to earn more anyway
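
To see how big this problem can be, here is a minimal sketch on simulated data. The data-generating process is an assumption chosen for illustration (it mirrors the kind of simulation used later in this lecture): earnings rise with test score, and GATE adds a true effect of 10. It is written in base R so it runs without extra packages.

```r
# Simulated data (an assumption for illustration, not real GATE data)
set.seed(1000)
n <- 1000
test <- runif(n) * 100
GATE <- test >= 75
# True GATE effect is 10; test score also raises earnings directly
earn <- runif(n) * 40 + 10 * GATE + test / 2

# Naive comparison: everyone in GATE vs. everyone not in GATE
naive <- mean(earn[GATE]) - mean(earn[!GATE])
naive
```

The naive difference lands far above the true effect of 10. In this setup the GATE group scores about 50 points higher on average, and each point is worth 0.5 in earnings, so roughly 25 of the gap comes from test scores rather than from GATE.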

Regression Discontinuity

  • Except, that’s not actually what the diagram looks like! Test only affects GATE to the extent that it puts you above the 75 cutoff!

Regression Discontinuity

  • What can we do with that information?
  • Well, imagine that we looked at the area just around the cutoff
  • Say, the cutoff is 75, so we look at 73 to 77
  • Within that group, it’s basically random whether you fall on one side of the line or another

Regression Discontinuity

  • Someone with a 75 is, on average, almost exactly the same as someone with a 76, except that one got the treatment and the other didn’t!
  • Heck, that tiny test score difference could be due to just having a bad day before the test
  • So we have two groups - the just-barely-missed-outs and the just-barely-made-its, that are basically exactly the same except that one happened to get treatment
  • A perfect description of what we’re looking for in a control group!

Regression Discontinuity

  • So we look directly around the cutoff, and compare just below to just above.
  • This is our way of controlling for test score and closing the GATE <- Above <- Test -> earn back door
  • Why not just control for Test in the normal way?
  • Because if we really think that, right around the cutoff, it’s random whether you’re on one side or the other, we don’t just close the Test back door, we have effectively random assignment, like an experiment!
  • We’re not just closing the Test back door, we’re closing all back doors

In Practice

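#Load the tidyverse for tibble(), mutate(), filter(), and %>%
library(tidyverse)
#Simulate test scores, GATE assignment at a cutoff of 75, and earnings (true GATE effect is 10)
rdd.data <- tibble(test = runif(1000)*100) %>%
  mutate(GATE = test >= 75) %>% mutate(earn = runif(1000)*40+10*GATE+test/2)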
#Choose a "bandwidth" of how wide around the cutoff to look (arbitrary in our example)
#Bandwidth of 2 with a cutoff of 75 means we look from 75-2 to 75+2
bandwidth <- 2
#Just look within the bandwidth
rdd <- rdd.data %>% filter(abs(75-test) < bandwidth) %>%
  #Create a variable indicating we're above the cutoff
  mutate(above = test >= 75) %>%
  #And compare our outcome just below the cutoff to just above
  group_by(above) %>% summarize(earn = mean(earn))
rdd
#Our effect looks just about right (10 is the truth)
rdd$earn[2] - rdd$earn[1]
## # A tibble: 2 x 2
##   above  earn
##   <lgl> <dbl>
## 1 FALSE  55.2
## 2 TRUE   66.0
## [1] 10.80055
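
Since the bandwidth is an arbitrary choice, it can help to see how the estimate moves as the bandwidth changes. The following is a rough sketch, not part of the lecture's own code: it re-simulates data with the same structure in base R, and the helper `rdd_estimate()` is a hypothetical name introduced only for this illustration.

```r
# Simulated data with the same structure as above (true effect is 10)
set.seed(22)
n <- 1000
test <- runif(n) * 100
GATE <- test >= 75
earn <- runif(n) * 40 + 10 * GATE + test / 2

# Hypothetical helper: difference in mean earnings just above
# vs. just below the cutoff, within a given bandwidth
rdd_estimate <- function(bandwidth, cutoff = 75) {
  near <- abs(test - cutoff) < bandwidth
  above <- test >= cutoff
  mean(earn[near & above]) - mean(earn[near & !above])
}

# Try several bandwidths, including the 2 used above
bandwidths <- c(2, 5, 10, 25)
estimates <- sapply(bandwidths, rdd_estimate)
data.frame(bandwidth = bandwidths, estimate = estimates)
```

Wider bandwidths use more observations, so the estimate is less noisy, but they also let more of the direct effect of test scores leak in (in this setup, roughly half the bandwidth in extra earnings), so the estimate drifts away from 10. Narrow bandwidths have less of that bias but rest on fewer observations.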

Graphically
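
One way to draw this, as a rough sketch on freshly simulated data: plot earnings against test score, mark the cutoff, and fit a separate straight line on each side. Fitting a line on each side (rather than comparing means) is a common variant added here as an assumption; it is not necessarily how the lecture's own figure was made. The names `sim`, `centered`, and `jump` are introduced only for this example.

```r
# Simulated data (assumed structure: true jump of 10 at a cutoff of 75)
set.seed(75)
n <- 1000
sim <- data.frame(test = runif(n) * 100)
sim$above <- as.numeric(sim$test >= 75)  # 1 if in GATE, 0 otherwise
sim$earn <- runif(n) * 40 + 10 * sim$above + sim$test / 2

# Center the running variable so the coefficient on 'above'
# is the jump in earnings exactly at the cutoff
sim$centered <- sim$test - 75
fit <- lm(earn ~ above * centered, data = sim)
jump <- unname(coef(fit)["above"])
jump

# Scatterplot with the cutoff marked and the fitted line on each side
plot(sim$test, sim$earn, col = ifelse(sim$above == 1, "blue", "red"),
     xlab = "Test score", ylab = "Earnings")
abline(v = 75, lty = 2)
below_line <- data.frame(centered = seq(-75, 0, length.out = 50), above = 0)
above_line <- data.frame(centered = seq(0, 25, length.out = 50), above = 1)
lines(below_line$centered + 75, predict(fit, newdata = below_line), lwd = 2)
lines(above_line$centered + 75, predict(fit, newdata = above_line), lwd = 2)
```

The vertical gap between the two fitted lines at the dashed cutoff is the estimated effect, stored in `jump`. Because each line has its own slope, this version accounts for the fact that earnings rise with test score even inside the bandwidth.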

Example: Corporate Social Responsibility

  • Corporate Social Responsibility (CSR) is when corporations engage in the kind of behavior that nonprofits usually do - community outreach, charity, etc.
  • Is this good for the corporation? Or would it make more sense to just send the money they spend to actual nonprofits if they just want to do good?
  • This is a causal question

Example: Corporate Social Responsibility

  • Conveniently for our purposes, CSR policies are voted on by shareholder boards
  • If a board votes 49% in favor, it fails. 51% in favor? It passes!
  • Sounds like a regression discontinuity to me!
  • “Close votes” is a common application of regression discontinuity

Example: Corporate Social Responsibility

  • So how do CSR policy announcements affect stock prices?

Example: Corporate Social Responsibility

  • Caroline Flammer studies this topic
  • Looking at the “abnormal return” (stock price return minus what’s expected given the market) comparing CSR votes that just won vs. CSR votes that just lost
  • So what should we do?
  • Focus just around the cutoff and compare abnormal returns just above and just below.

Example: Corporate Social Responsibility

Flammer (2015) Management Science

Example: Corporate Social Responsibility

  • Looks like stock returns increase by about 0.02, comparing CSR votes that just lost to those that just won!
  • Seems like the market likes seeing those CSRs and values them
  • And all those things that we might expect to correlate with both stock price growth and CSRs - tech-savvy, youthful leadership, etc., we’ve closed those back doors too!

Balance

  • Have we really closed those back doors?
  • One thing that’s so great about RDD is that, since it’s basically random whether you’re on one side of the cutoff or another, there shouldn’t be other back doors
  • We can check this by seeing if other variables differ on either side of the line
  • This is our way of testing our diagram - if our diagram is true, then above should have no relationship with any back door variable after focusing around the cutoff

Balance

rdd.data <- tibble(test = runif(500)*100) %>%
  mutate(backdoor=rnorm(500)+test/50) %>% mutate(GATE = test + backdoor >= 75) %>%
  mutate(earn = runif(500)*40+10*GATE+5*backdoor+test/2)
bandwidth <- 2
rdd <- rdd.data %>% filter(abs(75-test) < bandwidth) %>%
  #Create a variable indicating we're above the cutoff
  mutate(above = test >= 75) %>%
  #And compare the back door variable just below the cutoff to just above
  group_by(above) %>% summarize(backdoor = mean(backdoor))
rdd
## # A tibble: 2 x 2
##   above backdoor
##   <lgl>    <dbl>
## 1 FALSE     1.22
## 2 TRUE      1.57
#Not a lot of difference!
rdd$backdoor[2] - rdd$backdoor[1]
## [1] 0.3516092

Balance

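  • Notice there’s NO real difference here: a gap of about 0.35 is small next to the spread of backdoor itself (a standard deviation of about 1), indicating that we’ve closed that back door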
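
Whether a gap counts as a "real" difference depends on how noisy the comparison is. As a hedged sketch, one option (an addition here, not part of the lecture's code) is a two-sample t-test on the back door variable within the bandwidth. The data are simulated again with the same structure in base R; the sample size of 2000 is an assumption made for this illustration so that each side of the cutoff has a reasonable number of observations.

```r
# Simulated data (assumed structure, matching the balance example above)
set.seed(500)
n <- 2000
test <- runif(n) * 100
backdoor <- rnorm(n) + test / 50

# Keep only observations near the cutoff and split them by side
bandwidth <- 2
near <- abs(75 - test) < bandwidth
above <- test >= 75

# Welch two-sample t-test: does backdoor differ just above vs. just below?
balance <- t.test(backdoor[near & above], backdoor[near & !above])
balance$estimate
balance$p.value
```

A large p-value means the data are consistent with no difference in the back door variable across the cutoff. With only a handful of observations inside the bandwidth the test has little power, so a small raw gap is not strong evidence either way.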

Summing Up

  • We’ve covered four main methods of making comparisons as close as possible
  • Controlling and matching both take a set of measured variables and adjust so you’re looking at variation within those variables
  • Difference-in-difference takes a chosen comparison group and uses it to adjust for changes over time in your treated group of interest
  • Regression discontinuity uses a cutoff in a running variable to identify a treated and nontreated group that are basically randomly assigned

Summing Up

  • Next time we’ll be putting some more work into practicing and applying these methods
  • And thinking carefully about how we can use them to create an appropriate research design so we can figure out our causal effects of interest!

Practice

  • Does winning help your party stay in power 30 years later?
  • Install and load the politicaldata package, and load data(house_results)
  • Create tibbles hr76 and hr16 with only 1976 and 2016
  • Create repadv76 equal to rep vote minus dem for 1976, and filter only to those with !is.na(repadv76)
  • Create repwins16 equal to rep > dem for 2016, and filter !is.na(repwins16)
  • select() only district,repadv76, repwins16, and inner_join() the two data sets
  • Compare repwins16 mean above and below repadv76=0 with a bandwidth of .04

Practice Answers

#install.packages('politicaldata')
library(politicaldata)
data(house_results)

hr76 <- filter(house_results,year==1976) %>%
  mutate(repadv76 = rep - dem) %>%
  filter(!is.na(repadv76)) %>%
  select(district,repadv76)
hr16 <- filter(house_results,year==2016) %>%
  mutate(repwins16 = rep > dem) %>%
  filter(!is.na(repwins16)) %>%
  select(district,repwins16)

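#Join the two years by district
fulldata <- inner_join(hr76,hr16,by="district")
bandwidth <- .04

#Keep only races within the bandwidth of the repadv76=0 cutoff
fulldata %>% filter(abs(repadv76-0)<=bandwidth) %>%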
  mutate(above = repadv76 > 0) %>%
  group_by(above) %>% summarize(repwins16=mean(repwins16))
## # A tibble: 2 x 2
##   above repwins16
##   <lgl>     <dbl>
## 1 FALSE     0.737
## 2 TRUE      0.889