Lecture 27 Explaining Better - Regression

Nick Huntington-Klein

April 3, 2019

Final Exam

  • Combination of both programming and causal inference methods
  • Everything is fair game
  • There will also be a subjective question in which you take a causal question, develop a diagram, and perform the analysis with data I give you
  • Slides and dagitty will be available, no other internet

This Week

  • We’ll be doing a little bit of review this last week
  • And also talking about other ways of explaining data beyond what we’ve done
  • This last week of material will not be on the final, but it will be great prep for any upcoming class you take on this topic, or for applying the ideas you’ve learned in class in the real world

Explaining Better

  • So far, all of our methods have had to do with explaining one variable with another
  • After all, causal inference is all about looking at the effect of one variable on another
  • If that explanation is causally identified, we’re good to go
  • Or, if that explanation is on a back door of what we’re interested in, we’ll explain what we can and take it out

Explaining Better

  • The way that we’ve been explaining A with B so far:
  • Take the different values of B (if it’s continuous, use bins with cut())
  • For observations with each of those different values, take the mean of A
  • That mean is the “explained” part, the rest is the “residual”
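  • A minimal sketch of that recipe (the data and the names df, A, and B here are made-up placeholders, not from the lecture):
library(tidyverse)

#Placeholder data where B explains some of A
df <- tibble(B = runif(500)) %>%
  mutate(A = 2*B + rnorm(500))

#Bin B, take the mean of A within each bin, and split A into explained + residual
df <- df %>% group_by(B.bin = cut(B,breaks=8)) %>%
  mutate(A.explained = mean(A),
         A.residual = A - A.explained) %>% ungroup()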

Explaining Better

  • Now, this is the basic idea of explaining - what value of A can we expect, given the value of B we’re looking at?
  • But this isn’t the only way to put that idea into action!

Regression

  • The way that explaining is done most of the time is with a method called regression
  • You might be familiar with regression if you’ve taken ISDS 361A
  • But here we’re going to go more into detail on what it actually is and how it relates to causal inference

Regression

  • The idea of regression is the same as our approach to explaining - for different values of B, predict A
  • But what’s different? In regression, you impose a little more structure on that prediction
  • Specifically, when B is continuous, you require that the relationship between B and A follows a straight line
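  • As a quick sketch of what that structure means (reusing the placeholder df, A, and B from the earlier sketch, an assumption): the “explained” values from a regression are literally intercept + slope*B
#Fit the straight line and pull out its two numbers
m <- lm(A~B,data=df)
coef(m)

#The regression's explained values...
head(predict(m))
#...are exactly the straight-line formula
head(coef(m)[1] + coef(m)[2]*df$B)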

Regression

  • Let’s look at wages and experience in Belgium
library(tidyverse)
library(Ecdat)
data(Bwages)

#Explain wages with our normal method
Bwages <- Bwages %>% group_by(cut(exper,breaks=8)) %>%
  mutate(wage.explained = mean(wage)) %>% ungroup()

#Explain wages with regression
#lm(wage~exper) regresses wage on exper, and predict() gets the explained values
Bwages <- Bwages %>% 
  mutate(wage.reg.explained = predict(lm(wage~exper)))

#What's in a regression? An intercept and a slope! Like I said, it's a line.
lm(wage~exper,data=Bwages)
## 
## Call:
## lm(formula = wage ~ exper, data = Bwages)
## 
## Coefficients:
## (Intercept)        exper  
##      8.7349       0.1345

Regression

[Figure: wage against exper, comparing the binned-means explanation (wage.explained) with the regression explanation (wage.reg.explained)]

Regression

  • Okay, so it’s the same thing but it’s a straight line. Who cares?
  • Regression brings us some benefits, but also has some costs (there’s always a tradeoff…)

Regression Benefits

  • It boils the relationship down to something much simpler to explain
  • Instead of reporting eight different means, I can just give an intercept and a slope! (see the sketch below)
## 
## Call:
## lm(formula = wage ~ exper, data = Bwages)
## 
## Coefficients:
## (Intercept)        exper  
##      8.7349       0.1345
  • We can interpret this easily as “one more year of exper is associated with 0.134501 higher wages”
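  • Here’s that comparison sketched out (this assumes the Bwages code from the earlier slide has already been run):
#Eight binned means...
Bwages %>% group_by(exper.bin = cut(exper,breaks=8)) %>%
  summarize(wage.explained = mean(wage))

#...versus just an intercept and a slope
coef(lm(wage~exper,data=Bwages))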

Regression Benefits

  • This makes it much easier to explain using multiple variables at once
  • This is important when we’re doing causal inference. If we want to close multiple back doors, we have to control for multiple variables at once, which is unwieldy with our approach (see the sketch below the regression output)
  • With regression we just add another dimension to our line and add another slope! As many as we want
lm(wage~exper+educ+male,data=Bwages)
## 
## Call:
## lm(formula = wage ~ exper + educ + male, data = Bwages)
## 
## Coefficients:
## (Intercept)        exper         educ     maleTRUE  
##      1.0375       0.2006       1.9290       0.0767
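  • For contrast, here’s a rough sketch (an illustration, not from the slides) of controlling for all three variables with our binned-means approach - the number of cells to report multiplies quickly
#Binned means within every combination of exper bin, educ bin, and male
Bwages %>% group_by(cut(exper,breaks=8),cut(educ,breaks=5),male) %>%
  mutate(wage.explained = mean(wage)) %>% ungroup()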

Regression Benefits

  • The fact that we’re using a line means that we can use much more of the data
  • This increases our statistical power, and also reduces overfitting (remember that?)
  • For example, with regression discontinuity, we’ve only been using the observations just to the left and right of the cutoff
  • But this doesn’t take into account the information we have about the trend of the variable leading up to the cutoff. Doing regression discontinuity with regression can!
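  • A hedged sketch of that idea (made-up data, not the example from the slides): fit a line with a separate intercept and slope on each side of the cutoff, and the coefficient on the above-the-cutoff dummy is the estimated jump
library(tidyverse)

#Made-up running variable with a cutoff at .5 and a true jump of 2
rd <- tibble(run = runif(1000)) %>%
  mutate(above = run >= .5,
         y = run + 2*above + rnorm(1000))

#aboveTRUE is the estimated jump right at the cutoff; every observation helps pin down the slopes
lm(y~above*I(run-.5),data=rd)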

Regression Benefits

  • Take this example from regression discontinuity

Regression Benefits

  • Look how much more information we can take advantage of with regression

Regression Cons

  • One con is that it masks what it’s doing a bit - you have to learn its inner workings to really know how it operates, and what it’s sensitive to
  • There are also some statistical situations in which it can give strange results
  • For example, one problem is that it fits a straight line
  • That means that if your relationship DOESN’T follow a straight line, you’ll get bad results!

Regression Cons

[Figure: a straight regression line fit through data whose true relationship is curved]

Regression Cons

  • That said, we’re fitting a line, but it doesn’t necessarily need to be a straight line. Let’s add a quadratic term!
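  • (The slides don’t show how quaddata was built; a plausible construction - an assumption, chosen to roughly match the coefficients below - is something like this)
library(tidyverse)

#Assumed data-generating process: y is quadratic in x, plus noise
quaddata <- tibble(x = rnorm(200)) %>%
  mutate(y = x + x^2 + rnorm(200))
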
lm(y~x,data=quaddata)
## 
## Call:
## lm(formula = y ~ x, data = quaddata)
## 
## Coefficients:
## (Intercept)            x  
##      0.9884       0.8761
lm(y~x+I(x^2),data=quaddata)
## 
## Call:
## lm(formula = y ~ x + I(x^2), data = quaddata)
## 
## Coefficients:
## (Intercept)            x       I(x^2)  
##     0.02935      1.02219      0.98626

Regression Cons

[Figure: quaddata with the straight-line fit and the fit that includes the quadratic term]

Regression

  • In fact, for these reasons, despite the cons, all the methods we’ve done so far (except matching) are commonly done with regression
  • We’ll talk about this more next time
  • For now, let’s do a simulation exercise with regression
  • I want to reiterate that regression will not be on the final
  • But this will let us practice our simulation skills, which we haven’t used in a while, and we may as well do it with regression
  • Any remaining time we’ll do other final prep

Simulation Practice

  • Let’s compare regression and our method in their ability to pick up a tiny positive effect of X on Y
  • We’ll be counting how often each of them finds that positive result
  • Create blank vectors reg.pos <- c(NA) and our.pos <- c(NA) to store results
  • Create a for (i in 1:10000) { loop
  • Then, inside that loop…

Simulation Practice

  • Create a tibble df with x = runif(1000) and y = .01*x + rnorm(1000)
  • Perform regression and get the slope on x: coef(lm(y~x,data=df))[2]. Store 0 in reg.pos[i] if this is negative, and 1 if it’s positive
  • Explain y using x with our method and cut(x,breaks=2). Use summarize and store the means in our.df
  • Store 0 in our.pos[i] if our.df$y[2] - our.df$y[1] is negative, and 1 if it’s positive
  • See which method more often gets a positive result

Simulation Practice Answers

library(tidyverse)

#Vectors to store whether each simulation finds a positive result
reg.pos <- c(NA)
our.pos <- c(NA)

for (i in 1:10000) {
  #Simulate data with a tiny true positive effect of x on y
  df <- tibble(x = runif(1000)) %>%
    mutate(y = .01*x + rnorm(1000))
  
  #Regression: does the slope on x come out positive?
  reg.pos[i] <- coef(lm(y~x,data=df))[2] >= 0
  
  #Our method: is mean y higher in the top half of x than the bottom half?
  our.df <- df %>% group_by(cut(x,breaks=2)) %>%
    summarize(y = mean(y))
  
  our.pos[i] <- our.df$y[2] - our.df$y[1] > 0
}

mean(reg.pos)
## [1] 0.5382
mean(our.pos)
## [1] 0.5274