Lecture 28 Explaining Better - Regression Part 2

Nick Huntington-Klein

April 4, 2019

Final Exam

  • Reminder of good review material: all the homeworks, end-of-lecture practice slides, the midterms
  • Today we’ll also be revisiting some of the programming bits we haven’t done in a while
  • We did just do a whole review of causal inference, too. May want to check back in on that just in case
  • You’ll be fine!

Today

  • We’ll be talking briefly about how all of the causal inference methods we’ve done work in regression
  • Plus, we’ll be talking about some ways that people in other fields explain one variable with another
  • Just a little peek into what’s out there!
  • Then we’ll do a little programming recap

Causal Inference with Regression

  • Remember, regression is about explaining one variable with another by fitting a line
  • Typically a straight line, but doesn’t have to be
  • So instead of taking a mean within a particular range and having a “stair-step” kind of thing, we have a line that describes what we expect the mean of Y to be for any given value of X
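  • For instance, a minimal sketch with simulated data (all names and numbers here are made up for illustration):

set.seed(1000)
line.df <- data.frame(x = runif(100))
line.df$y <- 1 + 2*line.df$x + rnorm(100)
line.reg <- lm(y ~ x, data = line.df)

# The fitted line gives the expected mean of y at any x we pick, e.g. x = .5
predict(line.reg, newdata = data.frame(x = .5))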

  • We’re still explaining, it’s just a different shape, so everything transfers over pretty well!
  • Regression is less sensitive to random noise in the data (good, relative to our method) and lets us easily calculate standard errors (good, and something you’ll do in future classes), but it is also reliant on the shape that the data takes (bad)
  • As another note, when our X is a logical or a factor, regression and our method are exactly the same! This will make doing most of our methods easy - see the sketch below
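  • A minimal sketch of that equivalence, using simulated data (the variable names and values are made up):

set.seed(1000)
bin.df <- data.frame(x = sample(c(TRUE,FALSE), 200, replace = TRUE))
bin.df$y <- 1 + 2*bin.df$x + rnorm(200)

# Our method: the difference in mean y between the two groups
mean(bin.df$y[bin.df$x]) - mean(bin.df$y[!bin.df$x])
# Regression: the coefficient on xTRUE is the exact same number
lm(y ~ x, data = bin.df)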

Fixed Effects

  • So, controlling for a factor, or doing fixed effects? No difference! (Although with regression we’d usually read the effect straight off the regression coefficient rather than computing a correlation - see the one-step sketch after the code)
library(Ecdat)
library(dplyr)
data(Airline)

# Our method: subtract out each airline's mean by hand
AirlineOurMethod <- Airline %>% group_by(airline) %>%
  mutate(output.r = output - mean(output),
         cost.r = cost - mean(cost))

# Regression version: take residuals from regressing each variable on the factor
AirlineReg <- Airline %>%
  mutate(output.reg = residuals(lm(output~factor(airline))),
         cost.reg = residuals(lm(cost~factor(airline))))

c(cor(AirlineOurMethod$output.r,AirlineOurMethod$cost.r),
  cor(AirlineReg$output.reg,AirlineReg$cost.reg))
## [1] 0.9272297 0.9272297
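  • As mentioned above, with regression we’d usually get the effect directly in one step; a minimal sketch of that one-step version (this gives the effect as a regression slope rather than the correlation we just calculated):

# Control for airline directly; the output coefficient is the within-airline effect
lm(cost ~ output + factor(airline), data = Airline)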

Difference-in-Differences

  • Same with difference-in-differences!
load('mariel.RData')
#(some data-cleaning omitted here, see the code for the slides)

#Then we can do our difference-in-differences with our method
means <- df %>% group_by(after,miami) %>% summarize(lwage = mean(lwage),unemp=mean(unemp))
(means$lwage[4] - means$lwage[2]) - (means$lwage[3]-means$lwage[1])
## [1] 0.02740653

#Or by regression, using an "interaction term"
#The afterTRUE:miamiTRUE coefficient is the same DiD estimate
lm(lwage~after*miami,data=df)
## 
## Call:
## lm(formula = lwage ~ after * miami, data = df)
## 
## Coefficients:
##         (Intercept)            afterTRUE            miamiTRUE  
##             1.88186             -0.04606             -0.14674  
## afterTRUE:miamiTRUE  
##             0.02741

Regression Discontinuity

  • We saw last time how this works - we fit a line on either side of the cutoff and see how that line jumps at the cutoff
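  • A minimal sketch with simulated data (the cutoff at .5 and everything else here is made up): centering the running variable at the cutoff makes the treatment coefficient equal the jump right at the cutoff

set.seed(1000)
rdd.df <- data.frame(x = runif(300))
rdd.df$treated <- rdd.df$x >= .5
rdd.df$y <- 1 + .5*rdd.df$x + 2*rdd.df$treated + rnorm(300)

# Lines with separate slopes on each side; treatedTRUE is the jump at x = .5
lm(y ~ I(x - .5)*treated, data = rdd.df)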

Instrumental Variables

  • As always, we’re using only what we can explain with Z. Just now, we explain it using regression!
  • Conveniently, regression lets us skip the “explain Y with Z” step and handles that by itself automatically
  • I do a little statistical magic behind the scenes to make the correlation and the regression slope comparable, so don’t worry that the next example doesn’t give the same result as last time

Instrumental Variables

library(AER)
data(CigarettesSW)

#data-cleaning code to perform our version of IV omitted here
#(the cleaning creates priceexp, packsexp, and the cigtax variable used below)
cor(CigarettesSW$priceexp,CigarettesSW$packsexp)
## [1] -0.9711096

#And now with regression: explain X (packs) with Z (cigtax),
#then use only the part of X that Z explains
x.explained.with.z <- predict(lm(packs~cigtax,data=CigarettesSW))
lm(price~x.explained.with.z,data=CigarettesSW)
## 
## Call:
## lm(formula = price ~ x.explained.with.z, data = CigarettesSW)
## 
## Coefficients:
##        (Intercept)  x.explained.with.z  
##          3.155e-16          -1.461e+00
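  • By the way, AER’s ivreg() function does both stages in one call; a minimal sketch using the raw taxs column as the instrument (a different instrument and scaling than the cleaned variables above, so the numbers won’t match):

#Two-stage least squares in one step: price and packs, instrumented by taxs
ivreg(price ~ packs | taxs, data = CigarettesSW)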

Machine Learning

  • You may be familiar with the terms machine learning or big data
  • These seem like big, scary, magical things, but most of the time it’s just different ways of explaining one variable with another (or many others)
  • In fact, a lot of machine learning methods come down to just slightly fancier versions of regression or what we’ve been doing

Machine Learning

  • Economists are only just starting to use these methods, largely because they aren’t built with a focus on uncovering the data-generating process - they’re built to predict, not to find causal effects
  • (although there is recent causal work - check out Susan Athey)
  • But they’re great ways of incorporating lots and lots and lots of control variables or predictors
  • For example, what if you have 10000 potential control variables? Which do you use? Machine learning can help with that, or even let you use all of them at once!

Random Forests

  • Let’s discuss one machine learning tool - random forests
  • Random forests explain Y with X in a way very similar to our method - they break X up into bins and take the average of Y within each bin. They don’t fit a line.
  • In random forest terminology, this creates a “tree” with “branches” showing the bins

Random Forests

  • However, you don’t pick the bins yourself - the algorithm tries lots of possible combinations of bins and picks the one that (to simplify a bit) explains the most variance
  • Plus, it’s capable of not just picking the best bins, but also, if you have lots of variables, picking which variables to split into bins
  • We won’t go too into detail of how it works, but you can see it in action (albeit only with one variable, not showing its full strength…)

Random Forests

  • Let’s predict the price of gold using the price of silver
library(AER)
data(GoldSilver)
GoldSilver <- as.data.frame(GoldSilver)

library(randomForest)
# Fit a random forest explaining gold prices with silver prices
rf <- randomForest(gold~silver,data=GoldSilver)

# Get predictions from the random forest and from a straight regression line
GoldSilver <- GoldSilver %>%
  mutate(rf.predict = predict(rf),
         reg.predict = predict(lm(gold~silver,data=GoldSilver)))
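  • Once fit, we can also use the forest to predict for new data; a quick sketch with a made-up silver price:

# Predicted gold price at a hypothetical silver price of 5
predict(rf, newdata = data.frame(silver = 5))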

Random Forests

  • The random forest fit is a little messier, but it follows the nonlinear shape of the data better than the straight regression line does

Programming Practice

  • Create a data frame dat with:
  • a: all even numbers from 2 to 100, each appearing twice, i.e. 2, 2, 4, 4… (hint: rep())
  • b: randomly selected from ‘Hi’, ‘Hello’, and ‘Goodbye’. Make it a factor.
  • Then, arrange() the data by a
  • Add up all the values of a
  • Count how many observations are ‘Hi’ or ‘Hello’ using %in%
  • mutate to create c as a logical that is TRUE if b is ‘Goodbye’ OR if a > 90 OR if a <= 10
  • Calculate the proportion of variance in c explained by b

Programming Practice Answers

library(dplyr)

# factor() ensures b is a factor regardless of R version or settings
dat <- data.frame(a = rep(1:50*2,2),
                  b = factor(sample(c('Hi','Hello','Goodbye'),100,replace=T))) %>%
  arrange(a) %>%
  mutate(c = (b == 'Goodbye') | (a > 90) | (a <= 10))

sum(dat$a)
sum(dat$b %in% c('Hi','Hello'))

dat <- dat %>% group_by(b) %>% mutate(c.res = c - mean(c))

1 - var(dat$c.res)/var(dat$c)

Programming Practice

  • x <- rexp(3000)
  • Plot the density of x, with proper labels
  • Based on the density plot, would you expect mean(x) or median(x) to be higher?
  • Add the median and mean as vertical ablines in different colors to your plot to check
  • Create a text stargazer table to describe as.data.frame(x)
  • Separately, use quantile to get the 10th, 20th, … 100th percentile of x
  • What kind of real-world variable might be distributed like x?

Programming Practice Answers

x <- rexp(3000)

plot(density(x),xlab='X',ylab='Density',main='Distribution of X')
#Because we have a small number of huge values, the mean should be larger
abline(v=mean(x),col='blue')
abline(v=median(x),col='red')

library(stargazer)
stargazer(as.data.frame(x),type='text')

quantile(x,c(1:10/10))

#Something unequally distributed, with a few big winners, like income or wealth, might be distributed like x

Programming Practice

  • Take dat from the first practice, and add d = rnorm(100) + .3*a (may need to ungroup())
  • Create a table and a prop.table for b
  • Then, create a barplot for the count of b, and one for the proportion
  • Create a density plot of d when c == FALSE and use lines() to overlay different-colored density when c == TRUE
  • Calculate the difference in means of d between c == FALSE and c == TRUE
  • Create a plot with a on the x-axis and d on the y-axis
  • Use cut(,breaks=8) to get prop. of var. of d explained by a

Programming Practice Answers

dat <- dat %>% ungroup() %>% mutate(d = rnorm(100) + .3*a)

table(dat$b)
prop.table(table(dat$b))
barplot(table(dat$b))
barplot(prop.table(table(dat$b)))

plot(density((filter(dat,c==FALSE))$d),col='red')
lines(density((filter(dat,c==TRUE))$d),col='blue')

meandiff <- dat %>% group_by(c) %>% summarize(d = mean(d))
meandiff$d[2] - meandiff$d[1]

plot(dat$a,dat$d)

dat <- dat %>% group_by(cut(a,breaks=8)) %>% mutate(d.res = d - mean(d))
1 - var(dat$d.res)/var(dat$d)