Lecture 10: Relationships Between Variables, Part 2

Nick Huntington-Klein

February 5, 2019

Recap

  • Last time we talked about how to think about the relationship between two variables
  • We talked about dependence and correlation
  • As illustrated using proportion tables (prop.table), differences in means (group_by() %>% summarize()), correlation (cor), and graphically with scatterplots (plot(xvar,yvar)) and overlaid densities (plot(density()) followed by lines(density()))

Today

  • We’re going to be going much further into explaining
  • How can we use one variable to explain another and what does that mean?
  • One way to think about what we’re doing is to translate “how does X explain Y” as “what would I expect Y to look like, given a certain value of X?”

Explanation

  • Why do we care?
  • Explaining is a very flexible way of understanding the relationship between two variables
  • Plus, it lets us put a magnitude on these relationships
  • “How much of the variation in earnings is explained by education?”
  • “How much of the variation in earnings is not explained by education?”

Explanation

  • Plus, this will end up being very important when we get to causality
  • Think back to this graph from last time:
addata <- read.csv('http://www.nickchk.com/ad_spend_and_gdp.csv')
plot(addata$AdSpending,addata$GDP,
     xlab='US Ad Spend/Year (Mil.)',ylab='US GDP (Bil.)')

Explanation

  • We know that part of the reason for the relationship we see is inflation
  • Explanation lets us say things like “not counting the parts of ad spend and GDP that are explained by inflation, what is the relationship between ad spend and GDP?”
  • When we get into causality, this will let us isolate just the parts of the relationship we’re interested in

Simple Explanation

  • So that’s our goal - for different values of X, see what Y looks like.
  • There are lots of ways to do this - one of which is called regression and you’ll see that in later classes
  • In this class we’re going to focus on a very simple approach - simply taking the mean of Y for different values of X.

Simple Explanation

  • Basically, we’re trying to do a simpler version of this:
Local Linear Regression
Local Linear Regression

Simple Explanation

  • We already know how to calculate these means - we can do it with summarize()
  • Let’s use data on demographic info by county in data(midwest) and get the average poverty rate by county in each state
library(tidyverse)
data(midwest)
midwest %>% group_by(state) %>% summarize(percbelowpoverty = mean(percbelowpoverty))
## # A tibble: 5 x 2
##   state percbelowpoverty
##   <chr>            <dbl>
## 1 IL                13.1
## 2 IN                10.3
## 3 MI                14.2
## 4 OH                13.0
## 5 WI                11.9

Simple Explanation

  • Now we know what we’d expect a county’s poverty rate to be, based on what state it’s in.
  • It will be useful for us to have this average in the data frame itself, which we can do by using group_by() leading into mutate() instead of summarize()
midwest <- midwest %>% group_by(state) %>%
  mutate(avebystate = mean(percbelowpoverty))
head(select(midwest,state,county,percbelowpoverty,avebystate))
## # A tibble: 6 x 4
## # Groups:   state [1]
##   state county    percbelowpoverty avebystate
##   <chr> <chr>                <dbl>      <dbl>
## 1 IL    ADAMS                13.2        13.1
## 2 IL    ALEXANDER            32.2        13.1
## 3 IL    BOND                 12.1        13.1
## 4 IL    BOONE                 7.21       13.1
## 5 IL    BROWN                13.5        13.1
## 6 IL    BUREAU               10.4        13.1

Simple Explanation

  • avebystate now represents the part of poverty that is explained by state
  • Whatever’s left over must be the part unexplained by state, a.k.a “the residual”
midwest <- mutate(midwest,residual = percbelowpoverty - avebystate)
head(select(midwest,state,county,percbelowpoverty,avebystate,residual))
## # A tibble: 6 x 5
## # Groups:   state [1]
##   state county    percbelowpoverty avebystate residual
##   <chr> <chr>                <dbl>      <dbl>    <dbl>
## 1 IL    ADAMS                13.2        13.1   0.0750
## 2 IL    ALEXANDER            32.2        13.1  19.2   
## 3 IL    BOND                 12.1        13.1  -1.01  
## 4 IL    BOONE                 7.21       13.1  -5.87  
## 5 IL    BROWN                13.5        13.1   0.444 
## 6 IL    BUREAU               10.4        13.1  -2.68

Graphically

  • For each value of state, we get the mean. Any deviation from that is the residual

Graphically

How much is explained?

  • How much of poverty-by-county is explained by the state?
  • We do this by seeing how much variance is left over after our explanation
  • 0.9334186 of the variation in county-level poverty is still there after the explanation. State explained only 1-0.9334186 = 0.0665814 of it!
c(var(midwest$percbelowpoverty),var(midwest$residual))
## [1] 26.52410 24.75809
var(midwest$residual)/var(midwest$percbelowpoverty)
## [1] 0.9334186

Continuous Variables

  • The approach with summarize(X = mean(X)) works great!
  • But think about it with a continuous variable…!
midwest <- midwest %>% group_by(as.factor(percollege)) %>%
  mutate(avebycoll = mean(percbelowpoverty)) %>%
  mutate(collresidual = percbelowpoverty - avebycoll) %>% ungroup()
head(select(midwest,state,county,percbelowpoverty,avebycoll,collresidual))
## # A tibble: 6 x 5
##   state county    percbelowpoverty avebycoll collresidual
##   <chr> <chr>                <dbl>     <dbl>        <dbl>
## 1 IL    ADAMS                13.2      13.2             0
## 2 IL    ALEXANDER            32.2      32.2             0
## 3 IL    BOND                 12.1      12.1             0
## 4 IL    BOONE                 7.21      7.21            0
## 5 IL    BROWN                13.5      13.5             0
## 6 IL    BUREAU               10.4      10.4             0
1-var(midwest$collresidual)/var(midwest$percbelowpoverty)
## [1] 1

Overfitting

  • We’re going to have to do something odd - we’ll need to make our prediction worse in order to make it better
  • There are 437 obs and 437 different values of poptotal - of course I can predict it perfectly!
  • But if I got another data set, using the model I have would do a terrible job
  • Because I’ve “fit” my model so closely to the data I have

Overfitting

  • Imagine I were predicting the height of everyone in the room - first row, first chair? I predict you have the height of the person in the first row, first chair
  • I’ll predict perfectly!
  • But when the next class comes in, if I reuse the same predictions I’ll be very wrong!
  • Meaning that my “predictions” aren’t all that useful
  • I’ll do better with less flexibility - I’ll just take average height now. Bigger residuals now, but I’ll do a better job predicting the next class

Continuous Variables

  • So in this context, we’re going to do this by splitting the continuous variable X up into bins, and taking the mean of Y within those bins using mutate().
  • We can do this with the cut() function with the breaks option, which splits the continuous variable into breaks bins of equal length
midwest <- midwest %>% mutate(collbins = cut(percollege,breaks=10))
head(midwest %>% select(county,percbelowpoverty,percollege,collbins))
## # A tibble: 6 x 4
##   county    percbelowpoverty percollege collbins   
##   <chr>                <dbl>      <dbl> <fct>      
## 1 CLARK                13.4        17.8 (15.5,19.6]
## 2 ST CLAIR             10.9        17.6 (15.5,19.6]
## 3 FOREST               21.8        13.6 (11.4,15.5]
## 4 DAVIESS              15.5        13.2 (11.4,15.5]
## 5 CHAMPAIGN             8.83       13.9 (11.4,15.5]
## 6 SHEBOYGAN             6.49       20.8 (19.6,23.6]

Continuous Variables

  • Then, just like before with state, we can use mutate to take means within these bins
midwest <- midwest %>% mutate(collbins = cut(percollege,10)) %>%
  group_by(collbins) %>% mutate(avebycoll = mean(percbelowpoverty)) %>% ungroup()
head(midwest %>% select(county,percbelowpoverty,percollege,avebycoll))
## # A tibble: 6 x 4
##   county    percbelowpoverty percollege avebycoll
##   <chr>                <dbl>      <dbl>     <dbl>
## 1 CLARK                13.4        17.8      11.5
## 2 ST CLAIR             10.9        17.6      11.5
## 3 FOREST               21.8        13.6      13.9
## 4 DAVIESS              15.5        13.2      13.9
## 5 CHAMPAIGN             8.83       13.9      13.9
## 6 SHEBOYGAN             6.49       20.8      10.5

Continuous Variables

  • Let’s see our relationship, what’s explained, and what’s not!
plot(midwest$percollege,midwest$percbelowpoverty,xlab="Percent College",ylab="Percent below Poverty")
points(midwest$percollege,midwest$avebycoll,col='red')

Continuous Variables

  • And so how much can we explain of poverty with college percentage in bins?
var(midwest$percbelowpoverty-midwest$avebycoll)
## [1] 22.04112
var(midwest$percbelowpoverty)
## [1] 26.5241
1-var(midwest$percbelowpoverty-midwest$avebycoll)/var(midwest$percbelowpoverty)
## [1] 0.1690152

Continuous Variables

  • Note our approach is sensitive to how many bins we pick. So how many is right?
  • More bins = more explained, but more overfitting risk.
  • In this class we’re going to be a little arbitrary. But there are good ways to pick bins, and other non-bin ways to do this (future classes!)
for (brks in c(2,10,20,50)) {
  print(1-var(midwest$percbelowpoverty-
                (midwest %>% group_by(cut(percollege,breaks=brks)) %>%
                   mutate(avebycoll = mean(percbelowpoverty)))$avebycoll)/
          var(midwest$percbelowpoverty))
}
## [1] 0.005783466
## [1] 0.1690152
## [1] 0.2114173
## [1] 0.3168701

Practice

  • Get the BudgetFood (rename:BF) data from Ecdat and examine it
  • Explain wfood with town and then with totexp (breaks=10). For each:
  • Create a variable with the predicted/explained values
  • Calculate the residuals
  • Show the proportion of variance explained
  • plot the raw data and add red points for the explained values
  • For totexp, also do a plot of residuals with a red horizontal line at 0

Practice answers

library(Ecdat)
data(BudgetFood)
help(BudgetFood)
str(BudgetFood)
BF <- BudgetFood

BF <- BF %>%
  group_by(town) %>%
  mutate(avebytown = mean(wfood)) %>%
  mutate(townresid = wfood - avebytown)
1-var(BF$townresid)/var(BF$wfood)
plot(BF$town,BF$wfood,xlab='Town',ylab='Food as Pct. of Expenditure')
points(BF$town,BF$avebytown,col='red')

BF <- BF %>%
  mutate(expbins = cut(totexp,breaks=10)) %>%
  group_by(expbins) %>%
  mutate(avebyexp = mean(wfood)) %>%
  mutate(expresid = wfood - avebyexp)
1-var(BF$expresid)/var(BF$wfood)
plot(BF$totexp,BF$wfood,xlab='Total Expenditure',ylab='Food as Pct. of Expenditure')
points(BF$totexp,BF$avebyexp,col='red')

plot(BF$totexp,BF$expresid,xlab='Total Expenditure',ylab='Residuals')
abline(0,0,col='red')