Lecture 10: Relationships Between Variables, Part 2

Nick Huntington-Klein

February 5, 2019

Recap

Last time we talked about how to think about the relationship between two variables
We talked about dependence and correlation
As illustrated using proportion tables (prop.table), differences in means (group_by() %>% summarize()), correlation (cor), and graphically with scatterplots (plot(xvar,yvar)) and overlaid densities (plot(density()) followed by lines(density()))

Today

We’re going to be going much further into explaining
How can we use one variable to explain another and what does that mean?
One way to think about what we’re doing is to translate “how does X explain Y” as “what would I expect Y to look like, given a certain value of X?”

Explanation

Why do we care?
Explaining is a very flexible way of understanding the relationship between two variables
Plus, it lets us put a magnitude on these relationships
“How much of the variation in earnings is explained by education?”
“How much of the variation in earnings is not explained by education?”

Explanation

Plus, this will end up being very important when we get to causality
Think back to this graph from last time:

addata <- read.csv('http://www.nickchk.com/ad_spend_and_gdp.csv')
plot(addata$AdSpending,addata$GDP,
     xlab='US Ad Spend/Year (Mil.)',ylab='US GDP (Bil.)')

Explanation

We know that part of the reason for the relationship we see is inflation
Explanation lets us say things like “not counting the parts of ad spend and GDP that are explained by inflation, what is the relationship between ad spend and GDP?”
When we get into causality, this will let us isolate just the parts of the relationship we’re interested in

Simple Explanation

So that’s our goal - for different values of X, see what Y looks like.
There are lots of ways to do this - one of which is called regression and you’ll see that in later classes
In this class we’re going to focus on a very simple approach - simply taking the mean of Y for different values of X.

Simple Explanation

Basically, we’re trying to do a simpler version of this:

Simple Explanation

We already know how to calculate these means - we can do it with summarize()
Let’s use data on demographic info by county in data(midwest) and get the average poverty rate by county in each state

library(tidyverse)
data(midwest)
midwest %>% group_by(state) %>% summarize(percbelowpoverty = mean(percbelowpoverty))

## # A tibble: 5 x 2
##   state percbelowpoverty
##   <chr>            <dbl>
## 1 IL                13.1
## 2 IN                10.3
## 3 MI                14.2
## 4 OH                13.0
## 5 WI                11.9

Simple Explanation

Now we know what we’d expect a county’s poverty rate to be, based on what state it’s in.
It will be useful for us to have this average in the data frame itself, which we can do by using group_by() leading into mutate() instead of summarize()

midwest <- midwest %>% group_by(state) %>%
  mutate(avebystate = mean(percbelowpoverty))
head(select(midwest,state,county,percbelowpoverty,avebystate))

## # A tibble: 6 x 4
## # Groups:   state [1]
##   state county    percbelowpoverty avebystate
##   <chr> <chr>                <dbl>      <dbl>
## 1 IL    ADAMS                13.2        13.1
## 2 IL    ALEXANDER            32.2        13.1
## 3 IL    BOND                 12.1        13.1
## 4 IL    BOONE                 7.21       13.1
## 5 IL    BROWN                13.5        13.1
## 6 IL    BUREAU               10.4        13.1

Simple Explanation

avebystate now represents the part of poverty that is explained by state
Whatever’s left over must be the part unexplained by state, a.k.a “the residual”

midwest <- mutate(midwest,residual = percbelowpoverty - avebystate)
head(select(midwest,state,county,percbelowpoverty,avebystate,residual))

## # A tibble: 6 x 5
## # Groups:   state [1]
##   state county    percbelowpoverty avebystate residual
##   <chr> <chr>                <dbl>      <dbl>    <dbl>
## 1 IL    ADAMS                13.2        13.1   0.0750
## 2 IL    ALEXANDER            32.2        13.1  19.2   
## 3 IL    BOND                 12.1        13.1  -1.01  
## 4 IL    BOONE                 7.21       13.1  -5.87  
## 5 IL    BROWN                13.5        13.1   0.444 
## 6 IL    BUREAU               10.4        13.1  -2.68

Graphically

For each value of state, we get the mean. Any deviation from that is the residual

Graphically

How much is explained?

How much of poverty-by-county is explained by the state?
We do this by seeing how much variance is left over after our explanation
0.9334186 of the variation in county-level poverty is still there after the explanation. State explained only 1-0.9334186 = 0.0665814 of it!

c(var(midwest$percbelowpoverty),var(midwest$residual))

## [1] 26.52410 24.75809

var(midwest$residual)/var(midwest$percbelowpoverty)

## [1] 0.9334186

Continuous Variables

The approach with summarize(X = mean(X)) works great!
But think about it with a continuous variable…!

midwest <- midwest %>% group_by(as.factor(percollege)) %>%
  mutate(avebycoll = mean(percbelowpoverty)) %>%
  mutate(collresidual = percbelowpoverty - avebycoll) %>% ungroup()
head(select(midwest,state,county,percbelowpoverty,avebycoll,collresidual))

## # A tibble: 6 x 5
##   state county    percbelowpoverty avebycoll collresidual
##   <chr> <chr>                <dbl>     <dbl>        <dbl>
## 1 IL    ADAMS                13.2      13.2             0
## 2 IL    ALEXANDER            32.2      32.2             0
## 3 IL    BOND                 12.1      12.1             0
## 4 IL    BOONE                 7.21      7.21            0
## 5 IL    BROWN                13.5      13.5             0
## 6 IL    BUREAU               10.4      10.4             0

1-var(midwest$collresidual)/var(midwest$percbelowpoverty)

## [1] 1

Overfitting

We’re going to have to do something odd - we’ll need to make our prediction worse in order to make it better
There are 437 obs and 437 different values of poptotal - of course I can predict it perfectly!
But if I got another data set, using the model I have would do a terrible job
Because I’ve “fit” my model so closely to the data I have

Overfitting

Imagine I were predicting the height of everyone in the room - first row, first chair? I predict you have the height of the person in the first row, first chair
I’ll predict perfectly!
But when the next class comes in, if I reuse the same predictions I’ll be very wrong!
Meaning that my “predictions” aren’t all that useful
I’ll do better with less flexibility - I’ll just take average height now. Bigger residuals now, but I’ll do a better job predicting the next class

Continuous Variables

So in this context, we’re going to do this by splitting the continuous variable X up into bins, and taking the mean of Y within those bins using mutate().
We can do this with the cut() function with the breaks option, which splits the continuous variable into breaks bins of equal length

midwest <- midwest %>% mutate(collbins = cut(percollege,breaks=10))
head(midwest %>% select(county,percbelowpoverty,percollege,collbins))

## # A tibble: 6 x 4
##   county    percbelowpoverty percollege collbins   
##   <chr>                <dbl>      <dbl> <fct>      
## 1 CLARK                13.4        17.8 (15.5,19.6]
## 2 ST CLAIR             10.9        17.6 (15.5,19.6]
## 3 FOREST               21.8        13.6 (11.4,15.5]
## 4 DAVIESS              15.5        13.2 (11.4,15.5]
## 5 CHAMPAIGN             8.83       13.9 (11.4,15.5]
## 6 SHEBOYGAN             6.49       20.8 (19.6,23.6]

Continuous Variables

Then, just like before with state, we can use mutate to take means within these bins

midwest <- midwest %>% mutate(collbins = cut(percollege,10)) %>%
  group_by(collbins) %>% mutate(avebycoll = mean(percbelowpoverty)) %>% ungroup()
head(midwest %>% select(county,percbelowpoverty,percollege,avebycoll))

## # A tibble: 6 x 4
##   county    percbelowpoverty percollege avebycoll
##   <chr>                <dbl>      <dbl>     <dbl>
## 1 CLARK                13.4        17.8      11.5
## 2 ST CLAIR             10.9        17.6      11.5
## 3 FOREST               21.8        13.6      13.9
## 4 DAVIESS              15.5        13.2      13.9
## 5 CHAMPAIGN             8.83       13.9      13.9
## 6 SHEBOYGAN             6.49       20.8      10.5

Continuous Variables

Let’s see our relationship, what’s explained, and what’s not!

plot(midwest$percollege,midwest$percbelowpoverty,xlab="Percent College",ylab="Percent below Poverty")
points(midwest$percollege,midwest$avebycoll,col='red')

Continuous Variables

And so how much can we explain of poverty with college percentage in bins?

var(midwest$percbelowpoverty-midwest$avebycoll)

## [1] 22.04112

var(midwest$percbelowpoverty)

## [1] 26.5241

1-var(midwest$percbelowpoverty-midwest$avebycoll)/var(midwest$percbelowpoverty)

## [1] 0.1690152

Continuous Variables

Note our approach is sensitive to how many bins we pick. So how many is right?
More bins = more explained, but more overfitting risk.
In this class we’re going to be a little arbitrary. But there are good ways to pick bins, and other non-bin ways to do this (future classes!)

for (brks in c(2,10,20,50)) {
  print(1-var(midwest$percbelowpoverty-
                (midwest %>% group_by(cut(percollege,breaks=brks)) %>%
                   mutate(avebycoll = mean(percbelowpoverty)))$avebycoll)/
          var(midwest$percbelowpoverty))
}

## [1] 0.005783466
## [1] 0.1690152
## [1] 0.2114173
## [1] 0.3168701

Practice

Get the BudgetFood (rename:BF) data from Ecdat and examine it
Explain wfood with town and then with totexp (breaks=10). For each:
Create a variable with the predicted/explained values
Calculate the residuals
Show the proportion of variance explained
plot the raw data and add red points for the explained values
For totexp, also do a plot of residuals with a red horizontal line at 0

Practice answers

library(Ecdat)
data(BudgetFood)
help(BudgetFood)
str(BudgetFood)
BF <- BudgetFood

BF <- BF %>%
  group_by(town) %>%
  mutate(avebytown = mean(wfood)) %>%
  mutate(townresid = wfood - avebytown)
1-var(BF$townresid)/var(BF$wfood)
plot(BF$town,BF$wfood,xlab='Town',ylab='Food as Pct. of Expenditure')
points(BF$town,BF$avebytown,col='red')

BF <- BF %>%
  mutate(expbins = cut(totexp,breaks=10)) %>%
  group_by(expbins) %>%
  mutate(avebyexp = mean(wfood)) %>%
  mutate(expresid = wfood - avebyexp)
1-var(BF$expresid)/var(BF$wfood)
plot(BF$totexp,BF$wfood,xlab='Total Expenditure',ylab='Food as Pct. of Expenditure')
points(BF$totexp,BF$avebyexp,col='red')

plot(BF$totexp,BF$expresid,xlab='Total Expenditure',ylab='Residuals')
abline(0,0,col='red')