Lecture 6: Working with Data Part 2

Nick Huntington-Klein

January 23, 2019

Recap

We can get data.frames by making them with data.frame(), or reading in data with data() or read.csv
data.frames are a list of vectors - we know vectors!
We can pull the vectors back out with $
We can assign new variables, or update them, using $ as well

Today

We are going to continue working with data.frames/tibbles
And we’re going to introduce an important aspect of data analysis: splitting the data
In other words, selecting only part of the data that we have
In other words, to subset the data

Why?

Why would we want to do this?
Many statistical questions require us to!
We might be interested in how a variable differs for two different groups
Or how one variable is related to another (i.e. how A looks for different values of B)
Or how those relationships differ for different groups

Example

Let’s read in some data on male heights, from Our World in Data, and look at it
Always look at the data before you use it!
It has height in CM, let’s change that to feet

df <- read.csv('http://www.nickchk.com/average-height-men-OWID.csv')
str(df)

## 'data.frame':    1250 obs. of  4 variables:
##  $ Entity  : Factor w/ 152 levels "Afghanistan",..: 1 1 1 2 2 2 3 3 3 4 ...
##  $ Code    : Factor w/ 152 levels "AFG","AGO","ALB",..: 1 1 1 3 3 3 41 41 41 2 ...
##  $ Year    : int  1870 1880 1930 1880 1890 1900 1910 1920 1930 1810 ...
##  $ Heightcm: num  168 166 167 170 170 ...

df <- df %>% mutate(Heightft = Heightcm/30.48)

Example

If we look at height overall we will see that mean height is 167.64728
But if we look at the data we can see that some countries aren’t present every year. So this isn’t exactly representative

table(df$Year)

## 
## 1810 1820 1830 1840 1850 1860 1870 1880 1890 1900 1910 1920 1930 1940 1950 
##   35   28   29   41   49   59   72   80   82   86   80   85   91   72   91 
## 1960 1970 1980 
##   89   88   93

So let’s just pick some countries that we DO have the full range of years on: the UK, Germany, France, the Congo, Gabon, and Nigeria!

Example

If we limit the data just to those three countries, we can see that the average height in these three, which covers 1810s-1980s evenly, is 168.8619048
What if we want to compare the countries to each other? We need to split off each country by itself (let’s convert to feet, too).

## # A tibble: 6 x 2
##   Entity  Heightft
##   <chr>      <dbl>
## 1 Congo       5.44
## 2 France      5.52
## 3 Gabon       5.52
## 4 Germany     5.61
## 5 Nigeria     5.48
## 6 UK          5.60

Example

What questions does this answer?
- What is average height of men over this time period?
- How does average height differ across countries?
What can’t we answer yet?
- How has height changed over time?
- What causes these height differences [later!]

Example

If we want to know how height changed over time, we need to evaluate each year separately too.

Example

To compare the changes over time ACROSS countries, we evaluate separately by each year AND each country

Example

You can see how subsetting can tell a much more complete story than looking at the aggregated data!
So how can we do this subsetting?
There are plenty of ways
For today we’re going to focus on the filter() and select() commands.
We’ll also learn to write a for loop.

Subset

The filter() and select() commands will allow you to pick just certain parts of your data
You can select certain rows/observations using logicals with filter()
And you can select certain columns/variables with select()
The syntax is:

data.frame %>% filter(logical.for.rows)
filter(data.frame, logical.for.rows)
data.frame %>% select(variables,you,want)
select(data.frame,variables,you,want)

Subset for observations

Let’s start by selecting rows from our data
We do this by creating a logical - filter will choose all the observations for which that logical is true!
Here’s the logical I used to pick those six countries: Code %in% c('GBR','DEU','FRA','GAB','COD','NGA')
This will be equal to TRUE if the Code variable is %in% that list of six I gave

Subset for observations

We can use this logical with the filter command in dplyr - note I don’t need to store the logical as a variable first

str(df)

## 'data.frame':    1250 obs. of  5 variables:
##  $ Entity  : Factor w/ 152 levels "Afghanistan",..: 1 1 1 2 2 2 3 3 3 4 ...
##  $ Code    : Factor w/ 152 levels "AFG","AGO","ALB",..: 1 1 1 3 3 3 41 41 41 2 ...
##  $ Year    : int  1870 1880 1930 1880 1890 1900 1910 1920 1930 1810 ...
##  $ Heightcm: num  168 166 167 170 170 ...
##  $ Heightft: num  5.52 5.44 5.47 5.58 5.57 ...

dfsubset <- df %>% filter(Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'))
str(dfsubset)

## 'data.frame':    84 obs. of  5 variables:
##  $ Entity  : Factor w/ 152 levels "Afghanistan",..: 38 38 38 38 38 38 38 38 38 38 ...
##  $ Code    : Factor w/ 152 levels "AFG","AGO","ALB",..: 28 28 28 28 28 28 28 28 28 28 ...
##  $ Year    : int  1810 1870 1880 1890 1910 1920 1930 1950 1960 1970 ...
##  $ Heightcm: num  163 166 163 163 165 ...
##  $ Heightft: num  5.36 5.44 5.35 5.33 5.42 ...

Subset for observations

Minor note: see that it still retains the original factor levels. Generally this is what we want. If not, you can “relevel” the factor by re-declaring it as a factor, using only the levels we have:

dfsubset <- dfsubset %>% mutate(Code=factor(dfsubset$Code))
str(dfsubset)

## 'data.frame':    84 obs. of  5 variables:
##  $ Entity  : Factor w/ 152 levels "Afghanistan",..: 38 38 38 38 38 38 38 38 38 38 ...
##  $ Code    : Factor w/ 6 levels "COD","DEU","FRA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year    : int  1810 1870 1880 1890 1910 1920 1930 1950 1960 1970 ...
##  $ Heightcm: num  163 166 163 163 165 ...
##  $ Heightft: num  5.36 5.44 5.35 5.33 5.42 ...

While we’re at it we might want Entity to be a character variable (why?)

dfsubset <- dfsubset %>% mutate(Entity = as.character(dfsubset$Entity))

Subset for observations

Everything we know about constructing logicals can work here
We can also combine multiple variables when doing this
What if we want to see the dataset for 1980 only for these six?

#filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA') & Year == 1980)
filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'),Year == 1980)

##                         Entity Code Year Heightcm Heightft
## 1 Democratic Republic of Congo  COD 1980    167.5 5.495407
## 2                       France  FRA 1980    176.5 5.790682
## 3                        Gabon  GAB 1980    169.7 5.567585
## 4                      Germany  DEU 1980    180.5 5.921916
## 5                      Nigeria  NGA 1980    169.5 5.561024
## 6               United Kingdom  GBR 1980    176.8 5.800525

Subset for observations

We can treat the subset like any normal data frame
Let’s get the mean height in these countries in 1980 and 1810
Can do it directly like this, or assign to a new data.frame like we did with dfsubset and use that

mean(filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'),
            Year == 1980)$Heightft)

## [1] 5.689523

mean(filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'), 
            Year == 1810)$Heightft)

## [1] 5.442367

Subset for variables

Subsetting for variables is easy! Just use select() with a vector or list of variables you want!
Or you can do - a vector of variables you DON’T want!
We don’t need Heightcm, let’s get rid of it

Subset for variables

str(dfsubset %>% select(Entity,Code,Year,Heightft))

## 'data.frame':    84 obs. of  4 variables:
##  $ Entity  : chr  "Democratic Republic of Congo" "Democratic Republic of Congo" "Democratic Republic of Congo" "Democratic Republic of Congo" ...
##  $ Code    : Factor w/ 6 levels "COD","DEU","FRA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year    : int  1810 1870 1880 1890 1910 1920 1930 1950 1960 1970 ...
##  $ Heightft: num  5.36 5.44 5.35 5.33 5.42 ...

str(dfsubset %>% select(-c(Heightcm)))

## 'data.frame':    84 obs. of  4 variables:
##  $ Entity  : chr  "Democratic Republic of Congo" "Democratic Republic of Congo" "Democratic Republic of Congo" "Democratic Republic of Congo" ...
##  $ Code    : Factor w/ 6 levels "COD","DEU","FRA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year    : int  1810 1870 1880 1890 1910 1920 1930 1950 1960 1970 ...
##  $ Heightft: num  5.36 5.44 5.35 5.33 5.42 ...

Subset for both!

We can do both at the same time, chaining one to the other

dfsubset %>% filter(Year == 1980) %>%
  select(Entity,Heightft)

##                         Entity Heightft
## 1 Democratic Republic of Congo 5.495407
## 2                       France 5.790682
## 3                        Gabon 5.567585
## 4                      Germany 5.921916
## 5                      Nigeria 5.561024
## 6               United Kingdom 5.800525

Practice

Get the dataset mtcars using the data() function
Look at it with str() and help()
Limit the dataset to just the variables mpg, cyl, and hp
Get the mean hp for cars at or above the median value of cyl
Get the mean hp for cars below the median value of cyl
Do the same for mpg instead of hp
Calculate the difference between above-median and below-median mpg and hp
How do you interpret these differences?

Practice answers

data(mtcars)
help(mtcars)
str(mtcars)
mtcars <- mtcars %>% select(mpg,cyl,hp)

mean(filter(mtcars,cyl >= median(cyl))$hp)
mean(filter(mtcars,cyl < median(cyl))$hp)
mean(filter(mtcars,cyl >= median(cyl))$hp) - mean(filter(mtcars,cyl < median(cyl))$hp)

mean(filter(mtcars,cyl >= median(cyl))$mpg)
mean(filter(mtcars,cyl < median(cyl))$mpg)
mean(filter(mtcars,cyl >= median(cyl))$mpg) - mean(filter(mtcars,cyl < median(cyl))$mpg)

For loops

Sometimes we want to subset things in many different ways
Typing everything out over and over is a waste of time!
You don’t understand how powerful computers are until you’ve written a for loop
This is an incredibly standard programming tool
R has another way of writing loops using the apply family, but we’re not going to go there

For loops

The basic idea of a for loop is that you have a vector of values, and you have an “iterator” variable
You go through the vector one by one, setting that iterator variable to each value one at a time
Then you run a chunk of that code with the iterator variable set

For loops

In R the syntax is

for (iteratorvariable in vector) {
  code chunk
}

For loops

Let’s rewrite our hp and mpg differences with a for loop. Heck, let’s do a bunch more variables too. (don’t forget to get data(mtcars) again - why?)
Note if we want it to display results inside a loop we need print()
Also, unfortunately, if we’re looping over variables, $ won’t work - we need to use [[]]

data(mtcars)
abovemed <- mtcars %>% filter(cyl >= median(cyl))
belowmed <- mtcars %>% filter(cyl < median(cyl))
for (i in c('mpg','disp','hp','wt')) {
  print(mean(abovemed[[i]])-mean(belowmed[[i]]))
}

data(mtcars)
abovemed <- mtcars %>% filter(cyl >= median(cyl))
belowmed <- mtcars %>% filter(cyl < median(cyl))
for (i in c('mpg','disp','hp','wt')) {
  print(mean(abovemed[[i]])-mean(belowmed[[i]]))
}

## [1] -10.01602
## [1] 191.3684
## [1] 97.60173
## [1] 1.419463

For loops

It’s also sometimes useful to loop over different values
Let’s get the average height by country, like before
unique() can give us the levels to loop over

unique(dfsubset$Entity)

## [1] "Democratic Republic of Congo" "France"                      
## [3] "Gabon"                        "Germany"                     
## [5] "Nigeria"                      "United Kingdom"

for (countryn in unique(dfsubset$Entity)) {
  print(countryn)
  print(mean(filter(dfsubset,Entity==countryn)$Heightft))
}

## [1] "Democratic Republic of Congo"
## [1] 5.439633
## [1] "France"
## [1] 5.520195
## [1] "Gabon"
## [1] 5.515092
## [1] "Germany"
## [1] 5.612059
## [1] "Nigeria"
## [1] 5.475175
## [1] "United Kingdom"
## [1] 5.602399

For loop practice

Get back the full mtcars again
Use unique() to see the different values of cyl
Use unique to loop over those values and get median mpg within each level of cyl
Use : to construct a vector to repeat the same loop
[Hard!] Use paste0 to print out “The median mpg for cyl = # is X” where # is the iterator number and X is the answer.

For loop practice answers

data(mtcars)
unique(mtcars$cyl)
for (c in unique(mtcars$cyl)) {
  print(median(filter(mtcars,cyl==c)$mpg))
}
for (c in 2:4*2) {
  print(median(filter(mtcars,cyl==c)$mpg))
}
for (c in unique(mtcars$cyl)) {
  print(paste0(c("The median mpg for cyl = ",c,
                 " is ",median(filter(mtcars,cyl==c)$mpg)),
               collapse=''))
}
#Printing out just the last one:

## [1] "The median mpg for cyl = 6 is 19.7"
## [1] "The median mpg for cyl = 4 is 26"
## [1] "The median mpg for cyl = 8 is 15.2"