Lecture 6: Working with Data Part 2

Nick Huntington-Klein

January 23, 2019

Recap

  • We can get data.frames by making them with data.frame(), or reading in data with data() or read.csv
  • data.frames are a list of vectors - we know vectors!
  • We can pull the vectors back out with $
  • We can assign new variables, or update them, using $ as well

Today

  • We are going to continue working with data.frames/tibbles
  • And we’re going to introduce an important aspect of data analysis: splitting the data
  • In other words, selecting only part of the data that we have
  • In other words, to subset the data

Why?

  • Why would we want to do this?
  • Many statistical questions require us to!
  • We might be interested in how a variable differs for two different groups
  • Or how one variable is related to another (i.e. how A looks for different values of B)
  • Or how those relationships differ for different groups

Example

  • Let’s read in some data on male heights, from Our World in Data, and look at it
  • Always look at the data before you use it!
  • It has height in CM, let’s change that to feet
df <- read.csv('http://www.nickchk.com/average-height-men-OWID.csv')
str(df)
## 'data.frame':    1250 obs. of  4 variables:
##  $ Entity  : Factor w/ 152 levels "Afghanistan",..: 1 1 1 2 2 2 3 3 3 4 ...
##  $ Code    : Factor w/ 152 levels "AFG","AGO","ALB",..: 1 1 1 3 3 3 41 41 41 2 ...
##  $ Year    : int  1870 1880 1930 1880 1890 1900 1910 1920 1930 1810 ...
##  $ Heightcm: num  168 166 167 170 170 ...
df <- df %>% mutate(Heightft = Heightcm/30.48)

Example

  • If we look at height overall we will see that mean height is 167.64728
  • But if we look at the data we can see that some countries aren’t present every year. So this isn’t exactly representative
table(df$Year)
## 
## 1810 1820 1830 1840 1850 1860 1870 1880 1890 1900 1910 1920 1930 1940 1950 
##   35   28   29   41   49   59   72   80   82   86   80   85   91   72   91 
## 1960 1970 1980 
##   89   88   93
  • So let’s just pick some countries that we DO have the full range of years on: the UK, Germany, France, the Congo, Gabon, and Nigeria!

Example

  • If we limit the data just to those three countries, we can see that the average height in these three, which covers 1810s-1980s evenly, is 168.8619048
  • What if we want to compare the countries to each other? We need to split off each country by itself (let’s convert to feet, too).
## # A tibble: 6 x 2
##   Entity  Heightft
##   <chr>      <dbl>
## 1 Congo       5.44
## 2 France      5.52
## 3 Gabon       5.52
## 4 Germany     5.61
## 5 Nigeria     5.48
## 6 UK          5.60

Example

  • What questions does this answer?
    • What is average height of men over this time period?
    • How does average height differ across countries?
  • What can’t we answer yet?
    • How has height changed over time?
    • What causes these height differences [later!]

Example

  • If we want to know how height changed over time, we need to evaluate each year separately too.

Example

  • To compare the changes over time ACROSS countries, we evaluate separately by each year AND each country

Example

  • You can see how subsetting can tell a much more complete story than looking at the aggregated data!
  • So how can we do this subsetting?
  • There are plenty of ways
  • For today we’re going to focus on the filter() and select() commands.
  • We’ll also learn to write a for loop.

Subset

  • The filter() and select() commands will allow you to pick just certain parts of your data
  • You can select certain rows/observations using logicals with filter()
  • And you can select certain columns/variables with select()
  • The syntax is:
data.frame %>% filter(logical.for.rows)
filter(data.frame, logical.for.rows)
data.frame %>% select(variables,you,want)
select(data.frame,variables,you,want)

Subset for observations

  • Let’s start by selecting rows from our data
  • We do this by creating a logical - filter will choose all the observations for which that logical is true!
  • Here’s the logical I used to pick those six countries: Code %in% c('GBR','DEU','FRA','GAB','COD','NGA')
  • This will be equal to TRUE if the Code variable is %in% that list of six I gave

Subset for observations

  • We can use this logical with the filter command in dplyr - note I don’t need to store the logical as a variable first
str(df)
## 'data.frame':    1250 obs. of  5 variables:
##  $ Entity  : Factor w/ 152 levels "Afghanistan",..: 1 1 1 2 2 2 3 3 3 4 ...
##  $ Code    : Factor w/ 152 levels "AFG","AGO","ALB",..: 1 1 1 3 3 3 41 41 41 2 ...
##  $ Year    : int  1870 1880 1930 1880 1890 1900 1910 1920 1930 1810 ...
##  $ Heightcm: num  168 166 167 170 170 ...
##  $ Heightft: num  5.52 5.44 5.47 5.58 5.57 ...
dfsubset <- df %>% filter(Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'))
str(dfsubset)
## 'data.frame':    84 obs. of  5 variables:
##  $ Entity  : Factor w/ 152 levels "Afghanistan",..: 38 38 38 38 38 38 38 38 38 38 ...
##  $ Code    : Factor w/ 152 levels "AFG","AGO","ALB",..: 28 28 28 28 28 28 28 28 28 28 ...
##  $ Year    : int  1810 1870 1880 1890 1910 1920 1930 1950 1960 1970 ...
##  $ Heightcm: num  163 166 163 163 165 ...
##  $ Heightft: num  5.36 5.44 5.35 5.33 5.42 ...

Subset for observations

  • Minor note: see that it still retains the original factor levels. Generally this is what we want. If not, you can “relevel” the factor by re-declaring it as a factor, using only the levels we have:
dfsubset <- dfsubset %>% mutate(Code=factor(dfsubset$Code))
str(dfsubset)
## 'data.frame':    84 obs. of  5 variables:
##  $ Entity  : Factor w/ 152 levels "Afghanistan",..: 38 38 38 38 38 38 38 38 38 38 ...
##  $ Code    : Factor w/ 6 levels "COD","DEU","FRA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year    : int  1810 1870 1880 1890 1910 1920 1930 1950 1960 1970 ...
##  $ Heightcm: num  163 166 163 163 165 ...
##  $ Heightft: num  5.36 5.44 5.35 5.33 5.42 ...
  • While we’re at it we might want Entity to be a character variable (why?)
dfsubset <- dfsubset %>% mutate(Entity = as.character(dfsubset$Entity))

Subset for observations

  • Everything we know about constructing logicals can work here
  • We can also combine multiple variables when doing this
  • What if we want to see the dataset for 1980 only for these six?
#filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA') & Year == 1980)
filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'),Year == 1980)
##                         Entity Code Year Heightcm Heightft
## 1 Democratic Republic of Congo  COD 1980    167.5 5.495407
## 2                       France  FRA 1980    176.5 5.790682
## 3                        Gabon  GAB 1980    169.7 5.567585
## 4                      Germany  DEU 1980    180.5 5.921916
## 5                      Nigeria  NGA 1980    169.5 5.561024
## 6               United Kingdom  GBR 1980    176.8 5.800525

Subset for observations

  • We can treat the subset like any normal data frame
  • Let’s get the mean height in these countries in 1980 and 1810
  • Can do it directly like this, or assign to a new data.frame like we did with dfsubset and use that
mean(filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'),
            Year == 1980)$Heightft)
## [1] 5.689523
mean(filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'), 
            Year == 1810)$Heightft)
## [1] 5.442367

Subset for variables

  • Subsetting for variables is easy! Just use select() with a vector or list of variables you want!
  • Or you can do - a vector of variables you DON’T want!
  • We don’t need Heightcm, let’s get rid of it

Subset for variables

str(dfsubset %>% select(Entity,Code,Year,Heightft))
## 'data.frame':    84 obs. of  4 variables:
##  $ Entity  : chr  "Democratic Republic of Congo" "Democratic Republic of Congo" "Democratic Republic of Congo" "Democratic Republic of Congo" ...
##  $ Code    : Factor w/ 6 levels "COD","DEU","FRA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year    : int  1810 1870 1880 1890 1910 1920 1930 1950 1960 1970 ...
##  $ Heightft: num  5.36 5.44 5.35 5.33 5.42 ...
str(dfsubset %>% select(-c(Heightcm)))
## 'data.frame':    84 obs. of  4 variables:
##  $ Entity  : chr  "Democratic Republic of Congo" "Democratic Republic of Congo" "Democratic Republic of Congo" "Democratic Republic of Congo" ...
##  $ Code    : Factor w/ 6 levels "COD","DEU","FRA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year    : int  1810 1870 1880 1890 1910 1920 1930 1950 1960 1970 ...
##  $ Heightft: num  5.36 5.44 5.35 5.33 5.42 ...

Subset for both!

  • We can do both at the same time, chaining one to the other
dfsubset %>% filter(Year == 1980) %>%
  select(Entity,Heightft)
##                         Entity Heightft
## 1 Democratic Republic of Congo 5.495407
## 2                       France 5.790682
## 3                        Gabon 5.567585
## 4                      Germany 5.921916
## 5                      Nigeria 5.561024
## 6               United Kingdom 5.800525

Practice

  • Get the dataset mtcars using the data() function
  • Look at it with str() and help()
  • Limit the dataset to just the variables mpg, cyl, and hp
  • Get the mean hp for cars at or above the median value of cyl
  • Get the mean hp for cars below the median value of cyl
  • Do the same for mpg instead of hp
  • Calculate the difference between above-median and below-median mpg and hp
  • How do you interpret these differences?

Practice answers

data(mtcars)
help(mtcars)
str(mtcars)
mtcars <- mtcars %>% select(mpg,cyl,hp)

mean(filter(mtcars,cyl >= median(cyl))$hp)
mean(filter(mtcars,cyl < median(cyl))$hp)
mean(filter(mtcars,cyl >= median(cyl))$hp) - mean(filter(mtcars,cyl < median(cyl))$hp)

mean(filter(mtcars,cyl >= median(cyl))$mpg)
mean(filter(mtcars,cyl < median(cyl))$mpg)
mean(filter(mtcars,cyl >= median(cyl))$mpg) - mean(filter(mtcars,cyl < median(cyl))$mpg)

For loops

  • Sometimes we want to subset things in many different ways
  • Typing everything out over and over is a waste of time!
  • You don’t understand how powerful computers are until you’ve written a for loop
  • This is an incredibly standard programming tool
  • R has another way of writing loops using the apply family, but we’re not going to go there

For loops

  • The basic idea of a for loop is that you have a vector of values, and you have an “iterator” variable
  • You go through the vector one by one, setting that iterator variable to each value one at a time
  • Then you run a chunk of that code with the iterator variable set

For loops

  • In R the syntax is
for (iteratorvariable in vector) {
  code chunk
}

For loops

  • Let’s rewrite our hp and mpg differences with a for loop. Heck, let’s do a bunch more variables too. (don’t forget to get data(mtcars) again - why?)
  • Note if we want it to display results inside a loop we need print()
  • Also, unfortunately, if we’re looping over variables, $ won’t work - we need to use [[]]
data(mtcars)
abovemed <- mtcars %>% filter(cyl >= median(cyl))
belowmed <- mtcars %>% filter(cyl < median(cyl))
for (i in c('mpg','disp','hp','wt')) {
  print(mean(abovemed[[i]])-mean(belowmed[[i]]))
}
data(mtcars)
abovemed <- mtcars %>% filter(cyl >= median(cyl))
belowmed <- mtcars %>% filter(cyl < median(cyl))
for (i in c('mpg','disp','hp','wt')) {
  print(mean(abovemed[[i]])-mean(belowmed[[i]]))
}
## [1] -10.01602
## [1] 191.3684
## [1] 97.60173
## [1] 1.419463

For loops

  • It’s also sometimes useful to loop over different values
  • Let’s get the average height by country, like before
  • unique() can give us the levels to loop over
unique(dfsubset$Entity)
## [1] "Democratic Republic of Congo" "France"                      
## [3] "Gabon"                        "Germany"                     
## [5] "Nigeria"                      "United Kingdom"
for (countryn in unique(dfsubset$Entity)) {
  print(countryn)
  print(mean(filter(dfsubset,Entity==countryn)$Heightft))
}
## [1] "Democratic Republic of Congo"
## [1] 5.439633
## [1] "France"
## [1] 5.520195
## [1] "Gabon"
## [1] 5.515092
## [1] "Germany"
## [1] 5.612059
## [1] "Nigeria"
## [1] 5.475175
## [1] "United Kingdom"
## [1] 5.602399

For loop practice

  • Get back the full mtcars again
  • Use unique() to see the different values of cyl
  • Use unique to loop over those values and get median mpg within each level of cyl
  • Use : to construct a vector to repeat the same loop
  • [Hard!] Use paste0 to print out “The median mpg for cyl = # is X” where # is the iterator number and X is the answer.

For loop practice answers

data(mtcars)
unique(mtcars$cyl)
for (c in unique(mtcars$cyl)) {
  print(median(filter(mtcars,cyl==c)$mpg))
}
for (c in 2:4*2) {
  print(median(filter(mtcars,cyl==c)$mpg))
}
for (c in unique(mtcars$cyl)) {
  print(paste0(c("The median mpg for cyl = ",c,
                 " is ",median(filter(mtcars,cyl==c)$mpg)),
               collapse=''))
}
#Printing out just the last one:
## [1] "The median mpg for cyl = 6 is 19.7"
## [1] "The median mpg for cyl = 4 is 26"
## [1] "The median mpg for cyl = 8 is 15.2"