Lecture 6: Working with Data Part 2

Nick Huntington-Klein

January 23, 2019

Recap

  • We can get data.frames by making them with data.frame(), or by reading in data with data() or read.csv()
  • data.frames are a list of vectors - we know vectors!
  • We can pull the vectors back out with $
  • We can assign new variables, or update them, using $ as well
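  • A minimal sketch of the recap points, assuming a made-up data.frame (the name toy and all of its values are invented for illustration and are not part of the lecture data):

```r
# Toy data: names and values are invented for illustration only
toy <- data.frame(name = c("a", "b", "c"),
                  heightcm = c(160, 170, 180))

# Pull a vector back out with $
heights <- toy$heightcm

# Create a new variable with $ (there are 30.48 cm in a foot)
toy$heightft <- toy$heightcm / 30.48

# Update an existing variable with $
toy$name <- toupper(toy$name)
```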

Today

  • We are going to continue working with data.frames/tibbles
  • And we’re going to introduce an important aspect of data analysis: splitting the data
  • In other words, selecting only part of the data that we have, i.e. taking a subset of the data
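  • A rough sketch of what selecting part of the data can look like in base R. The data.frame toy and its columns are hypothetical, made up only to show the idea:

```r
# Toy data: values are invented for illustration only
toy <- data.frame(group = c("x", "x", "y", "y"),
                  value = c(1, 2, 3, 4))

# Keep only the rows where a condition is TRUE
only_x <- toy[toy$group == "x", ]

# subset() expresses the same kind of selection with less typing
big_values <- subset(toy, value > 2)
```

  • Both lines return a smaller data.frame containing just the rows that meet the condition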

Why?

  • Why would we want to do this?
  • Many statistical questions require us to!
  • We might be interested in how a variable differs for two different groups
  • Or how one variable is related to another (i.e. how A looks for different values of B)
  • Or how those relationships differ for different groups

Example

  • Let’s read in some data on male heights, from Our World in Data, and look at it
  • Always look at the data before you use it!
  • It has height in cm; let’s convert that to feet (1 foot = 30.48 cm)
library(dplyr)  # provides %>% and mutate(), used below
df <- read.csv('http://www.nickchk.com/average-height-men-OWID.csv')
str(df)
## 'data.frame':    1250 obs. of  4 variables:
##  $ Entity  : Factor w/ 152 levels "Afghanistan",..: 1 1 1 2 2 2 3 3 3 4 ...
##  $ Code    : Factor w/ 152 levels "AFG","AGO","ALB",..: 1 1 1 3 3 3 41 41 41 2 ...
##  $ Year    : int  1870 1880 1930 1880 1890 1900 1910 1920 1930 1810 ...
##  $ Heightcm: num  168 166 167 170 170 ...
df <- df %>% mutate(Heightft = Heightcm/30.48)

Example

  • If we look at height overall, we see that mean height is 167.64728 cm
  • But if we look at the data, we can see that some countries aren’t present in every year, so this mean isn’t exactly representative
table(df$Year)
## 
## 1810 1820 1830 1840 1850 1860 1870 1880 1890 1900 1910 1920 1930 1940 1950 
##   35   28   29   41   49   59   72   80   82   86   80   85   91   72   91 
## 1960 1970 1980 
##   89   88   93
  • So let’s just pick some countries that we DO have the full range of years on: the UK, Germany, France, the Congo, Gabon, and Nigeria!
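  • The slides do not show the code for this step, so here is one possible sketch using dplyr's filter() with %in%. The toy data stands in for df and its values are invented; the country spellings in keep are an assumption about how they appear in Entity:

```r
library(dplyr)

# Toy stand-in for df: rows and heights are invented for illustration only
toy <- data.frame(Entity = c("France", "France", "UK", "UK", "Japan"),
                  Year = c(1810, 1820, 1810, 1820, 1820),
                  Heightcm = c(164, 165, 166, 167, 158))

# Assumed labels: the exact spellings used in Entity are a guess
keep <- c("UK", "Germany", "France", "Congo", "Gabon", "Nigeria")

# Keep only the rows whose Entity is one of the chosen countries
toy_sub <- toy %>% filter(Entity %in% keep)

# Mean height in the subset versus in all rows
mean_sub <- mean(toy_sub$Heightcm)
mean_all <- mean(toy$Heightcm)
```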

Example

  • If we limit the data to just those six countries, we can see that the average height across them, which covers the 1810s-1980s evenly, is 168.8619048 cm
  • What if we want to compare the countries to each other? We need to split off each country by itself (let’s convert to feet, too).
## # A tibble: 6 x 2
##   Entity  Heightft
##   <chr>      <dbl>
## 1 Congo       5.44
## 2 France      5.52
## 3 Gabon       5.52
## 4 Germany     5.61
## 5 Nigeria     5.48
## 6 UK          5.60
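  • A table like this could plausibly be produced with group_by() and summarize() from dplyr. This is only a sketch on invented toy data, not necessarily the code behind the slide:

```r
library(dplyr)

# Toy stand-in for the subset: heights are invented for illustration only
toy <- data.frame(Entity = c("France", "France", "UK", "UK"),
                  Heightcm = c(164, 166, 167, 169))

by_country <- toy %>%
  mutate(Heightft = Heightcm / 30.48) %>%  # convert cm to feet
  group_by(Entity) %>%                     # split the data by country
  summarize(Heightft = mean(Heightft))     # one mean per country
```

  • group_by() does the splitting, and summarize() then works on each country's rows separately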

Example

  • What questions does this answer?
    • What is the average height of men over this time period?
    • How does average height differ across countries?
  • What can’t we answer yet?
    • How has height changed over time?
    • What causes these height differences? [later!]

Example

  • If we want to know how height changed over time, we need to evaluate each year separately too.
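  • One way to sketch this, assuming dplyr and an invented toy data set, is to group by Year instead of by country:

```r
library(dplyr)

# Toy data: heights are invented for illustration only
toy <- data.frame(Entity = c("France", "UK", "France", "UK"),
                  Year = c(1810, 1810, 1980, 1980),
                  Heightcm = c(164, 166, 176, 178))

# One mean per year, averaging over the countries present in that year
by_year <- toy %>%
  group_by(Year) %>%
  summarize(Heightcm = mean(Heightcm))
```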

Example

  • To compare the changes over time ACROSS countries, we evaluate separately by each year AND each country
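  • A hedged sketch of grouping by both variables at once, again on invented toy data. The second step (the change from the first to the last year within each country) is an illustrative addition, not something taken from the slides:

```r
library(dplyr)

# Toy data: heights are invented for illustration only
toy <- data.frame(Entity = c("France", "France", "UK", "UK"),
                  Year = c(1810, 1980, 1810, 1980),
                  Heightcm = c(164, 176, 166, 177))

# One value per country-year combination
by_both <- toy %>%
  group_by(Entity, Year) %>%
  summarize(Heightcm = mean(Heightcm)) %>%
  ungroup()

# Change from the earliest to the latest year, separately for each country
change <- by_both %>%
  arrange(Entity, Year) %>%
  group_by(Entity) %>%
  summarize(change_cm = last(Heightcm) - first(Heightcm))
```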