Lecture 6: Working with Data Part 2
Nick Huntington-Klein
January 23, 2019
Recap
- We can get
data.frame
s by making them with data.frame()
, or reading in data with data()
or read.csv
data.frame
s are a list of vectors - we know vectors!
- We can pull the vectors back out with
$
- We can assign new variables, or update them, using
$
as well
Today
- We are going to continue working with
data.frame
s/tibble
s
- And we’re going to introduce an important aspect of data analysis: splitting the data
- In other words, selecting only part of the data that we have
- In other words, to subset the data
Why?
- Why would we want to do this?
- Many statistical questions require us to!
- We might be interested in how a variable differs for two different groups
- Or how one variable is related to another (i.e. how A looks for different values of B)
- Or how those relationships differ for different groups
Example
- Let’s read in some data on male heights, from Our World in Data, and look at it
- Always look at the data before you use it!
- It has height in CM, let’s change that to feet
df <- read.csv('http://www.nickchk.com/average-height-men-OWID.csv')
str(df)
## 'data.frame': 1250 obs. of 4 variables:
## $ Entity : Factor w/ 152 levels "Afghanistan",..: 1 1 1 2 2 2 3 3 3 4 ...
## $ Code : Factor w/ 152 levels "AFG","AGO","ALB",..: 1 1 1 3 3 3 41 41 41 2 ...
## $ Year : int 1870 1880 1930 1880 1890 1900 1910 1920 1930 1810 ...
## $ Heightcm: num 168 166 167 170 170 ...
df <- df %>% mutate(Heightft = Heightcm/30.48)
Example
- If we look at height overall we will see that mean height is 167.64728
- But if we look at the data we can see that some countries aren’t present every year. So this isn’t exactly representative
##
## 1810 1820 1830 1840 1850 1860 1870 1880 1890 1900 1910 1920 1930 1940 1950
## 35 28 29 41 49 59 72 80 82 86 80 85 91 72 91
## 1960 1970 1980
## 89 88 93
- So let’s just pick some countries that we DO have the full range of years on: the UK, Germany, France, the Congo, Gabon, and Nigeria!
Example
- If we limit the data just to those three countries, we can see that the average height in these three, which covers 1810s-1980s evenly, is 168.8619048
- What if we want to compare the countries to each other? We need to split off each country by itself (let’s convert to feet, too).
## # A tibble: 6 x 2
## Entity Heightft
## <chr> <dbl>
## 1 Congo 5.44
## 2 France 5.52
## 3 Gabon 5.52
## 4 Germany 5.61
## 5 Nigeria 5.48
## 6 UK 5.60
Example
- What questions does this answer?
- What is average height of men over this time period?
- How does average height differ across countries?
- What can’t we answer yet?
- How has height changed over time?
- What causes these height differences [later!]
Example
- If we want to know how height changed over time, we need to evaluate each year separately too.
Example
- To compare the changes over time ACROSS countries, we evaluate separately by each year AND each country
Example
- You can see how subsetting can tell a much more complete story than looking at the aggregated data!
- So how can we do this subsetting?
- There are plenty of ways
- For today we’re going to focus on the
filter()
and select()
commands.
- We’ll also learn to write a
for
loop.
Subset
- The
filter()
and select()
commands will allow you to pick just certain parts of your data
- You can select certain rows/observations using logicals with
filter()
- And you can select certain columns/variables with
select()
- The syntax is:
data.frame %>% filter(logical.for.rows)
filter(data.frame, logical.for.rows)
data.frame %>% select(variables,you,want)
select(data.frame,variables,you,want)
Subset for observations
- Let’s start by selecting rows from our data
- We do this by creating a logical -
filter
will choose all the observations for which that logical is true!
- Here’s the logical I used to pick those six countries:
Code %in% c('GBR','DEU','FRA','GAB','COD','NGA')
- This will be equal to
TRUE
if the Code
variable is %in%
that list of six I gave
Subset for observations
- We can use this logical with the
filter
command in dplyr - note I don’t need to store the logical as a variable first
## 'data.frame': 1250 obs. of 5 variables:
## $ Entity : Factor w/ 152 levels "Afghanistan",..: 1 1 1 2 2 2 3 3 3 4 ...
## $ Code : Factor w/ 152 levels "AFG","AGO","ALB",..: 1 1 1 3 3 3 41 41 41 2 ...
## $ Year : int 1870 1880 1930 1880 1890 1900 1910 1920 1930 1810 ...
## $ Heightcm: num 168 166 167 170 170 ...
## $ Heightft: num 5.52 5.44 5.47 5.58 5.57 ...
dfsubset <- df %>% filter(Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'))
str(dfsubset)
## 'data.frame': 84 obs. of 5 variables:
## $ Entity : Factor w/ 152 levels "Afghanistan",..: 38 38 38 38 38 38 38 38 38 38 ...
## $ Code : Factor w/ 152 levels "AFG","AGO","ALB",..: 28 28 28 28 28 28 28 28 28 28 ...
## $ Year : int 1810 1870 1880 1890 1910 1920 1930 1950 1960 1970 ...
## $ Heightcm: num 163 166 163 163 165 ...
## $ Heightft: num 5.36 5.44 5.35 5.33 5.42 ...
Subset for observations
- Minor note: see that it still retains the original factor levels. Generally this is what we want. If not, you can “relevel” the factor by re-declaring it as a factor, using only the levels we have:
dfsubset <- dfsubset %>% mutate(Code=factor(dfsubset$Code))
str(dfsubset)
## 'data.frame': 84 obs. of 5 variables:
## $ Entity : Factor w/ 152 levels "Afghanistan",..: 38 38 38 38 38 38 38 38 38 38 ...
## $ Code : Factor w/ 6 levels "COD","DEU","FRA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : int 1810 1870 1880 1890 1910 1920 1930 1950 1960 1970 ...
## $ Heightcm: num 163 166 163 163 165 ...
## $ Heightft: num 5.36 5.44 5.35 5.33 5.42 ...
- While we’re at it we might want
Entity
to be a character variable (why?)
dfsubset <- dfsubset %>% mutate(Entity = as.character(dfsubset$Entity))
Subset for observations
- Everything we know about constructing logicals can work here
- We can also combine multiple variables when doing this
- What if we want to see the dataset for 1980 only for these six?
#filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA') & Year == 1980)
filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'),Year == 1980)
## Entity Code Year Heightcm Heightft
## 1 Democratic Republic of Congo COD 1980 167.5 5.495407
## 2 France FRA 1980 176.5 5.790682
## 3 Gabon GAB 1980 169.7 5.567585
## 4 Germany DEU 1980 180.5 5.921916
## 5 Nigeria NGA 1980 169.5 5.561024
## 6 United Kingdom GBR 1980 176.8 5.800525
Subset for observations
- We can treat the subset like any normal data frame
- Let’s get the mean height in these countries in 1980 and 1810
- Can do it directly like this, or assign to a new
data.frame
like we did with dfsubset
and use that
mean(filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'),
Year == 1980)$Heightft)
## [1] 5.689523
mean(filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'),
Year == 1810)$Heightft)
## [1] 5.442367
Subset for variables
- Subsetting for variables is easy! Just use
select()
with a vector or list of variables you want!
- Or you can do
-
a vector of variables you DON’T want!
- We don’t need Heightcm, let’s get rid of it
Subset for variables
str(dfsubset %>% select(Entity,Code,Year,Heightft))
## 'data.frame': 84 obs. of 4 variables:
## $ Entity : chr "Democratic Republic of Congo" "Democratic Republic of Congo" "Democratic Republic of Congo" "Democratic Republic of Congo" ...
## $ Code : Factor w/ 6 levels "COD","DEU","FRA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : int 1810 1870 1880 1890 1910 1920 1930 1950 1960 1970 ...
## $ Heightft: num 5.36 5.44 5.35 5.33 5.42 ...
str(dfsubset %>% select(-c(Heightcm)))
## 'data.frame': 84 obs. of 4 variables:
## $ Entity : chr "Democratic Republic of Congo" "Democratic Republic of Congo" "Democratic Republic of Congo" "Democratic Republic of Congo" ...
## $ Code : Factor w/ 6 levels "COD","DEU","FRA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : int 1810 1870 1880 1890 1910 1920 1930 1950 1960 1970 ...
## $ Heightft: num 5.36 5.44 5.35 5.33 5.42 ...
Subset for both!
- We can do both at the same time, chaining one to the other
dfsubset %>% filter(Year == 1980) %>%
select(Entity,Heightft)
## Entity Heightft
## 1 Democratic Republic of Congo 5.495407
## 2 France 5.790682
## 3 Gabon 5.567585
## 4 Germany 5.921916
## 5 Nigeria 5.561024
## 6 United Kingdom 5.800525
Practice
- Get the dataset
mtcars
using the data()
function
- Look at it with
str()
and help()
- Limit the dataset to just the variables
mpg, cyl
, and hp
- Get the mean
hp
for cars at or above the median value of cyl
- Get the mean
hp
for cars below the median value of cyl
- Do the same for
mpg
instead of hp
- Calculate the difference between above-median and below-median
mpg
and hp
- How do you interpret these differences?
Practice answers
data(mtcars)
help(mtcars)
str(mtcars)
mtcars <- mtcars %>% select(mpg,cyl,hp)
mean(filter(mtcars,cyl >= median(cyl))$hp)
mean(filter(mtcars,cyl < median(cyl))$hp)
mean(filter(mtcars,cyl >= median(cyl))$hp) - mean(filter(mtcars,cyl < median(cyl))$hp)
mean(filter(mtcars,cyl >= median(cyl))$mpg)
mean(filter(mtcars,cyl < median(cyl))$mpg)
mean(filter(mtcars,cyl >= median(cyl))$mpg) - mean(filter(mtcars,cyl < median(cyl))$mpg)
For loops
- Sometimes we want to subset things in many different ways
- Typing everything out over and over is a waste of time!
- You don’t understand how powerful computers are until you’ve written a
for
loop
- This is an incredibly standard programming tool
- R has another way of writing loops using the
apply
family, but we’re not going to go there
For loops
- The basic idea of a
for
loop is that you have a vector of values, and you have an “iterator” variable
- You go through the vector one by one, setting that iterator variable to each value one at a time
- Then you run a chunk of that code with the iterator variable set
For loops
for (iteratorvariable in vector) {
code chunk
}
For loops
- Let’s rewrite our
hp
and mpg
differences with a for loop. Heck, let’s do a bunch more variables too. (don’t forget to get data(mtcars)
again - why?)
- Note if we want it to display results inside a loop we need
print()
- Also, unfortunately, if we’re looping over variables,
$
won’t work - we need to use [[]]
data(mtcars)
abovemed <- mtcars %>% filter(cyl >= median(cyl))
belowmed <- mtcars %>% filter(cyl < median(cyl))
for (i in c('mpg','disp','hp','wt')) {
print(mean(abovemed[[i]])-mean(belowmed[[i]]))
}
data(mtcars)
abovemed <- mtcars %>% filter(cyl >= median(cyl))
belowmed <- mtcars %>% filter(cyl < median(cyl))
for (i in c('mpg','disp','hp','wt')) {
print(mean(abovemed[[i]])-mean(belowmed[[i]]))
}
## [1] -10.01602
## [1] 191.3684
## [1] 97.60173
## [1] 1.419463
For loops
- It’s also sometimes useful to loop over different values
- Let’s get the average height by country, like before
unique()
can give us the levels to loop over
## [1] "Democratic Republic of Congo" "France"
## [3] "Gabon" "Germany"
## [5] "Nigeria" "United Kingdom"
for (countryn in unique(dfsubset$Entity)) {
print(countryn)
print(mean(filter(dfsubset,Entity==countryn)$Heightft))
}
## [1] "Democratic Republic of Congo"
## [1] 5.439633
## [1] "France"
## [1] 5.520195
## [1] "Gabon"
## [1] 5.515092
## [1] "Germany"
## [1] 5.612059
## [1] "Nigeria"
## [1] 5.475175
## [1] "United Kingdom"
## [1] 5.602399
For loop practice
- Get back the full
mtcars
again
- Use
unique()
to see the different values of cyl
- Use
unique
to loop over those values and get median mpg
within each level of cyl
- Use
:
to construct a vector to repeat the same loop
- [Hard!] Use
paste0
to print out “The median mpg for cyl = # is X” where # is the iterator number and X is the answer.
For loop practice answers
data(mtcars)
unique(mtcars$cyl)
for (c in unique(mtcars$cyl)) {
print(median(filter(mtcars,cyl==c)$mpg))
}
for (c in 2:4*2) {
print(median(filter(mtcars,cyl==c)$mpg))
}
for (c in unique(mtcars$cyl)) {
print(paste0(c("The median mpg for cyl = ",c,
" is ",median(filter(mtcars,cyl==c)$mpg)),
collapse=''))
}
#Printing out just the last one:
## [1] "The median mpg for cyl = 6 is 19.7"
## [1] "The median mpg for cyl = 4 is 26"
## [1] "The median mpg for cyl = 8 is 15.2"