Lecture 5: Working with Data Part 1

Nick Huntington-Klein

January 18, 2019

Working with Data

R is all about working with data!
Today we’re going to start going over the use of data.frames and tibbles
data.frames are an object type; tibbles are basically data.frames with some extra bells and whistles, from the tidyverse package
Most of the time, you’ll be doing calculations using them

The Basic Idea

Conceptually, data.frames are basically spreadsheets
Technically, they’re a list of vectors
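
To see the "list of vectors" idea for yourself, here is a minimal sketch. The name example_df and its columns are made up for illustration only.

```r
# 'example_df' is a made-up data frame used only for this sketch
example_df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))

is.list(example_df)   # TRUE: a data.frame is a list under the hood
length(example_df)    # 2: one list element per column
example_df[[1]]       # the first list element is the vector 1 2 3
```

Each column is one element of the list, which is why all the columns have to be the same length.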

[Figure: Spreadsheet | data.frame]

Example

It’s a list of vectors… we can make one by listing some (same-length) vectors!
(Note the use of = here, not <-. Inside data.frame(), = gives each column its name)

df <- data.frame(RacePosition = 1:5,
                 WayTheySayHi = as.factor(c('Hi','Hello','Hey','Yo','Hi')),
                 NumberofKids = c(3,5,1,0,2))
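# tibble() comes from the tidyverse, so library(tidyverse) must be loaded first (see Packages, below)
df <- tibble(RacePosition = 1:5,
             WayTheySayHi = as.factor(c('Hi','Hello','Hey','Yo','Hi')),
             NumberofKids = c(3,5,1,0,2))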
df

## # A tibble: 5 x 3
##   RacePosition WayTheySayHi NumberofKids
##          <int> <fct>               <dbl>
## 1            1 Hi                      3
## 2            2 Hello                   5
## 3            3 Hey                     1
## 4            4 Yo                      0
## 5            5 Hi                      2

Looking Over Data

Now that we have our data, how can we take a look at it?
We can just name it in the Console and look at the whole thing, but that’s usually too much data
We can look at the whole thing by clicking on it in Environment to open it up

Glancing at Data

What if we just want a quick overview, rather than looking at the whole spreadsheet?
- Down-arrow in the Environment tab
- head() (look at the head of the data - first six rows)
- str() (structure)

str(df)

## Classes 'tbl_df', 'tbl' and 'data.frame':    5 obs. of  3 variables:
##  $ RacePosition: int  1 2 3 4 5
##  $ WayTheySayHi: Factor w/ 4 levels "Hello","Hey",..: 3 1 2 4 3
##  $ NumberofKids: num  3 5 1 0 2
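
head() is more useful on a bigger data set than our five-row df, so here is a quick sketch on a made-up 100-row data frame (long_df is a hypothetical name, not something from the lecture data).

```r
# 'long_df' is a made-up 100-row data frame, used only for illustration
long_df <- data.frame(id = 1:100, squared = (1:100)^2)

first_six   <- head(long_df)      # first six rows by default
first_three <- head(long_df, 3)   # or ask for a specific number of rows
first_three
```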

So What?

What do we want to know about our data?
- What is this data OF? (won’t get that with str())
- Data types
- The kinds of values it takes
- How many observations
- Variable names
- Summary statistics and observation level (we’ll get to that later)
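
Most of the items on this list can be checked with a base-R function each. A rough sketch, assuming a small made-up data frame called kids_df:

```r
# 'kids_df' is a made-up data frame used only for this sketch
kids_df <- data.frame(RacePosition = 1:5,
                      NumberofKids = c(3, 5, 1, 0, 2))

n_obs     <- nrow(kids_df)            # how many observations
var_names <- names(kids_df)           # variable names
var_types <- sapply(kids_df, class)   # data type of each column
summary(kids_df)                      # the kinds of values each variable takes
```

What the data is actually OF still has to come from documentation, not from any function.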

Getting at Data

Now we have a data frame, df. How do we use it?
One way is that we can pull those vectors back out with $! Note autocompletion of variable names.
We can treat it just like the vectors we had before

df$NumberofKids

## [1] 3 5 1 0 2

df$NumberofKids[2]

## [1] 5

df$NumberofKids >= 3

## [1]  TRUE  TRUE FALSE FALSE FALSE

Quick Note

There are actually many many ways to do this
(some of which I even go over in the videos)
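For example, you can use [row,column] to get at data: df$NumberofKids >= 3 gives the same TRUE/FALSE values as df[,3] >= 3 or df[,'NumberofKids'] >= 3 (with a tibble, [row,column] hands back a one-column tibble rather than a plain vector; df[[3]] always gives the vector)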
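
Here is a small sketch of the bracket approach. It assumes a plain data.frame with a made-up name, bracket_df, rather than the df from the slides.

```r
# 'bracket_df' is a made-up plain data.frame used only for this sketch
bracket_df <- data.frame(RacePosition = 1:5,
                         NumberofKids = c(3, 5, 1, 0, 2))

one_value   <- bracket_df[2, 'NumberofKids']       # row 2 of that column
by_dollar   <- bracket_df$NumberofKids >= 3        # the $ way
by_position <- bracket_df[, 2] >= 3                # column picked by number
by_name     <- bracket_df[['NumberofKids']] >= 3   # column picked by name
```

All three logical vectors come out the same, so pick whichever is easiest to read.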

That Said!

We can run the same calculations on these vectors as we were doing before

mean(df$RacePosition)

## [1] 3

df$WayTheySayHi[4]

## [1] Yo
## Levels: Hello Hey Hi Yo

sum(df$NumberofKids <= 1)

## [1] 2

Practice

Create df2 <- data.frame(a = 1:20, b = 0:19*2, c = sample(101:200,20,replace=TRUE))
What is the average of c?
What is the sum of a times b?
Did you get any values of c 103 or below? (make a logical)
What is on the 8th row of b?
How many rows have b above 10 AND c below 150?

Practice Answers

mean(df2$c)
sum(df2$a*df2$b)
sum(df2$c <= 103) > 0
df2$b[8]
sum(df2$b > 10 & df2$c < 150)

## [1] 152.7

## [1] 5320

## [1] FALSE

## [1] 14

## [1] 6

The Importance of Rows

So far we’ve basically just taken data frames and pulled the vectors (columns) back out
So… why not just stick with the vectors?
Because before long we’re not just going to be interested in the columns one at a time
We’ll want to keep track of each row - each row is an observation. The same observation!

The Importance of Rows

Going back to df, that fourth row says that
- The person in the fourth position…
- Says hello by saying “Yo”
- And has no kids
We’re going to want to keep that straight when we want to, say, look at the relationship between having kids and your position in the race.
Or how the number of kids relates to how you say hello!

## # A tibble: 5 x 3
##   RacePosition WayTheySayHi NumberofKids
##          <int> <fct>               <dbl>
## 1            1 Hi                      3
## 2            2 Hello                   5
## 3            3 Hey                     1
## 4            4 Yo                      0
## 5            5 Hi                      2
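
Because each row is one observation, we can pull out a whole row at once and everything about that person stays together. A minimal sketch, using a made-up copy of the race data called race_df:

```r
# 'race_df' is a made-up copy of the race data, used only for this sketch
race_df <- data.frame(RacePosition = 1:5,
                      WayTheySayHi = c('Hi', 'Hello', 'Hey', 'Yo', 'Hi'),
                      NumberofKids = c(3, 5, 1, 0, 2))

# Everything we know about the fourth-place runner, kept together
fourth <- race_df[4, ]
fourth

# A logical in the row slot keeps whole rows: all runners with no kids
no_kids <- race_df[race_df$NumberofKids == 0, ]
no_kids
```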

Working With Data Frames

Not to mention, we can manipulate data frames and tibbles!
Let’s figure out how we can:
- Create new variables
- Change variables
- Rename variables
It’s very common that you’ll have to work with data a little before analyzing it

Creating New Variables

Easy! data.frames are just lists of vectors
So create a vector and tell R where in that list to stick it!
Use descriptive names so you know what the variable is

df$State <- c('Alaska','California','California','Maine','Florida')
df

## # A tibble: 5 x 4
##   RacePosition WayTheySayHi NumberofKids State     
##          <int> <fct>               <dbl> <chr>     
## 1            1 Hi                      3 Alaska    
## 2            2 Hello                   5 California
## 3            3 Hey                     1 California
## 4            4 Yo                      0 Maine     
## 5            5 Hi                      2 Florida
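
The new vector doesn't have to be typed in by hand. It can be calculated from columns that are already there. A quick sketch, with made-up names new_var_df, HasKids, and KidsSquared:

```r
# 'new_var_df' and the new column names are made up for this sketch
new_var_df <- data.frame(NumberofKids = c(3, 5, 1, 0, 2))

# Build each new column from an existing one and stick it in the list
new_var_df$HasKids     <- new_var_df$NumberofKids > 0
new_var_df$KidsSquared <- new_var_df$NumberofKids^2
new_var_df
```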

Our Approach - dplyr and Tidyverse

That’s the base-R way to do it, anyway
We’re going to be using dplyr (think pliers) for data manipulation instead
dplyr syntax is inspired by SQL - so learning dplyr will give you a leg up if you want to learn SQL later. Plus it’s just better.

Packages

tidyverse isn’t a part of base R. It’s in a package, so we’ll need to install it
We can install packages using install.packages('nameofpackage')

install.packages('tidyverse')

We can then check whether it’s installed in the Packages tab
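
Besides the Packages tab, you can also check from the Console. This is just a sketch: is_installed() is a hypothetical helper name, built on base R's requireNamespace().

```r
# is_installed() is a hypothetical helper name, not a built-in function
is_installed <- function(pkg) {
  # requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
  requireNamespace(pkg, quietly = TRUE)
}

is_installed('stats')       # TRUE: stats ships with R
is_installed('tidyverse')   # TRUE only once the install has worked
```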

Packages

Before we can use it we must then use the library() command to open it up
We’ll need to run library() for it again every time we open up R if we want to use the package

library(tidyverse)

There are literally thousands of useful packages for R, and we’re going to be using a few! Tidyverse will just be our first of many
Google R package X to look for packages that do X.

Variable creation with dplyr

The mutate command will “mutate” our data frame to have a new column in it. It hands back the changed data frame, so we overwrite df with the result.
The pipe %>% says “take df and send it to that mutate command to use”
Or we can stick the data frame itself in the mutate command

library(tidyverse)
df <- df %>%
  mutate(State = c('Alaska','California','California','Maine','Florida'))
df <- mutate(df,State = c('Alaska','California','California','Maine','Florida'))
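
To convince yourself the two forms really do the same thing, here is a small sketch on a made-up data frame (pipe_df and double_x are hypothetical names). It loads dplyr directly, which is one of the packages that library(tidyverse) loads.

```r
library(dplyr)   # part of the tidyverse; provides mutate() and %>%

# 'pipe_df' is a made-up data frame used only for this sketch
pipe_df <- data.frame(x = 1:3)

piped  <- pipe_df %>% mutate(double_x = x * 2)   # send pipe_df into mutate
direct <- mutate(pipe_df, double_x = x * 2)      # put pipe_df inside mutate

all.equal(piped, direct)   # the two results match
```

The pipe pays off once several commands are chained, since the steps then read top to bottom.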

Creating New Variables

We can use all the tricks we already know about creating vectors
We can create multiple new variables in one mutate command

df <- df %>% mutate(MoreThanTwoKids = NumberofKids > 2,
                    One = 1,
                    KidsPlusPosition = NumberofKids + RacePosition)
df

## # A tibble: 5 x 7
##   RacePosition WayTheySayHi NumberofKids State MoreThanTwoKids   One
##          <int> <fct>               <dbl> <chr> <lgl>           <dbl>
## 1            1 Hi                      3 Alas~ TRUE                1
## 2            2 Hello                   5 Cali~ TRUE                1
## 3            3 Hey                     1 Cali~ FALSE               1
## 4            4 Yo                      0 Maine FALSE               1
## 5            5 Hi                      2 Flor~ FALSE               1
## # ... with 1 more variable: KidsPlusPosition <dbl>

Manipulating Variables

We can’t really change variables, but we sure can overwrite them!
We can drop variables with - in the dplyr select command
Note we chain multiple dplyr commands with %>%

df <- df %>% 
  select(-KidsPlusPosition,-WayTheySayHi,-One) %>%
  mutate(State = as.factor(State),
         RacePosition = RacePosition - 1)
df$State[3] <- 'Alaska'
str(df)

## Classes 'tbl_df', 'tbl' and 'data.frame':    5 obs. of  4 variables:
##  $ RacePosition   : num  0 1 2 3 4
##  $ NumberofKids   : num  3 5 1 0 2
##  $ State          : Factor w/ 4 levels "Alaska","California",..: 1 2 1 4 3
##  $ MoreThanTwoKids: logi  TRUE TRUE FALSE FALSE FALSE
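
select can also keep variables instead of dropping them. A minimal sketch on a made-up data frame, select_df, showing both directions:

```r
library(dplyr)   # part of the tidyverse; provides select() and %>%

# 'select_df' is a made-up data frame used only for this sketch
select_df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

dropped <- select_df %>% select(-b)     # drop b, keep everything else
kept    <- select_df %>% select(a, c)   # name the columns to keep instead
names(dropped)
names(kept)
```

Dropping is handier when you want most of the columns; keeping is handier when you only want a few.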

Renaming Variables

Sometimes it will make sense to change the names of the variables we have.
Names are stored in names(df) which we can edit directly
Or the rename() command in dplyr has us covered

names(df)

## [1] "RacePosition"    "NumberofKids"    "State"           "MoreThanTwoKids"

# Base-R alternative (commented out): overwrite all the names at once
# names(df) <- c('Pos','Num.Kids','State','mt2Kids')
df <- df %>% rename(Pos = RacePosition, Num.Kids=NumberofKids,
                    mt2Kids = MoreThanTwoKids)
names(df)

## [1] "Pos"      "Num.Kids" "State"    "mt2Kids"
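
Editing names() directly also works for just one variable, without retyping all the others. A sketch with a made-up data frame, rename_df:

```r
# 'rename_df' is a made-up data frame used only for this sketch
rename_df <- data.frame(Pos = 1:3, Num.Kids = c(3, 5, 1))

# Find the one name we want to change and overwrite only that element
names(rename_df)[names(rename_df) == 'Num.Kids'] <- 'NumberofKids'
names(rename_df)
```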

tidylog

Protip: after loading the tidyverse, also load the tidylog package. This will tell you what each step of your dplyr command does!

library(tidyverse)
library(tidylog)
df <- df %>% mutate(Pos = Pos + 1,
                    Num.Kids = 10)

## mutate: changed 5 values (100%) of 'Pos' (0 new NA)

## mutate: changed 5 values (100%) of 'Num.Kids' (0 new NA)

Practice

Create a data set called data with three variables: a is all even numbers from 2 to 20, b is c(0,1) over and over, and c is any ten-element numeric vector of your choice.
Rename them to EvenNumbers, Treatment, Outcome.
Add a logical variable called Big that’s true whenever EvenNumbers is greater than 15
Increase Outcome by 1 for all the rows where Treatment is 1.
Create a logical AboveMean that is true whenever Outcome is above the mean of Outcome.
Display the data structure

Practice Answers

data <- data.frame(a = 1:10*2,
                   b = c(0,1),
                   c = sample(1:100,10,replace=FALSE)) %>%
  rename(EvenNumbers = a, Treatment = b, Outcome = c)

data <- data %>%
  mutate(Big = EvenNumbers > 15,
         Outcome = Outcome + Treatment,
         AboveMean = Outcome > mean(Outcome))
str(data)

Other Ways to Get Data

Of course, most of the time we aren’t making up data
We get it from the real world!
Two main ways to do this:
The data() function in R
Or reading in files, usually with one of the read commands like read.csv()

data()

R has many baked-in data sets, and more in packages!
Just type in data( and see what options it autocompletes
We can load in data and look at it
Many of these data sets have help files too

data(LifeCycleSavings)
help(LifeCycleSavings)
head(LifeCycleSavings)

##              sr pop15 pop75     dpi ddpi
## Australia 11.43 29.35  2.87 2329.68 2.87
## Austria   12.07 23.32  4.41 1507.99 3.93
## Belgium   13.17 23.80  4.43 2108.47 3.82
## Bolivia    5.75 41.89  1.67  189.13 0.22
## Brazil    12.88 42.19  0.83  728.47 4.56
## Canada     8.79 31.72  2.85 2982.88 2.43
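
If you'd rather see a list than rely on autocomplete, data() can also report what's available. A rough sketch, assuming the built-in datasets package (available is a made-up variable name):

```r
# List the data sets that ship in R's built-in 'datasets' package
available <- data(package = 'datasets')$results
head(available[, c('Item', 'Title')])   # name and short description of each

# Load one and check its size
data(LifeCycleSavings)
dim(LifeCycleSavings)   # rows, then columns
```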

read

Often there will be data files on the internet or your computer
You can read these in with one of the many read commands, like read.csv
CSV is a very basic spreadsheet format stored in a text file. You can create it from Excel or Sheets (or just write it)
There are different read commands for different file types
Make sure your working directory is set to where the data is!
Documentation will usually be in a different file

datafromCSV <- read.csv('mydatafile.csv')
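
Since 'mydatafile.csv' is only a placeholder, here is a sketch you can actually run: it writes a tiny made-up CSV to a temporary file and reads it back. The names csv_path, to_save, and from_csv are hypothetical.

```r
# Write a tiny made-up data frame to a temporary CSV file...
csv_path <- tempfile(fileext = '.csv')
to_save  <- data.frame(id = 1:3, score = c(90, 85, 77))
write.csv(to_save, csv_path, row.names = FALSE)   # skip the row-number column

# ...and read it back in, just like a file you downloaded
from_csv <- read.csv(csv_path)
str(from_csv)

getwd()   # the working directory, where R looks for files given by name only
```

Here the full path is stored in csv_path, so the working directory doesn't matter. With a bare file name like 'mydatafile.csv', it does.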

Practice

Use data() to open up a data set - any data set (although it should be in data.frame or tibble form - try again if you get something else)
Use str() and help() to examine that data set
- What is it data of (help file)? How was it collected and what do the variables represent?
- What kinds of variables are in there and what kinds of values do they have (str() and head())?
Create a new variable using the variables that are already in there
Take a mean of one of the variables
Rename a variable to be more descriptive based on what you saw in help().