Lecture 5: Working with Data Part 1

Nick Huntington-Klein

January 18, 2019

Working with Data

  • R is all about working with data!
  • Today we’re going to start going over the use of data.frames and tibbles
  • data.frames are an object type; tibbles are basically data.frames with some extra bells and whistles, from the tibble package (which loads as part of the tidyverse)
  • Most of the time, you’ll be doing calculations using them

The Basic Idea

  • Conceptually, data.frames are basically spreadsheets
  • Technically, they’re a list of vectors
[Figure: Spreadsheet vs. data.frame]
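To make the "list of vectors" idea concrete, here is a minimal sketch in base R. The toy data frame scores and its columns are made up for illustration and are not part of the lecture data.

```r
# Hypothetical toy data (not from the lecture)
scores <- data.frame(id = 1:3,
                     score = c(90, 85, 77))

is.list(scores)     # TRUE: a data.frame is a list under the hood
length(scores)      # 2: one list element per column
scores[["score"]]   # pulls one of the vectors back out of the list
```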

Example

  • It’s a list of vectors… we can make one by listing some (same-length) vectors!
  • (Note the use of = here, not <-)
# Base R version: a plain data.frame
df <- data.frame(RacePosition = 1:5,
                 WayTheySayHi = as.factor(c('Hi','Hello','Hey','Yo','Hi')),
                 NumberofKids = c(3,5,1,0,2))
# tidyverse version: tibble() needs the tidyverse loaded first
library(tidyverse)
df <- tibble(RacePosition = 1:5,
             WayTheySayHi = as.factor(c('Hi','Hello','Hey','Yo','Hi')),
             NumberofKids = c(3,5,1,0,2))
df
## # A tibble: 5 x 3
##   RacePosition WayTheySayHi NumberofKids
##          <int> <fct>               <dbl>
## 1            1 Hi                      3
## 2            2 Hello                   5
## 3            3 Hey                     1
## 4            4 Yo                      0
## 5            5 Hi                      2

Looking Over Data

  • Now that we have our data, how can we take a look at it?
  • We can just name it in the Console and look at the whole thing, but that’s usually too much data
  • We can look at the whole thing by clicking on it in Environment to open it up

Glancing at Data

  • What if we just want a quick overview, rather than looking at the whole spreadsheet?
    • Down-arrow in the Environment tab
    • head() (look at the head of the data - first six rows)
    • str() (structure)
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    5 obs. of  3 variables:
##  $ RacePosition: int  1 2 3 4 5
##  $ WayTheySayHi: Factor w/ 4 levels "Hello","Hey",..: 3 1 2 4 3
##  $ NumberofKids: num  3 5 1 0 2

So What?

  • What do we want to know about our data?
    • What is this data OF? (won’t get that with str())
    • Data types
    • The kinds of values it takes
    • How many observations
    • Variable names.
    • Summary statistics and observation level (we’ll get to that later)
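Several items on this list have a one-line base-R check. A rough sketch, rebuilding the example df as a plain data.frame so it runs without any packages:

```r
df <- data.frame(RacePosition = 1:5,
                 WayTheySayHi = as.factor(c('Hi','Hello','Hey','Yo','Hi')),
                 NumberofKids = c(3,5,1,0,2))

nrow(df)           # how many observations
names(df)          # variable names
sapply(df, class)  # data type of each column
summary(df)        # the kinds of values each column takes
```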

Getting at Data

  • Now we have a data frame, df. How do we use it?
  • One way is that we can pull those vectors back out with $! Note autocompletion of variable names.
  • We can treat it just like the vectors we had before
df$NumberofKids
## [1] 3 5 1 0 2
df$NumberofKids[2]
## [1] 5
df$NumberofKids >= 3
## [1]  TRUE  TRUE FALSE FALSE FALSE

Quick Note

  • There are actually many many ways to do this
  • (some of which I even go over in the videos)
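  • For example, you can use [row,column] to get at data: with a plain data.frame, df$NumberofKids >= 3 is equivalent to df[,3] >= 3 or df[,'NumberofKids'] >= 3. With a tibble, df[,3] stays a one-column tibble, so use df[[3]] to pull out the vector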
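A small sketch of those equivalent routes, using a plain data.frame copy of the example data (the name df_base is just for illustration). With a tibble, [ , 3] keeps the result as a one-column tibble, so $ or [[ ]] is the safer way to get a vector.

```r
df_base <- data.frame(RacePosition = 1:5,
                      WayTheySayHi = as.factor(c('Hi','Hello','Hey','Yo','Hi')),
                      NumberofKids = c(3,5,1,0,2))

by_dollar <- df_base$NumberofKids >= 3
by_number <- df_base[, 3] >= 3
by_name   <- df_base[, 'NumberofKids'] >= 3
by_double <- df_base[['NumberofKids']] >= 3

# All four routes give the same logical vector
identical(by_dollar, by_number)
identical(by_dollar, by_name)
identical(by_dollar, by_double)
```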

That Said!

  • We can run the same calculations on these vectors as we were doing before
mean(df$RacePosition)
## [1] 3
df$WayTheySayHi[4]
## [1] Yo
## Levels: Hello Hey Hi Yo
sum(df$NumberofKids <= 1)
## [1] 2

Practice

  • Create df2 <- data.frame(a = 1:20, b = 0:19*2, c = sample(101:200,20,replace=TRUE))
  • What is the average of c?
  • What is the sum of a times b?
  • Did you get any values of c 103 or below? (make a logical)
  • What is on the 8th row of b?
  • How many rows have b above 10 AND c below 150?

Practice Answers

mean(df2$c)
## [1] 152.7
sum(df2$a*df2$b)
## [1] 5320
sum(df2$c <= 103) > 0
## [1] FALSE
df2$b[8]
## [1] 14
sum(df2$b > 10 & df2$c < 150)
## [1] 6
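Because sample() draws random numbers, your answers involving c will probably not match the ones shown here. One way to make a run repeatable is set.seed(); the seed value 123 below is arbitrary.

```r
set.seed(123)  # any fixed number works; 123 is an arbitrary choice
first_draw <- sample(101:200, 20, replace = TRUE)

set.seed(123)  # resetting the seed replays the same draws
second_draw <- sample(101:200, 20, replace = TRUE)

identical(first_draw, second_draw)  # TRUE
any(first_draw <= 103)              # a shorter way to ask "any values 103 or below?"
```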

The Importance of Rows

  • So far we’ve basically just taken data frames and pulled the vectors (columns) back out
  • So… why not just stick with the vectors?
  • Because before long we’re not just going to be interested in the columns one at a time
  • We’ll want to keep track of each row - each row is an observation. The same observation!

The Importance of Rows

  • Going back to df, that fourth row says that
    • The person in the fourth position…
    • Says hello by saying “Yo”
    • And has no kids
  • We’re going to want to keep that straight when we want to, say, look at the relationship between having kids and your position in the race.
  • Or how the number of kids relates to how you say hello!
## # A tibble: 5 x 3
##   RacePosition WayTheySayHi NumberofKids
##          <int> <fct>               <dbl>
## 1            1 Hi                      3
## 2            2 Hello                   5
## 3            3 Hey                     1
## 4            4 Yo                      0
## 5            5 Hi                      2
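Here is a quick sketch of why keeping rows together matters, in base R with a plain data.frame copy of the example (df_base and by_kids are illustrative names). Sorting a single vector loses track of who is who; reordering the whole data frame does not.

```r
df_base <- data.frame(RacePosition = 1:5,
                      WayTheySayHi = c('Hi','Hello','Hey','Yo','Hi'),
                      NumberofKids = c(3,5,1,0,2))

# Sorting one vector on its own breaks the link to the other columns
sort(df_base$NumberofKids)

# Reordering the rows keeps each observation intact
by_kids <- df_base[order(df_base$NumberofKids), ]
by_kids[1, ]  # still the person in position 4 who says "Yo" and has no kids
```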

Working With Data Frames

  • Not to mention, we can manipulate data frames and tibbles!
  • Let’s figure out how we can:
    • Create new variables
    • Change variables
    • Rename variables
  • It’s very common that you’ll have to work with data a little before analyzing it

Creating New Variables

  • Easy! data.frames are just lists of vectors
  • So create a vector and tell R where in that list to stick it!
  • Use descriptive names so you know what the variable is
df$State <- c('Alaska','California','California','Maine','Florida')
df
## # A tibble: 5 x 4
##   RacePosition WayTheySayHi NumberofKids State     
##          <int> <fct>               <dbl> <chr>     
## 1            1 Hi                      3 Alaska    
## 2            2 Hello                   5 California
## 3            3 Hey                     1 California
## 4            4 Yo                      0 Maine     
## 5            5 Hi                      2 Florida

Our Approach - dplyr and the Tidyverse

  • That’s the base-R way to do it, anyway
  • We’re going to be using dplyr (think pliers) for data manipulation instead
  • dplyr syntax is inspired by SQL - so learning dplyr will give you a leg up if you want to learn SQL later. Plus it’s just better.

Packages

  • tidyverse isn’t a part of base R. It’s in a package, so we’ll need to install it
  • We can install packages using install.packages('nameofpackage')
install.packages('tidyverse')
  • We can then check whether it’s installed in the Packages tab

Packages

  • Before we can use it we must then use the library() command to open it up
  • We’ll need to run library() for it again every time we open up R if we want to use the package
library(tidyverse)
  • There are literally thousands of useful packages for R, and we’re going to be using a few! Tidyverse will just be our first of many
  • Google R package X to look for packages that do X.
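A common pattern is to install a package only when it is missing, so a script can be rerun without reinstalling every time. The helper name install_if_missing is made up for this sketch; requireNamespace() and install.packages() ship with R.

```r
# Hypothetical helper: install a package only if it is not already available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

install_if_missing('stats')        # 'stats' ships with R, so nothing is installed
# install_if_missing('tidyverse')  # uncomment to use it for the tidyverse
```

You still need library() afterwards; this only takes care of the install step.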

Variable Creation with dplyr

  • The mutate command will “mutate” our data frame to have a new column in it. It returns the changed data frame, so we save the result by assigning it back to df.
  • The pipe %>% says “take df and send it to that mutate command to use”
  • Or we can stick the data frame itself in the mutate command
library(tidyverse)
df <- df %>%
  mutate(State = c('Alaska','California','California','Maine','Florida'))
df <- mutate(df,State = c('Alaska','California','California','Maine','Florida'))
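To see that the pipe is only a different way of writing the same call, here is a sketch comparing the two forms. It assumes dplyr is installed; the toy data frame kids and the HasKids column are made up for illustration.

```r
library(dplyr)  # part of the tidyverse; provides mutate() and %>%

kids <- data.frame(NumberofKids = c(3, 5, 1, 0, 2))

piped  <- kids %>% mutate(HasKids = NumberofKids > 0)
nested <- mutate(kids, HasKids = NumberofKids > 0)

identical(piped, nested)  # both forms produce the same data frame
```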

Creating New Variables

  • We can use all the tricks we already know about creating vectors
  • We can create multiple new variables in one mutate command
df <- df %>% mutate(MoreThanTwoKids = NumberofKids > 2,
                    One = 1,
                    KidsPlusPosition = NumberofKids + RacePosition)
df
## # A tibble: 5 x 7
##   RacePosition WayTheySayHi NumberofKids State MoreThanTwoKids   One
##          <int> <fct>               <dbl> <chr> <lgl>           <dbl>
## 1            1 Hi                      3 Alas~ TRUE                1
## 2            2 Hello                   5 Cali~ TRUE                1
## 3            3 Hey                     1 Cali~ FALSE               1
## 4            4 Yo                      0 Maine FALSE               1
## 5            5 Hi                      2 Flor~ FALSE               1
## # ... with 1 more variable: KidsPlusPosition <dbl>

Manipulating Variables

  • We can’t really change variables, but we sure can overwrite them!
  • We can drop variables with - in the dplyr select command
  • Note we chain multiple dplyr commands with %>%
df <- df %>% 
  select(-KidsPlusPosition,-WayTheySayHi,-One) %>%
  mutate(State = as.factor(State),
         RacePosition = RacePosition - 1)
df$State[3] <- 'Alaska'
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    5 obs. of  4 variables:
##  $ RacePosition   : num  0 1 2 3 4
##  $ NumberofKids   : num  3 5 1 0 2
##  $ State          : Factor w/ 4 levels "Alaska","California",..: 1 2 1 4 3
##  $ MoreThanTwoKids: logi  TRUE TRUE FALSE FALSE FALSE
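One thing to watch with factors: overwriting a value works smoothly only when the new value is already one of the levels, as 'Alaska' was above. A base-R sketch ('Texas' is just an illustrative value that is not in the data):

```r
state <- as.factor(c('Alaska','California','California','Maine','Florida'))

state[3] <- 'Alaska'  # fine: 'Alaska' is an existing level

bad <- state
suppressWarnings(bad[3] <- 'Texas')  # not a level: R warns and stores NA
is.na(bad[3])

# Add the level first, and then the assignment works
levels(state) <- c(levels(state), 'Texas')
state[3] <- 'Texas'
```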

Renaming Variables

  • Sometimes it will make sense to change the names of the variables we have.
  • Names are stored in names(df) which we can edit directly
  • Or the rename() command in dplyr has us covered
names(df)
## [1] "RacePosition"    "NumberofKids"    "State"           "MoreThanTwoKids"
#names(df) <- c('Pos','Num.Kids','State','mt2Kids')
df <- df %>% rename(Pos = RacePosition, Num.Kids=NumberofKids,
                    mt2Kids = MoreThanTwoKids)
names(df)
## [1] "Pos"      "Num.Kids" "State"    "mt2Kids"
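If you go the names(df) route and only want to change one name, you can index into names() rather than retyping all of them. A base-R sketch on a small made-up data frame:

```r
toy <- data.frame(RacePosition = 1:3, NumberofKids = c(3, 5, 1))

# Replace only the name that matches, leaving the others alone
names(toy)[names(toy) == 'RacePosition'] <- 'Pos'
names(toy)
```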

tidylog

  • Protip: after loading the tidyverse, also load the tidylog package (install it first, like any other package). This will tell you what each step of your dplyr command does!
library(tidyverse)
library(tidylog)
df <- df %>% mutate(Pos = Pos + 1,
                    Num.Kids = 10)
## mutate: changed 5 values (100%) of 'Pos' (0 new NA)
## mutate: changed 5 values (100%) of 'Num.Kids' (0 new NA)

Practice

  • Create a data set data with three variables: a is all even numbers from 2 to 20, b is c(0,1) over and over, and c is any ten-element numeric vector of your choice.
  • Rename them to EvenNumbers, Treatment, Outcome.
  • Add a logical variable called Big that’s true whenever EvenNumbers is greater than 15
  • Increase Outcome by 1 for all the rows where Treatment is 1.
  • Create a logical AboveMean that is true whenever Outcome is above the mean of Outcome.
  • Display the data structure

Practice Answers

data <- data.frame(a = 1:10*2,
                   b = c(0,1),
                   c = sample(1:100,10,replace=FALSE)) %>%
  rename(EvenNumbers = a, Treatment = b, Outcome = c)

data <- data %>%
  mutate(Big = EvenNumbers > 15,
         Outcome = Outcome + Treatment,
         AboveMean = Outcome > mean(Outcome))
str(data)
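The answer leans on data.frame() recycling c(0,1) to fill all ten rows, which works because 10 is a multiple of 2. If you prefer to spell that out, rep() does the repeating explicitly. A base-R sketch (the object names are illustrative):

```r
recycled <- data.frame(a = 1:10*2, b = c(0,1))
explicit <- data.frame(a = 1:10*2, b = rep(c(0,1), times = 5))

identical(recycled$b, explicit$b)  # same 0,1,0,1,... pattern
```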

Other Ways to Get Data

  • Of course, most of the time we aren’t making up data
  • We get it from the real world!
  • One main way to do this is the data() function in R
  • The other is reading in files, usually with one of the read commands like read.csv()

data()

  • R has many baked-in data sets, and more in packages!
  • Just type in data( and see what options it autocompletes
  • We can load in data and look at it
  • Many of these data sets have help files too
data(LifeCycleSavings)
help(LifeCycleSavings)
head(LifeCycleSavings)
##              sr pop15 pop75     dpi ddpi
## Australia 11.43 29.35  2.87 2329.68 2.87
## Austria   12.07 23.32  4.41 1507.99 3.93
## Belgium   13.17 23.80  4.43 2108.47 3.82
## Bolivia    5.75 41.89  1.67  189.13 0.22
## Brazil    12.88 42.19  0.83  728.47 4.56
## Canada     8.79 31.72  2.85 2982.88 2.43
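Not every built-in data set is a data.frame, so it can help to check before you start working with one. A base-R sketch; AirPassengers is another data set that ships with R, and it is stored as a time series rather than a data frame:

```r
data(LifeCycleSavings)
is.data.frame(LifeCycleSavings)  # TRUE
dim(LifeCycleSavings)            # number of rows, then number of columns

data(AirPassengers)
is.data.frame(AirPassengers)     # FALSE
class(AirPassengers)             # "ts", a time series
```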

read

  • Often there will be data files on the internet or your computer
  • You can read this in with one of the many read commands, like read.csv
  • CSV is a very basic spreadsheet format stored in a text file. You can create one from Excel or Sheets (or just write it by hand)
  • There are different read commands for different file types
  • Make sure your working directory is set to where the data is!
  • Documentation will usually be in a different file
datafromCSV <- read.csv('mydatafile.csv')
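If you don't have a CSV handy, you can write one and read it back to see the whole round trip. A base-R sketch that uses a temporary file so it doesn't depend on your working directory (the names original and csv_path are illustrative):

```r
original <- data.frame(Pos = 1:3, Num.Kids = c(3, 5, 1))

csv_path <- tempfile(fileext = '.csv')            # throwaway file location
write.csv(original, csv_path, row.names = FALSE)  # leave out the row-number column

datafromCSV <- read.csv(csv_path)
str(datafromCSV)

getwd()  # where R looks for files given as a bare name like 'mydatafile.csv'
```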

Practice

  • Use data() to open up a data set - any data set (although it should be in data.frame or tibble form - try again if you get something else)
  • Use str() and help() to examine that data set
    • What is it data of (help file)? How was it collected and what do the variables represent?
    • What kinds of variables are in there and what kinds of values do they have (str() and head())?
  • Create a new variable using the variables that are already in there
  • Take a mean of one of the variables
  • Rename a variable to be more descriptive based on what you saw in help().