Lecture 5: Working with Data Part 1
Nick Huntington-Klein
January 18, 2019
Working with Data
- R is all about working with data!
- Today we’re going to start going over the use of data.frames and tibbles
- data.frames are an object type; tibbles are basically data.frames with some extra bells and whistles, from the tidyverse package
- Most of the time, you’ll be doing calculations using them
The Basic Idea
- Conceptually, data.frames are basically spreadsheets
- Technically, they’re a list of vectors
Example
- It’s a list of vectors… we can make one by listing some (same-length) vectors!
- (Note the use of = inside the function call to name each column, not <-)
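# Base R version: a data.frame built from three same-length vectors
df <- data.frame(RacePosition = 1:5,
WayTheySayHi = as.factor(c('Hi','Hello','Hey','Yo','Hi')),
NumberofKids = c(3,5,1,0,2))
# tibble version: needs the tidyverse loaded first. This overwrites df
library(tidyverse)
df <- tibble(RacePosition = 1:5,
WayTheySayHi = as.factor(c('Hi','Hello','Hey','Yo','Hi')),
NumberofKids = c(3,5,1,0,2))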
df
## # A tibble: 5 x 3
## RacePosition WayTheySayHi NumberofKids
## <int> <fct> <dbl>
## 1 1 Hi 3
## 2 2 Hello 5
## 3 3 Hey 1
## 4 4 Yo 0
## 5 5 Hi 2
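To make the "list of vectors" idea concrete, here is a minimal base-R sketch. The name df_base is hypothetical and used only for this illustration; it holds two of the columns from the example above.

```r
# Hypothetical name df_base; base R only, no packages needed
df_base <- data.frame(RacePosition = 1:5,
                      NumberofKids = c(3, 5, 1, 0, 2))

is.list(df_base)            # a data.frame really is a list
length(df_base)             # list length = number of columns
df_base[["NumberofKids"]]   # pull one list element (a vector) back out
```

Each column is one element of the list, which is why all the vectors must have the same length.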
Looking Over Data
- Now that we have our data, how can we take a look at it?
- We can just name it in the Console and look at the whole thing, but that’s usually too much data
- We can look at the whole thing by clicking on it in Environment to open it up
Glancing at Data
- What if we just want a quick overview, rather than looking at the whole spreadsheet?
- Down-arrow in the Environment tab
- head() (look at the head of the data - first six rows)
- str() (structure)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of 3 variables:
## $ RacePosition: int 1 2 3 4 5
## $ WayTheySayHi: Factor w/ 4 levels "Hello","Hey",..: 3 1 2 4 3
## $ NumberofKids: num 3 5 1 0 2
So What?
- What do we want to know about our data?
- What is this data OF? (won’t get that with str())
- Data types
- The kinds of values it takes
- How many observations
- Variable names.
- Summary statistics and observation level (we’ll get to that later)
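As a rough sketch, most of the items on this list can be checked with a handful of base-R calls. The name df_base is hypothetical; it rebuilds the example data as a plain data.frame so the code runs on its own.

```r
# Hypothetical name df_base; same values as the example data
df_base <- data.frame(RacePosition = 1:5,
                      WayTheySayHi = as.factor(c('Hi','Hello','Hey','Yo','Hi')),
                      NumberofKids = c(3, 5, 1, 0, 2))

head(df_base, 3)               # first three rows only (default is six)
str(df_base)                   # types and a peek at the values
nrow(df_base)                  # how many observations
names(df_base)                 # variable names
sapply(df_base, class)         # data type of each column
levels(df_base$WayTheySayHi)   # the values a factor can take
```

What the data is actually about still has to come from documentation, not from any of these calls.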
Getting at Data
- Now we have a data frame, df. How do we use it?
- One way is that we can pull those vectors back out with $! Note autocompletion of variable names.
- We can treat it just like the vectors we had before
## [1] 3 5 1 0 2
## [1] 5
## [1] TRUE TRUE FALSE FALSE FALSE
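The calls behind the three outputs above are not shown on the slide. The sketch below is one set of calls that would produce them, assuming the NumberofKids column from the example (df_base is a hypothetical stand-in name).

```r
# Hypothetical stand-in for df, holding just the column we need
df_base <- data.frame(NumberofKids = c(3, 5, 1, 0, 2))

whole_column <- df_base$NumberofKids        # the full vector
second_value <- df_base$NumberofKids[2]     # index into it like any vector
three_plus   <- df_base$NumberofKids >= 3   # element-wise comparison

whole_column
second_value
three_plus
```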
Quick Note
- There are actually many many ways to do this
- (some of which I even go over in the videos)
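- For example, you can use [row,column] to get at data: df$NumberofKids >= 3 gives the same TRUE/FALSE values as df[,3] >= 3 or df[,'NumberofKids']>=3 (with a tibble, the bracket versions come back as a one-column result rather than a plain vector; df[[3]] always gives the vector)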
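A small sketch of the equivalence, using a plain data.frame so that single-column brackets return a vector. The name df_base is hypothetical.

```r
# Hypothetical name df_base; a base-R data.frame, not a tibble
df_base <- data.frame(RacePosition = 1:5,
                      WayTheySayHi = as.factor(c('Hi','Hello','Hey','Yo','Hi')),
                      NumberofKids = c(3, 5, 1, 0, 2))

by_dollar   <- df_base$NumberofKids >= 3
by_position <- df_base[, 3] >= 3                # column 3, all rows
by_name     <- df_base[, 'NumberofKids'] >= 3   # same column, by name
by_double   <- df_base[[3]] >= 3                # list-style extraction

identical(by_dollar, by_position)
identical(by_dollar, by_name)
identical(by_dollar, by_double)

df_base[4, 'NumberofKids']   # one cell: row 4 of NumberofKids
```

Selecting by name is usually safer than by position, since column order can change as you add or drop variables.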
That Said!
- We can run the same calculations on these vectors as we were doing before
## [1] 3
## [1] Yo
## Levels: Hello Hey Hi Yo
sum(df$NumberofKids <= 1)
## [1] 2
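A few more calculations of the same kind, as an illustrative sketch. These are not necessarily the calls used for the outputs above, and df_base is a hypothetical name.

```r
# Hypothetical name df_base; same NumberofKids values as the example
df_base <- data.frame(NumberofKids = c(3, 5, 1, 0, 2))

avg_kids   <- mean(df_base$NumberofKids)       # average number of kids
most_kids  <- max(df_base$NumberofKids)        # largest value
few_kids   <- sum(df_base$NumberofKids <= 1)   # TRUE counts as 1, so this counts rows
share_none <- mean(df_base$NumberofKids == 0)  # mean of a logical = proportion TRUE

avg_kids
most_kids
few_kids
share_none
```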
Practice
- Create df2 <- data.frame(a = 1:20, b = 0:19*2, c = sample(101:200,20,replace=TRUE))
- What is the average of c?
- What is the sum of a times b?
- Did you get any values of c 103 or below? (make a logical)
- What is on the 8th row of b?
- How many rows have b above 10 AND c below 150?
Practice Answers
mean(df2$c)
sum(df2$a*df2$b)
sum(df2$c <= 103) > 0
df2$b[8]
sum(df2$b > 10 & df2$c < 150)
## [1] 152.7
## [1] 5320
## [1] FALSE
## [1] 14
## [1] 6
The Importance of Rows
- So far we’ve basically just taken data frames and pulled the vectors (columns) back out
- So… why not just stick with the vectors?
- Because before long we’re not just going to be interested in the columns one at a time
- We’ll want to keep track of each row - each row is an observation. The same observation!
The Importance of Rows
- Going back to df, that fourth row says that
- The person in the fourth position…
- Says hello by saying “Yo”
- And has no kids
- We’re going to want to keep that straight when we want to, say, look at the relationship between having kids and your position in the race.
- Or how the number of kids relates to how you say hello!
## # A tibble: 5 x 3
## RacePosition WayTheySayHi NumberofKids
## <int> <fct> <dbl>
## 1 1 Hi 3
## 2 2 Hello 5
## 3 3 Hey 1
## 4 4 Yo 0
## 5 5 Hi 2
Working With Data Frames
- Not to mention, we can manipulate data frames and tibbles!
- Let’s figure out how we can:
- Create new variables
- Change variables
- Rename variables
- It’s very common that you’ll have to work with data a little before analyzing it
Creating New Variables
- Easy! data.frames are just lists of vectors
- So create a vector and tell R where in that list to stick it!
- Use descriptive names so you know what the variable is
df$State <- c('Alaska','California','California','Maine','Florida')
df
## # A tibble: 5 x 4
## RacePosition WayTheySayHi NumberofKids State
## <int> <fct> <dbl> <chr>
## 1 1 Hi 3 Alaska
## 2 2 Hello 5 California
## 3 3 Hey 1 California
## 4 4 Yo 0 Maine
## 5 5 Hi 2 Florida
Our Approach - DPLYR and Tidyverse
- That’s the base-R way to do it, anyway
- We’re going to be using dplyr (think pliers) for data manipulation instead
- dplyr syntax is inspired by SQL - so learning dplyr will give you a leg up if you want to learn SQL later. Plus it’s just better.
Packages
- tidyverse isn’t a part of base R. It’s in a package, so we’ll need to install it
- We can install packages using install.packages('nameofpackage')
install.packages('tidyverse')
- We can then check whether it’s installed in the Packages tab
Packages
- Before we can use it we must then use the library() command to open it up
- We’ll need to run library() for it again every time we open up R if we want to use the package
- There are literally thousands of useful packages for R, and we’re going to be using a few! Tidyverse will just be our first of many
- Google R package X to look for packages that do X.
Variable creation with dplyr
- The mutate command will “mutate” our data frame to have a new column in it. mutate returns the changed data frame, so we assign the result back to df to overwrite the original.
- The pipe %>% says “take df and send it to that mutate command to use”
- Or we can stick the data frame itself in the mutate command
library(tidyverse)
df <- df %>%
mutate(State = c('Alaska','California','California','Maine','Florida'))
df <- mutate(df,State = c('Alaska','California','California','Maine','Florida'))
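A quick sketch of why the assignment back to df matters: mutate hands back a new data frame and leaves the original alone. The name df_demo is hypothetical, and only dplyr (part of the tidyverse) is loaded here.

```r
library(dplyr)   # dplyr is one of the tidyverse packages

# Hypothetical name df_demo, used only for this illustration
df_demo <- data.frame(RacePosition = 1:5)

# Without assignment the result is computed and then thrown away
df_demo %>% mutate(State = c('Alaska','California','California','Maine','Florida'))
names_before <- names(df_demo)   # still just "RacePosition"

# Assigning the result back overwrites df_demo with the new version
df_demo <- df_demo %>%
  mutate(State = c('Alaska','California','California','Maine','Florida'))
names_after <- names(df_demo)

names_before
names_after
```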
Creating New Variables
- We can use all the tricks we already know about creating vectors
- We can create multiple new variables in one mutate command
df <- df %>% mutate(MoreThanTwoKids = NumberofKids > 2,
One = 1,
KidsPlusPosition = NumberofKids + RacePosition)
df
## # A tibble: 5 x 7
## RacePosition WayTheySayHi NumberofKids State MoreThanTwoKids One
## <int> <fct> <dbl> <chr> <lgl> <dbl>
## 1 1 Hi 3 Alas~ TRUE 1
## 2 2 Hello 5 Cali~ TRUE 1
## 3 3 Hey 1 Cali~ FALSE 1
## 4 4 Yo 0 Maine FALSE 1
## 5 5 Hi 2 Flor~ FALSE 1
## # ... with 1 more variable: KidsPlusPosition <dbl>
Manipulating Variables
- We can’t really change variables, but we sure can overwrite them!
- We can drop variables with - in the dplyr select command
- Note we chain multiple dplyr commands with %>%
df <- df %>%
select(-KidsPlusPosition,-WayTheySayHi,-One) %>%
mutate(State = as.factor(State),
RacePosition = RacePosition - 1)
df$State[3] <- 'Alaska'
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of 4 variables:
## $ RacePosition : num 0 1 2 3 4
## $ NumberofKids : num 3 5 1 0 2
## $ State : Factor w/ 4 levels "Alaska","California",..: 1 2 1 4 3
## $ MoreThanTwoKids: logi TRUE TRUE FALSE FALSE FALSE
Renaming Variables
- Sometimes it will make sense to change the names of the variables we have.
- Names are stored in names(df) which we can edit directly
- Or the rename() command in dplyr has us covered
## [1] "RacePosition" "NumberofKids" "State" "MoreThanTwoKids"
#names(df) <- c('Pos','Num.Kids','State','mt2Kids')
df <- df %>% rename(Pos = RacePosition, Num.Kids=NumberofKids,
mt2Kids = MoreThanTwoKids)
names(df)
## [1] "Pos" "Num.Kids" "State" "mt2Kids"
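Editing names(df) directly does not have to mean retyping every name. As a base-R sketch, a single name can be replaced by matching on the old one (df_base is a hypothetical name):

```r
# Hypothetical name df_base; base R only
df_base <- data.frame(RacePosition = 1:5,
                      NumberofKids = c(3, 5, 1, 0, 2))

# Find the position of the old name and overwrite just that entry
names(df_base)[names(df_base) == 'NumberofKids'] <- 'Num.Kids'

names(df_base)
```

Matching on the old name avoids relying on column order, which is the main risk of overwriting the whole names vector at once.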
tidylog
- Protip: after loading the tidyverse, also load the tidylog package. This will tell you what each step of your dplyr command does!
library(tidyverse)
library(tidylog)
df <- df %>% mutate(Pos = Pos + 1,
Num.Kids = 10)
## mutate: changed 5 values (100%) of 'Pos' (0 new NA)
## mutate: changed 5 values (100%) of 'Num.Kids' (0 new NA)
Practice
- Create a data set data with three variables: a is all even numbers from 2 to 20, b is c(0,1) over and over, and c is any ten-element numeric vector of your choice.
- Rename them to EvenNumbers, Treatment, Outcome.
- Add a logical variable called Big that’s true whenever EvenNumbers is greater than 15
- Increase Outcome by 1 for all the rows where Treatment is 1.
- Create a logical AboveMean that is true whenever Outcome is above the mean of Outcome.
- Display the data structure
Practice Answers
data <- data.frame(a = 1:10*2,
b = c(0,1),
c = sample(1:100,10,replace=FALSE)) %>%
rename(EvenNumbers = a, Treatment = b, Outcome = c)
data <- data %>%
mutate(Big = EvenNumbers > 15,
Outcome = Outcome + Treatment,
AboveMean = Outcome > mean(Outcome))
str(data)
Other Ways to Get Data
- Of course, most of the time we aren’t making up data
- We get it from the real world!
- Two main ways to do this are the data() function in R
- Or reading in files, usually with one of the read commands like read.csv()
data()
- R has many baked-in data sets, and more in packages!
- Just type in data( and see what options it autocompletes
- We can load in data and look at it
- Many of these data sets have help files too
data(LifeCycleSavings)
help(LifeCycleSavings)
head(LifeCycleSavings)
## sr pop15 pop75 dpi ddpi
## Australia 11.43 29.35 2.87 2329.68 2.87
## Austria 12.07 23.32 4.41 1507.99 3.93
## Belgium 13.17 23.80 4.43 2108.47 3.82
## Bolivia 5.75 41.89 1.67 189.13 0.22
## Brazil 12.88 42.19 0.83 728.47 4.56
## Canada 8.79 31.72 2.85 2982.88 2.43
read
- Often there will be data files on the internet or your computer
- You can read this in with one of the many read commands, like read.csv
- CSV is a very basic spreadsheet format stored in a text file. You can create it from Excel or Sheets (or just write it)
- There are different read commands for different file types
- Make sure your working directory is set to where the data is!
- Documentation will usually be in a different file
datafromCSV <- read.csv('mydatafile.csv')
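To see the whole round trip without needing a real file, the sketch below writes a tiny CSV to a temporary location and reads it back. The file contents and the name csv_path are made up for illustration.

```r
# Write a small made-up CSV to a temporary file (csv_path is a hypothetical name)
csv_path <- tempfile(fileext = '.csv')
writeLines(c('RacePosition,NumberofKids',
             '1,3',
             '2,5',
             '3,1'),
           csv_path)

getwd()   # shows the working directory, where relative file names are looked up

# read.csv treats the first line as column names by default
datafromCSV <- read.csv(csv_path)
str(datafromCSV)
```

Because csv_path is a full path, this works regardless of the working directory; a bare file name like 'mydatafile.csv' only works if the file is in the folder getwd() reports.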
Practice
- Use data() to open up a data set - any data set (although it should be in data.frame or tibble form - try again if you get something else)
- Use str() and help() to examine that data set
- What is it data of (help file)? How was it collected and what do the variables represent?
- What kinds of variables are in there and what kinds of values do they have (str() and head())?
- Create a new variable using the variables that are already in there
- Take a mean of one of the variables
- Rename a variable to be more descriptive based on what you saw in help().
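One possible walk-through of these steps, as a sketch only. It assumes the built-in mtcars data set as the choice (any data.frame would do); cars_df and WeightLbs are hypothetical names.

```r
library(dplyr)   # dplyr is one of the tidyverse packages

data(mtcars)     # built-in data set; help(mtcars) describes the variables
str(mtcars)
head(mtcars)

# wt is recorded in thousands of pounds, so convert it to pounds,
# and give mpg a more descriptive name
cars_df <- mtcars %>%
  mutate(WeightLbs = wt * 1000) %>%
  rename(MilesPerGallon = mpg)

mean(cars_df$MilesPerGallon)   # average of one variable
names(cars_df)
```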