Data Wrangling in the Tidyverse

Nick Huntington-Klein

26 February, 2023

Data Wrangling

  • Data wrangling is the process of going from raw data to something usable
  • It is what you will spend most of your time as a data person doing, so you better be good at it!
  • Is it part of data communications? Yes, although not really any moreso than your other data classes
  • But do they cover it enough? Never! So we do here.
  • You will thank me later.

Data Wrangling

This week:

  • The goal of data wrangling
  • How to think about data wrangling
  • Technical tips for data wrangling in R using the tidyverse package (which, importantly, contains the dplyr and tidyr packages inside, as well as ggplot2 which we’ll use for visualization)
  • This is adapted from my workshop on data wrangling in the tidyverse. Video from that is here, as well as versions with data.table (R package with different syntax, very good for big data sets and extra speed) and pandas in Python.

Tech note: the pipe

scales::percent(mean(mtcars$am, na.rm = TRUE), accuracy = .1) can be rewritten

mtcars %>%
  pull(am) %>%
  mean(na.rm = TRUE) %>%
  scales::percent(accuracy = .1)
  • Like a conveyer belt! Nice and easy.
  • pull() is a dplyr function that says “give me back this one variable instead of a data set” but in a pipe-friendly way, so mtcars %>% pull(am) is the same as mtcars$am or mtcars[['am']]

Limitations

  • We only have so much time! I won’t be going into great detail on the use of all the technical commands, but by the end of this you will know what’s out there and generally how it’s used
  • As with any computer skill, a teacher’s comparative advantage is in letting you know what’s out there. The real learning comes from practice and Googling. So take what you see here today, find yourself a project, and do it! It will be awful but you will learn an astounding amount by the end

The Core of Data Wrangling

  • Always look directly at your data so you know what it looks like
  • Always think about what you want your data to look like when you’re done
  • Think about how you can take information from where it is and put it where you want it to be
  • After every step, look directly at your data again to make sure it’s doing what you think it’s doing

I help a lot of people with their problems with data wrangling. Their issues are almost always not doing one of these four things, much more so than having trouble coding or anything like that

The Core of Data Wrangling

  • How can you “look at your data”?
  • Literally is one way - click on the data set, or do View() to look at it
  • Summary statistics tables: sumtable() or vtable(lush = TRUE) in vtable for example
  • Checking what values it takes: unique() or summary() on individual variables
  • Check if it takes a certain value ever: sum(data$variable == value)
  • Look for: What values are there, what the observations look like, presence of missing or unusable data, how the data is structured

The Stages of Data Wrangling

For the purposes of this class:

  • From data to tidy data
  • From tidy data to data for your analysis

What we Can and Can’t Cover

  • We could be at this for literally weeks and not cover everything
  • And you wouldn’t remember it anyway
  • I will cover the stuff we can use immediately and will use all the time
  • The rest will be reducing unknown unknowns
  • If you need to do something later, and we didn’t cover it, maybe come back and check these slides!!
  • The video series, linked on Canvas and here, will go into these things in more detail when you need them

From Data to Tidy Data

From Data to Tidy Data

  • Data is any time you have your records stored in some structured format
  • But there are many such structures! They could be across a bunch of different tables, or perhaps a spreadsheet with different variables stored randomly in different areas, or one table per observation
  • These structures can be great for looking up values. That’s why they are often used in business or other settings where you say “I wonder what the value of X is for person/day/etc. N”
  • They’re rarely good for doing analysis (calculating statistics, fitting models, making visualizations)
  • For that, we will aim to get ourselves tidy data

Tidy Data

In tidy data:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Tidy Data

The variables in tidy data come in two types:

  1. Identifying Variables/Keys are the columns you’d look at to locate a particular observation.
  2. Measures/Values are the actual data.

Which are they in this data?

Tidy Data

  • Person and Year are our identifying variables. The combination of person and year uniquely identifies a row in the data. Our “observation level” is person and year. There’s only one row with Person == “Chidi” and Year == 2018
  • Points and ShrimpConsumption are our measures. They are the things we have measured for each of our observations
  • Notice how there’s one row per observation (combination of Person and Year), and one column per variable
  • Also this table contains only variables that are at the Person-Year observation level. Variables at a different level (perhaps things that vary between Person but don’t change over Year) would go in a different table, although this last one is less important

Tidying Non-Tidy Data

  • So what might data look like when it’s not like this, and how can we get it this way?
  • Here’s one common example, a count table (not tidy!) where each column is a value, not a variable

Tidying Non-tidy Data

  • Here’s a simple and very common case, where we have a category in each column
  • This will mess us up if we, say, want to categorize by a “category” variable!

Tidying Non-tidy Data

There are many more elements to this, but I will cover two tools: pivot (briefly) and join (in more detail)

Pivot will be our first review in a flash here, just remember that it exists, come back later or watch the video when you actually need it

Tidying Non-Tidy Data

  • The first big tool in our tidying toolbox is the pivot
  • A pivot takes a single row with K columns and turns it into K rows with 1 column, using the identifying variables/keys to keep things lined up.
  • This can also be referred to as going from “wide” data to “long” data
  • Long to wide is also an option
  • In every statistics package, pivot functions are notoriously fiddly. Always read the help file, and do trial-and-error! Make sure it worked as intended.

Pivot

  • In the tidyverse we have the functions pivot_longer() and pivot_wider(). Here we want wide-to-long so we use pivot_longer()
  • This asks for:
  • data (the data set you’re working with, also the first argument so we can pipe to it)
  • cols (the columns to pivot) - it will assume anything not named here are the keys
  • names_to (the name of the variable to store which column a given row came from, here “week”)
  • values_to (the name of the vairable to store the value in)
  • Many other options (see help(pivot_longer))

Pivot

# pivot is pretty good at just figuring stuff out for us, at least in this simple case!
multicat_long <- multicat %>%
  pivot_longer(cols = starts_with('Sales_'), # tidyselect functions help us pick columns based on name patterns
               names_to = 'Category', values_to = 'Sales')
multicat_long
## # A tibble: 8 × 3
##   Date       Category       Sales
##   <date>     <chr>          <dbl>
## 1 2000-01-01 Sales_Goods        4
## 2 2000-01-01 Sales_Services     4
## 3 2000-02-01 Sales_Goods        5
## 4 2000-02-01 Sales_Services     9
## 5 2000-03-01 Sales_Goods        1
## 6 2000-03-01 Sales_Services     2
## 7 2000-04-01 Sales_Goods        2
## 8 2000-04-01 Sales_Services     3

Variables Stored as Rows

  • Here we have tax form data where each variable is a row, but we have multiple tables For this one we can use pivot_wider(), and then combine multiple individuals with bind_rows()

Variables Stored as Rows

  • pivot_wider() needs:
  • data (first argument, the data we’re working with)
  • id_cols (the columns that give us the key - what should it be here?)
  • names_from (the column containing what will be the new variable names)
  • values_from (the column containing the new values)
  • Many others! See help(pivot_wider)

Variables Stored as Rows

taxdata %>%
  pivot_wider(names_from = 'TaxFormRow',
              values_from = 'Value')
## # A tibble: 1 × 4
##   Person        Income Deductible AGI  
##   <chr>         <chr>  <chr>      <chr>
## 1 James Acaster 112341 24000      88341

(note that the variables are all stored as character variables not numbers - that’s because the “person” row is a character, which forced the rest to be too. we’ll go through how to fix that later)

Merging Data

  • Commonly, you will need to link two datasets together based on some shared keys
  • For example, if one dataset has the variables “Person”, “Year”, and “Income” and the other has “Person” and “Birthplace”

Merging Data

That was person_year_data. And now for person_data:

Merging Data

  • The dplyr join family of functions will do this (see help(join)). The different varieties just determine what to do with rows you don’t find a match for. left_join() keeps non-matching rows from the first dataset but not the second, right_join() from the second not the first, full_join() from both, inner_join() from neither, and anti_join() JUST keeps non-matches

Merging Data

person_year_data %>%
  left_join(person_data, by = 'Person')
##    Person Year Income      Birthplace
## 1  Ramesh 2014  81314         Crawley
## 2  Ramesh 2015  82155         Crawley
## 3 Whitney 2014 131292 Washington D.C.
## 4 Whitney 2015 141262 Washington D.C.
## 5   David 2014 102452            <NA>
## 6   David 2015 105133            <NA>
person_year_data %>%
  right_join(person_data, by = 'Person')
##    Person Year Income      Birthplace
## 1  Ramesh 2014  81314         Crawley
## 2  Ramesh 2015  82155         Crawley
## 3 Whitney 2014 131292 Washington D.C.
## 4 Whitney 2015 141262 Washington D.C.

Merging Data

  • Things work great if the list of variables in by is the exact observation level in at least one of the two data sets
  • But if there are multiple observations per combination of by variables in both, that’s a problem! It will create all the potential matches, which may not be what you want:
a <- tibble(Name = c('A','A','B','C'), Year = c(2014, 2015, 2014, 2014), Value = 1:4)
b <- tibble(Name = c('A','A','B','C','C'), Characteristic = c('Up','Down','Up','Left','Right'))
a %>% left_join(b, by = 'Name')
## # A tibble: 7 × 4
##   Name   Year Value Characteristic
##   <chr> <dbl> <int> <chr>         
## 1 A      2014     1 Up            
## 2 A      2014     1 Down          
## 3 A      2015     2 Up            
## 4 A      2015     2 Down          
## 5 B      2014     3 Up            
## 6 C      2014     4 Left          
## 7 C      2014     4 Right

Merging Data

  • This is why it’s super important to always know the observation level of your data. You can check it by seeing if there are any duplicate rows among what you think are your key variables: if we think that Person is the key for data set a, then a %>% select(Person) %>% duplicated() %>% max() will return TRUE, showing us we’re wrong
  • At that point you can figure out how you want to proceed - drop observations so it’s the observation level in one? Accept the multi-match? Pick only one of the multi-matches?

Try It!

Join data1 with data2, maintaining both categories

data1 <- tibble(Month = 1:4, MonthName = c('Jan','Feb','Mar','Apr'),
                Sales = c(6,8,1,2), RD = c(4,1,2,4)) %>%
  pivot_longer(cols = c(Sales, RD), names_to = 'Department')
data2 <- tibble(Department = 'Sales', Director = 'Lavy')

Try It!

data1 %>%
  left_join(data2) # full_join would also work to retain RD despite no match
## # A tibble: 8 × 5
##   Month MonthName Department value Director
##   <int> <chr>     <chr>      <dbl> <chr>   
## 1     1 Jan       Sales          6 Lavy    
## 2     1 Jan       RD             4 <NA>    
## 3     2 Feb       Sales          8 Lavy    
## 4     2 Feb       RD             1 <NA>    
## 5     3 Mar       Sales          1 Lavy    
## 6     3 Mar       RD             2 <NA>    
## 7     4 Apr       Sales          2 Lavy    
## 8     4 Apr       RD             4 <NA>

From Tidy Data to Your Analysis

From Tidy Data to Your Analysis

  • Okay! We now have, hopefully, a nice tidy data set with one column per variable, one row per observation, we know what the observation level is!
  • That doesn’t mean our data is ready to go! We likely have plenty of cleaning and manipulation to go before we are ready for analysis
  • We will be doing this mostly with dplyr
  • You’ll (a) definitely want to know this, and (b) will almost certainly find it easier to do with these dplyr tools than with something else you Google

dplyr

  • dplyr uses a small set of “verbs” to very flexibly do all kinds of data cleaning and manipulation
  • The primary verbs are: filter(), select(), arrange(), mutate(), group_by(), and summarize().
  • Other important functions in dplyr: pull() (which we covered), case_when(), rename(), and slice()
  • See the dplyr cheatsheet

Subsetting Data

filter()

  • filter() limits the data to the observations that fulfill a certain logical condition. It picks rows.
  • For example, Income > 100000 is TRUE for everyone with income above 100000, and FALSE otherwise. filter(data, Income > 100000) would return just the rows of data that have Income > 100000
person_year_data %>%
  left_join(person_data, by = 'Person') %>%
  filter(Income > 100000)
##    Person Year Income      Birthplace
## 1 Whitney 2014 131292 Washington D.C.
## 2 Whitney 2015 141262 Washington D.C.
## 3   David 2014 102452            <NA>
## 4   David 2015 105133            <NA>

Logical Conditions

  • A lot of programming in general is based on writing logical conditions that check whether something is true
  • In R, if the condition is true, it returns TRUE, which turns into 1 if you do a calculation with it. If false, it returns FALSE, which turns into 0. (tip: ifelse() is rarely what you want, and ifelse(condition, TRUE, FALSE) is redundant)

Logical Conditions Tips

Handy tools for constructing logical conditions:

a > b, a >= b, a < b, a <= b, a == b, or a != b to compare two numbers and check if a is above, above-or-equal, below, below-or-equal, equal (note == to check equality, not =), or not equal

a %in% c(b, c, d, e, f) is SUPER HANDY and checks whether a is any of the values b, c, d, e, or f. Works for text too!

Reverse conditions with !, or chain together with & (and) or | (or)

slice()

  • Like filter(), slice() picks rows. The difference is that filter() picks them by logical condition, while slice() does it by row index number
  • slice(1:5) picks the first five rows, slice(-103) drops the 103rd row, etc.

select()

  • select() gives you back just a subset of the columns. It picks columns
  • It can do this by name or by column number
  • Use - to not pick certain columns

If our data has the columns “Person”, “Year”, and “Income”, then all of these do the same thing:

no_income <- person_year_data %>% select(Person, Year)
no_income <- person_year_data %>% select('Person','Year')
no_income <- person_year_data %>% select(1:2)
no_income <- person_year_data %>% select(-Income)
print(no_income)
##    Person Year
## 1  Ramesh 2014
## 2  Ramesh 2015
## 3 Whitney 2014
## 4 Whitney 2015
## 5   David 2014
## 6   David 2015

arrange()

  • arrange() sorts the data. That’s it! Give it the column names and it will sort the data by those columns.
  • Can be handy with slice(). arrange(-Income) %>% slice(1:5) gives you the five rows with the highest income, for example
  • There are also some data manipulation tricks that rely on the position of the data
person_year_data %>%
  arrange(Person, Year)
##    Person Year Income
## 1   David 2014 102452
## 2   David 2015 105133
## 3  Ramesh 2014  81314
## 4  Ramesh 2015  82155
## 5 Whitney 2014 131292
## 6 Whitney 2015 141262

Try It!

Load the construction data and:

  1. select() the Year, Month, and West columns
  2. filter() to just the months of July, August, and September (use %in%)
  3. arrange() by West in descending order (hint: -West)
  4. slice() to pick the top two rows
data("construction")

Try It!

construction %>%
  select(Year, Month, West) %>%
  filter(Month %in% c('July','August','September')) %>%
  arrange(-West) %>%
  slice(1:2)
## # A tibble: 2 × 3
##    Year Month      West
##   <dbl> <chr>     <dbl>
## 1  2018 July        310
## 2  2018 September   296

Creating Variables

mutate()

  • mutate() assigns columns/variables, i.e. you can create variables with it (note also its sibling transmute() which does the same thing and then drops any variables you don’t explicitly specify in the function)
  • You can assign multiple variables in the same mutate() call, separated by commas (,)
person_year_data %>%
  mutate(NextYear = Year + 1,
         Above100k = Income > 100000)
##    Person Year Income NextYear Above100k
## 1  Ramesh 2014  81314     2015     FALSE
## 2  Ramesh 2015  82155     2016     FALSE
## 3 Whitney 2014 131292     2015      TRUE
## 4 Whitney 2015 141262     2016      TRUE
## 5   David 2014 102452     2015      TRUE
## 6   David 2015 105133     2016      TRUE

case_when()

  • A function that comes in handy a lot when using mutate to create a categorical variable is case_when(), which is sort of like ifelse() except it can cleanly handle way more than one condition
  • Provide case_when() with a series of if ~ then conditions, separated by commas, and it will go through the ifs one by one for each observation until it finds a fitting one.
  • As soon as it finds one, it stops looking, so you can assume anyone that satisfied an earlier condition doesn’t count any more. Also, you can have the last if be TRUE to give a value for anyone who hasn’t been caught yet

case_when()

person_year_data %>%
  mutate(IncomeBracket = case_when(
    Income <= 50000 ~ 'Under 50k',
    Income > 50000 & Income <= 100000 ~ '50-100k',
    Income > 100000 & Income < 120000 ~ '100-120k',
    TRUE ~ 'Above 120k'
  ))
##    Person Year Income IncomeBracket
## 1  Ramesh 2014  81314       50-100k
## 2  Ramesh 2015  82155       50-100k
## 3 Whitney 2014 131292    Above 120k
## 4 Whitney 2015 141262    Above 120k
## 5   David 2014 102452      100-120k
## 6   David 2015 105133      100-120k

group_by()

  • group_by() turns the dataset into a grouped data set, splitting each combination of the grouping variables
  • Calculations like mutate() or (up next) summarize() or (if you want to get fancy) group_map() then process the data separately by each group
person_year_data %>% group_by(Person) %>%
  mutate(Income_Relative_to_Mean = Income - mean(Income))
## # A tibble: 6 × 4
## # Groups:   Person [3]
##   Person   Year Income Income_Relative_to_Mean
##   <chr>   <dbl>  <dbl>                   <dbl>
## 1 Ramesh   2014  81314                   -420.
## 2 Ramesh   2015  82155                    420.
## 3 Whitney  2014 131292                  -4985 
## 4 Whitney  2015 141262                   4985 
## 5 David    2014 102452                  -1340.
## 6 David    2015 105133                   1340.

group_by()

  • It will maintain this grouping structure until you re-group_by() it, or ungroup() it, or summarize() it (which removes one of the grouping variables)
  • How is this useful in preparing data?
  • Remember, we want to look at where information is and think about how we can get it where we need it to be
  • group_by() helps us move information from one row to another in a key variable - otherwise a difficult move!
  • It can also let us change the observation level with summarize()
  • Tip: n() gives the number of rows in the group - handy! and row_number() gives the row number within its group of that observation

summarize()

  • summarize() changes the observation level to a broader level
  • It returns only one row per group (or one row total if the data is ungrouped)
  • So now your keys are whatever you gave to group_by()
person_year_data %>%
  group_by(Person) %>%
  summarize(Mean_Income = mean(Income),
            Years_Tracked = n())
## # A tibble: 3 × 3
##   Person  Mean_Income Years_Tracked
##   <chr>         <dbl>         <int>
## 1 David       103792.             2
## 2 Ramesh       81734.             2
## 3 Whitney     136277              2

summarize()

  • group_by() %>% summarize() is the solution to many, many, many of your problems in data viz. If you’re having a hard time doing something, ask yourself if group_by() %>% summarize() will do it for you
  • To work right, summarize() needs to be given calculations to perform that return a single value. summarize(mean_x = x) won’t take the mean of x, it’ll just return the original data
  • Protip in general: the mean of a binary variable is the proportion!

rename()

  • Lastly, and really only for aesthetic purposes, we have rename() for renaming your variables
  • Often nice for making names manageable, or making them show up nicely in visualization (or especially table) output
  • If your variable names have spaces or weird characters, be sure to put backticks around them
person_year_data %>%
  rename(Name = Person, `Annual Income` = Income)
##      Name Year Annual Income
## 1  Ramesh 2014         81314
## 2  Ramesh 2015         82155
## 3 Whitney 2014        131292
## 4 Whitney 2015        141262
## 5   David 2014        102452
## 6   David 2015        105133

Try It!

Using the same construction data from before (may have to reload it if you overwrote it before)

  1. mutate() a new variable avg which adds Northeast, Midwest, South, and West, and then divides by 4
  2. rename() that variable to Price Average (note the space in the name - backticks!)
  3. mutate() a variable firsthalf that’s “First half” for the first six months of the year and “Second half” for the second half (hint: case_when() and row_number())
  4. group_by() the firsthalf variable, and then summarize() to get the mean of Price Average by groups

Try It!

construction %>%
  mutate(avg = (Northeast + Midwest + South + West)/4) %>%
  rename(`Price Average` = avg) %>%
  mutate(firsthalf = case_when(row_number() <= 6 ~ "First Half", 
                               row_number() > 6 ~ "Second Half")) %>%
  group_by(firsthalf) %>%
  summarize(`Price Average` = mean(`Price Average`))
## # A tibble: 2 × 2
##   firsthalf   `Price Average`
##   <chr>                 <dbl>
## 1 First Half             311.
## 2 Second Half            298.

group_by stuff

As I mentioned, students tend to forget that group_by() followed by summarize() or mutate() solves so many problems. Like what?

  • “Find the (average / maximum / sum / first / etc.) for each (group)”
  • Count the number of observations by group (group_by(X) %>% summarize(n = n()))
  • Calculate the change each period by group (arrange(date) %>% group_by(X) %>% mutate(change = Y - shift(Y)))
  • Really any time you want to do something repeatedly, once per group

Factors, Dates, and Strings

Manipulating Variables

  • Those are the base dplyr verbs we need to think about
  • They can be combined to do all sorts of things!
  • But important in using them is thinking about what kinds of variable manipulations we’re doing
  • That will feed into our mutate()s and our summarizes()
  • A lot of data cleaning is making an already-tidy variable usable!

Variable Types

Three common variable types that need to be manipulated regularly in application to our data visualization tasks:

  • Factors (categorical variables)
  • Dates
  • Strings

Variable Types

  • You can check the types of your variables by printing a tibble(), or is. and then the type, or doing str(data)
  • You can generally convert between types using as. and then the type
taxdata %>%
  pivot_wider(names_from = 'TaxFormRow',
              values_from = 'Value') %>%
  mutate(Person = as.factor(Person),
         Income = as.numeric(Income),
         Deductible = as.numeric(Deductible),
         AGI = as.numeric(AGI))
## # A tibble: 1 × 4
##   Person        Income Deductible   AGI
##   <fct>          <dbl>      <dbl> <dbl>
## 1 James Acaster 112341      24000 88341

Factors

  • The factor() function lets you specify these labels, and also specify the levels they go in - factors can be ordered!
  • When we get to graphing, setting the levels will determine the order in which things are graphed
tibble(Income = c('50k-100k','Less than 50k', '50k-100k', '100k+', '100k+')) %>%
  mutate(Income = factor(Income, levels = c('Less than 50k','50k-100k','100k+'))) %>%
  arrange(Income)
## # A tibble: 5 × 1
##   Income       
##   <fct>        
## 1 Less than 50k
## 2 50k-100k     
## 3 50k-100k     
## 4 100k+        
## 5 100k+

Factors

  • (make sure you spell those level labels correctly, by the way!)
  • Other neat tools for ordering factors:
  • reorder(x, Y) will reorder the factor x according to the values of Y. Very handy for, say, ordering a set of bars on a bar graph by the height of the bars
  • The entire forcats library, part of the tidyverse, has endless convenience functions for factors that I never bothered to learn much about

Dates

  • Dates are the scourge of data cleaners everywhere. They’re just plain hard to work with!
  • There are Date variables, Datetime variables, both of multiple different formats… eugh!
  • I strongly recommend using the lubridate package whenever working with dates. See the cheatsheet
  • If you’re dealing with dates, before you work hard, stop and ask “oh I should use lubridate, that will make it easy”

Convenient lubridate functions

  • Convert strings to dates by telling it the order the information comes in: ymd('2020-01-01'), ym('202001'), mdy('01-01-2020').
  • Combine that with paste0() (we’ll get there with strings) to make a date out of multiple columns: mutate(date = mdy(paste0(MONTH,DAY,YEAR)))
  • Get information about the date back out with year(date), quarter(date), month(date), day(date), etc.
  • Label months with built-in R month names: month.abb[month(date)] or month.name[month(date)]

Convenient lubriate functions

  • Move around in time by adding or subtracting days/months/years: ymd('2020-01-01') + years(1), ymd('2020-01-01') + months(3), ymd('2020-01-01') - days(7)
  • “Round down” the dates using floor_date(). floor_date(ymd('2021-05-15'), 'month') gives the first of May, 2021
  • Note you can also work with week() of the year, or with date-time objects

Characters/string

  • Specified with '' or ""
  • Use paste0() to stick stuff together! paste0('h','ello', sep = '_') is ‘’h_ello’`
  • Note that '\n' can be used for line breaks
  • Pasting together annotations and graph titles is a great idea!
  • (btw you really don’t want your numbers to be strings!! Don’t put quotes around numbers!)

stringr

  • Just like forcats for categories and lubridate for dates, stringr (in the tidyverse as well) makes working with strings as easy as possible. See the cheatsheet
  • When working with textual data, first ask: hmm why don’t I try looking in stringr to see if there’s a relevant function?
  • Conveniently most relevant functions start with str_ so you can just type stringr::str_ and let RStudio’s autocomplete give some suggestions.

Common String Manipulation Needs

The next few slides we’re going to skip past, but they’re here to come back to in case you need:

  • A substring. str_sub('hello',2,4) is 'ell', word('hello hi',2) is 'hi'
  • Splitting strings in two with str_split() and separate()
  • Decluttering strings and removing whitespace with str_trim()/str_squish()
  • Fixing strings: str_replace('MSFT stock','MSFT','Microsoft') is 'Microsoft Stock'
  • Prettifying strings by making them title case, upper case, etc.

Getting Substrings

  • When working with things like nested IDs (for example, NAICS codes are six digits, but the first two and first four digits have their own meaning), you will commonly want to pick just a certain range of characters
  • str_sub(string, start, end) will do this. str_sub('hello', 2, 4) is 'ell'
  • Note negative values read from end of string. str_sub('hello', -1) is 'o'

Getting Substrings

  • Another way to pick out parts of a string is to get word()s from them
  • word('Sales Department', 1) will give back just 'Sales'

Strings

  • Lots of data will try to stick multiple pieces of information in a single cell, so you need to split it out!
  • Generically, str_split() will do this. str_split('a,b', ',')[[1]] is c('a','b')
  • Often in already-tidy data, you want separate() from tidyr. Make sure you list enough new into columns to get everything!
tibble(category = c('Sales,Marketing','H&R,Marketing')) %>%
  separate(category, into = c('Category1', 'Category2'), ',')
## # A tibble: 2 × 2
##   Category1 Category2
##   <chr>     <chr>    
## 1 Sales     Marketing
## 2 H&R       Marketing

Cleaning Strings

  • Strings sometimes come with unwelcome extras! Garbage or extra whitespace at the beginning or end, or badly-used characters
  • str_trim() removes beginning/end whitespace, str_squish() removes additional whitespace from the middle too. str_trim(' hi hello ') is 'hi hello'.

Cleaning Strings

  • str_replace and str_replace_all() are often handy for eliminating (or fixing) unwanted characters
str_replace('My name is Biarn', 'Biarn', 'Brian')
## [1] "My name is Brian"
tibble(number = c('1,000', '2,003,124')) %>%
  mutate(number = number %>% str_replace_all(',', '') %>% as.numeric())
## # A tibble: 2 × 1
##    number
##     <dbl>
## 1    1000
## 2 2003124

Pretty Strings

  • Stringr has a few functions for changing the case of strings, ideally to make them look nicer
  • str_to_upper() and str_to_lower() can make all-caps or all-lowercase
  • str_to_title() is very nice for making something title-case, Like This
  • str_to_sentence() can easily convert something into sentence case, Like this.

Detecting Patterns in Strings

  • Often we want to do something a bit more complex. Unfortunately, this requires we dip our toes into the bottomless well that is regular expressions
  • Regular expressions are ways of describing patterns in strings so that the computer can recognize them. Technically this is what we did with str_replace_all(',','') - ',' is a regular expression saying “look for a comma”
  • There are a lot of options here. See the guide
  • Common: [0-9] to look for a digit, [a-zA-Z] for letters, * to repeat until you see the next thing… hard to condense here. Read the guide.

Detecting Patterns in Strings

  • For example, some companies are publicly listed and we want to indicate that but not keep the ticker. separate() won’t do it here, not easily!
  • On the next page we’ll use the regular expression '\\([A-Z].*\\)'
  • '\\([A-Z].*\\)' says “look for a (” (note the \\ to treat the usually-special ( character as an actual character), then “Look for a capital letter [A-Z]”, then “keep looking for capital letters .*”, then “look for a )”

Detecting Patterns in Strings

tibble(name = c('Amazon (AMZN) Holdings','Cargill Corp. (cool place!)')) %>%
  mutate(publicly_listed = str_detect(name, '\\([A-Z].*\\)'),
         name = str_replace_all(name, '\\([A-Z].*\\)', ''))
## # A tibble: 2 × 2
##   name                        publicly_listed
##   <chr>                       <lgl>          
## 1 Amazon  Holdings            TRUE           
## 2 Cargill Corp. (cool place!) FALSE

For Example

  • Let’s remove the “county” from a list of county names
data("fips_to_names", package = 'SafeGraphR')

# Always check if the data is what you think! Do they all end in "County"?
# word(n) pulls the nth "word", -1 starts from the end
fips_to_names %>% pull(countyname) %>% word(-1) %>% table()
## .
##                          1                         10 
##                          5                          4 
##                         11                         12 
##                          4                          3 
##                         13                         14 
##                          3                          3 
##                         15                         16 
##                          3                          3 
##                         17                         18 
##                          3                          3 
##                         19                          2 
##                          2                          5 
##                         20                         21 
##                          1                          1 
##                         22                         23 
##                          1                          1 
##                          3                          4 
##                          5                          5 
##                          5                          6 
##                          5                          5 
##                          7                          8 
##                          4                          4 
##                          9                    Abitibi 
##                          4                          1 
##              Abitibi-Ouest                      Acton 
##                          1                          1 
##                  Addington          Alberni-Clayoquot 
##                          1                          1 
##                     Albert                     Algoma 
##                          1                          1 
##                  Annapolis                 Antigonish 
##                          1                          1 
##            Antoine-Labelle                 Appalaches 
##                          1                          1 
##                       Area                 Argenteuil 
##                         10                          1 
##                 Arthabaska                    Avignon 
##                          1                          1 
##                     Baffin                    Basques 
##                          1                          1 
##                        Bay            Beauce-Sartigan 
##                          1                          1 
##      Beauharnois-Salaberry                  Bécancour 
##                          1                          1 
##                Bellechasse                 Blainville 
##                          1                          1 
##                Bonaventure                    Borough 
##                          1                         17 
##                   Boundary                      Brant 
##                          1                          1 
##                     Breton           Brome-Missisquoi 
##                          1                          1 
##                      Bruce            Bulkley-Nechako 
##                          1                          1 
##                    Capital                    Cariboo 
##                          1                          1 
##                   Carleton                 Charlevoix 
##                          1                          1 
##             Charlevoix-Est                  Charlotte 
##                          1                          2 
##               Chatham-Kent                       city 
##                          1                         40 
##                       City                      Coast 
##                          1                          2 
##                  Coaticook                   Cochrane 
##                          1                          1 
##                 Colchester    Collines-de-l'Outaouais 
##                          1                          1 
##                   Columbia           Columbia-Shuswap 
##                          1                          1 
##            Côte-de-Beaupré              Côte-de-Gaspé 
##                          1                          1 
##                     County                 Cumberland 
##                       3007                          1 
##                   D'Autray             Deux-Montagnes 
##                          1                          1 
##                      Digby             Domaine-du-Roy 
##                          1                          1 
##                   Drummond                   Dufferin 
##                          1                          1 
##                     Durham                     Edward 
##                          1                          1 
##                      Elgin                      Essex 
##                          1                          1 
##                  Etchemins               Francheville 
##                          1                          1 
##                  Frontenac                   Gatineau 
##                          1                          1 
##                     George                  Glengarry 
##                          1                          1 
##                 Gloucester     Golfe-du-Saint-Laurent 
##                          1                          1 
##                     Granit                  Grenville 
##                          1                          1 
##                       Grey                Guysborough 
##                          1                          1 
##          Haldimand-Norfolk                 Haliburton 
##                          1                          1 
##                    Halifax                     Halton 
##                          1                          1 
##                   Hamilton                      Hants 
##                          1                          1 
##                   Hastings             Haut-Richelieu 
##                          1                          1 
##        Haut-Saint-François         Haut-Saint-Laurent 
##                          1                          1 
##            Haute-Côte-Nord             Haute-Gaspésie 
##                          1                          1 
##              Haute-Yamaska                      Huron 
##                          1                          1 
##       Îles-de-la-Madeleine                  Inverness 
##                          1                          1 
##            Jacques-Cartier     Jardins-de-Napierville 
##                          1                          1 
##                       John                   Joliette 
##                          1                          1 
##                 Kamouraska                   Keewatin 
##                          1                          1 
##                     Kenora                       Kent 
##                          1                          1 
##                      Kings                  Kitikmeot 
##                          3                          1 
##            Kitimat-Stikine                   Kootenay 
##                          1                          2 
##               L'Assomption                   L'Érable 
##                          1                          1 
##            L'Île-d'Orléans                    L'Islet 
##                          1                          1 
##         Lac-Saint-Jean-Est                      Lakes 
##                          1                          1 
##                    Lambton                     Lanark 
##                          1                          1 
##                Laurentides                      Laval 
##                          1                          1 
##                      Lévis                  Longueuil 
##                          1                          1 
##                 Lotbinière                  Lunenburg 
##                          1                          1 
##                  Madawaska                Manicouagan 
##                          1                          1 
##                 Manitoulin      Marguerite-D'Youville 
##                          1                          1 
##          Maria-Chapdelaine                 Maskinongé 
##                          1                          1 
##                Maskoutains                     Matane 
##                          1                          1 
##                  Matapédia                  Matawinie 
##                          1                          1 
##                    Mékinac               Memphrémagog 
##                          1                          1 
##                  Middlesex                    Mirabel 
##                          1                          1 
##                      Mitis                   Montcalm 
##                          1                          1 
##                  Montmagny                   Montréal 
##                          1                          1 
##                    Moulins               Municipality 
##                          1                          2 
##                    Muskoka                    Nanaimo 
##                          1                          1 
##                    Niagara            Nicolet-Yamaska 
##                          1                          1 
##                  Nipissing             Nord-du-Québec 
##                          1                          1 
##             Northumberland            Nouvelle-Beauce 
##                          2                          1 
##                   Okanagan       Okanagan-Similkameen 
##                          2                          1 
##                     Ottawa                     Oxford 
##                          1                          1 
##                   Papineau                     Parish 
##                          1                         64 
##             Pays-d'en-Haut                       Peel 
##                          1                          1 
##                      Perth               Peterborough 
##                          1                          1 
##                     Pictou                    Pontiac 
##                          1                          1 
##                   Portneuf                     Prince 
##                          1                          1 
##                     Québec                     Queens 
##                          1                          3 
##                    Renfrew                Restigouche 
##                          1                          1 
##                   Richmond          Rimouski-Neigette 
##                          1                          1 
##                      River            Rivière-du-Loup 
##                          3                          1 
##            Rivière-du-Nord              Robert-Cliche 
##                          1                          1 
##               Rocher-Percé                    Rockies 
##                          1                          1 
##                 Roussillon                   Rouville 
##                          1                          1 
##              Rouyn-Noranda                    Russell 
##                          1                          1 
##      Saguenay-et-son-Fjord                     Saurel 
##                          1                          1 
## Sept-Rivières--Caniapiscau                 Shawinigan 
##                          1                          1 
##                  Shelburne                 Sherbrooke 
##                          1                          1 
##                     Simcoe                      Sound 
##                          1                          1 
##                    Sources          Squamish-Lillooet 
##                          1                          1 
##                    Stikine                 Strathcona 
##                          1                          1 
##                    Sudbury                    Sunbury 
##                          2                          1 
##              Témiscamingue                Témiscouata 
##                          1                          1 
##            Thompson-Nicola                Timiskaming 
##                          1                          1 
##                    Toronto                      Tuque 
##                          1                          1 
##         Val-Saint-François             Vallée-de-l'Or 
##                          1                          1 
##      Vallée-de-la-Gatineau        Vallée-du-Richelieu 
##                          1                          1 
##                     Valley                  Vancouver 
##                          3                          1 
##        Vaudreuil-Soulanges                   Victoria 
##                          1                          2 
##                 Waddington                   Waterloo 
##                          1                          1 
##                 Wellington                Westmorland 
##                          1                          1 
##                   Yarmouth                       York 
##                          1                          2 
##                      Yukon 
##                          1

For Example

  • The ones we’d want to remove are probably “County” and “Parish”
  • Let’s use the easiest kind of regular expression: | to mean “or”:

Try It!

Load the gss_cat data. Then :

  • Use ym() to create a birthdate variable equal to June 1 of the year in the year column, minus the number of years in the age column (-years())
  • Use floor_date() to set it back from June 1 of that year to January 1
  • Set the levels of the race factor to Black, White, Other, in that order
  • Use table(gss_cat$rincome) to look at the values income takes

(page 1!)

Try It!

Then, if you’re going into the string stuff, create income_num equal to the first number:

  1. use word() to get the first word,
  2. str_sub() or str_replace_all() to get rid of the “$”, and
  3. as.numeric() to make the result a number

(you’ll get some NAs for the non-numbers, and let’s ignore the “Lt $1000” case we miss for now)

Try It!

data(gss_cat)
gss_cat %>%
  mutate(birthdate = ym(paste0(year, '06')) - years(age)) %>%
  mutate(birthdate = floor_date(birthdate, 'year')) %>%
  mutate(race = factor(race, levels = c('Black','White','Other'))) %>%
  mutate(income_num = rincome %>% word(1) %>% str_sub(2) %>% as.numeric())

Automation

Automation

  • Data cleaning is often very repetitive
  • You shouldn’t let it be!
  • Not just to save yourself work and tedium, but also because standardizing your process so you only have to write the code once both reduces errors and means that if you have to change something you only have to change it once

Breezing Past

We’ve already gone through a lot! Just remember: if you’re doing the same thing a bunch of times in a row, please don’t copy/paste or actually type it all out. Come back to these slides. That includes:

  • Doing the same summarize/mutate calculation to a lot of different variables (use across())
  • Reusing the same few lines of code (or big chunk of code) over and over in different parts of your analysis (writing functions)
  • Performing the same actions to many different objects or elements of a vector (purrr)

across()

  • If you have a lot of variables, cleaning them all can be a pain. Who wants to write out the same thing a million times, say to convert all those read-in-as-text variables to numeric?
  • Old versions of dplyr used “scoped” variants like mutate_at() or mutate_if(). As of dplyr 1.0.0, these have been deprecated in favor of across()
  • across() lets you use all the variable-selection tricks available in select(), like starts_with() or a:z or 1:5, but then lets you apply functions to each of them in mutate() or summarize()
  • similarly rowwise() and c_across() lets you do stuff like “add up a bunch of columns”

across()

  • starts_with('price_growth') is the same here as 4:5 or c(price_growth_since_march_4, price_growth_daily)
stockgrowth <- tibble(ticker = c('AMZN','AMZN', 'AMZN', 'WMT', 'WMT','WMT'),
       date = as.Date(rep(c('2020-03-04','2020-03-05','2020-03-06'), 2)),
       stock_price = c(103,103.4,107,85.2, 86.3, 85.6)) %>%
  arrange(ticker, date) %>% group_by(ticker) %>%
  mutate(price_growth_since_march_4 = stock_price/first(stock_price) - 1,
         price_growth_daily = stock_price/lag(stock_price, 1) - 1) 
stockgrowth %>%
  mutate(across(starts_with('price_growth'), function(x) x*10000)) # Convert to basis points
## # A tibble: 6 × 5
## # Groups:   ticker [2]
##   ticker date       stock_price price_growth_since_march_4 price_growth_daily
##   <chr>  <date>           <dbl>                      <dbl>              <dbl>
## 1 AMZN   2020-03-04       103                          0                 NA  
## 2 AMZN   2020-03-05       103.                        38.8               38.8
## 3 AMZN   2020-03-06       107                        388.               348. 
## 4 WMT    2020-03-04        85.2                        0                 NA  
## 5 WMT    2020-03-05        86.3                      129.               129. 
## 6 WMT    2020-03-06        85.6                       46.9              -81.1

Writing Functions

  • Generally, if you’re going to do the same thing more than once, you’re probably better off writing a function
  • Reduces errors, saves time, makes code reusable later!
function_name <- function(argument1 = default1, argument2 = default2, etc.) {
  some code
  result <- more code
  return(result)
  # (or just do result by itself - the last object printed will be automatically returned if there's no return())
}

Writing Functions

The same basis-point conversion from before:

do_bps <- function(p) {
  bps <- p*10000
  return(bps)
}
stockgrowth %>%
  mutate(across(starts_with('price_growth'), do_bps)) # Convert to basis points
## # A tibble: 6 × 5
## # Groups:   ticker [2]
##   ticker date       stock_price price_growth_since_march_4 price_growth_daily
##   <chr>  <date>           <dbl>                      <dbl>              <dbl>
## 1 AMZN   2020-03-04       103                          0                 NA  
## 2 AMZN   2020-03-05       103.                        38.8               38.8
## 3 AMZN   2020-03-06       107                        388.               348. 
## 4 WMT    2020-03-04        85.2                        0                 NA  
## 5 WMT    2020-03-05        86.3                      129.               129. 
## 6 WMT    2020-03-06        85.6                       46.9              -81.1

purrr

  • One good way to apply functions iteratively (yours or not) is with the map() functions in purrr
  • map() usually generates a list(), map_dbl() a numeric vector, map_chr() a character vector, map_df() a tibble()
  • It iterates through a list, data.frame/tibble (which are technically lists, or vector, and then applies a function to each of the elements
person_year_data %>%
  map_chr(class)
##      Person        Year      Income 
## "character"   "numeric"   "numeric"

purrr

  • Or looping more generally for diagnostic or wrangling purposes. Perhaps you have a summary_profile() function you’ve made, and want to check each state’s data to see if its data looks right. You could do
data %>% pull(state) %>% unique() %>% map(summary_profile)

purrr

  • You can use it generally in place of a for() loop
my_function <- function(i) {
  print(i+1)
}
# Print the numbers 1:10
1:10 %>% map(my_function)

Finishing Up, and an Example!

Some Final Notes

  • Importing and exporting data is way easier than it used to be, with rio
  • import() for any filename will import it (watch your working directory!)
  • import_list(), given a list of filenames or a single Excel workbook with multiple sheets, will import them all to a list (access each one with [[1]], [[2]], etc., or if they’re row-bindable, just stack them all on top of each other into one data set with rbind = TRUE)
  • Then export() to save when you’re done
  • Note: import() can get a file that’s currently inside a .zip to save space, or export() to .zip

Data File Types

  • .Rdata: very compact, and flexible (can contain more than just one data set), but specific to R and can’t be used most other places
  • .csv: the most general and can be used by anything, but large filesizes and you lose some data detail (like, factors don’t carry over)
  • .parquet: very compact and also fairly general, but also new and not everyone will know what to do with it

Some Final Notes

  • Also, please, please, please DOCUMENT YOUR DATA
  • At the very least, keep a spreadsheet/ with a set of descriptions for each of your variables
  • Also look into the sjlabelled or haven packages to add variable labels directly to the data set itself
  • Once you have your variables labelled, vtable() in vtable can generate a documentation file for sharing

Let’s Try It!

  1. Write a function that takes x and a and returns x+a
  2. Load the mtcars data.
  3. Use mutate(across()) to add 1 to all columns ending in ‘p’, using the function you wrote
  4. Put steps 2 and 3 inside another function that takes a as an argument, change “add 1” to “add a” and returns the resulting data
  5. Use map_df to stack together data sets with a values from 1 to 5
  6. Save the result as a CSV with export() from rio

Let’s Try It!

library(rio)
x_plus_a <- function(x, a) { x + a }

add_to_mtcars <- function(a) {
  data(mtcars)
  adjusted <- mtcars %>% mutate(across(ends_with('p'), \(x) x_plus_a(x,a)))
  return(adjusted)
}

1:5 %>% map_df(add_to_mtcars) %>% export('myfile.csv')