This week:
scales::percent(mean(mtcars$am,
na.rm = TRUE), accuracy = .1) can be rewritten
pull() is a dplyr function that says
“give me back this one variable instead of a data set” but in a
pipe-friendly way, so mtcars %>% pull(am) is the same as
mtcars$am or mtcars[['am']]
I help a lot of people with their data wrangling problems. Their issues almost always come from skipping one of these four things, much more so than from having trouble coding or anything like that
View() to look at it
sumtable() or vtable(lush = TRUE) in vtable, for example
unique() or summary() on individual variables
sum(data$variable == value)
For the purposes of this class:
In tidy data:
The variables in tidy data come in two types:
Which are they in this data?
There are many more elements to this, but I will cover two tools:
pivot (briefly) and join (in more detail)
Pivot will be our first quick review. Just remember that it exists, and come back later or watch the video when you actually need it
pivot_longer() and pivot_wider(). Here we want wide-to-long so we use pivot_longer()
data (the data set you’re working with, also the first argument so we can pipe to it)
cols (the columns to pivot) - it will assume anything not named here are the keys
names_to (the name of the variable to store which column a given row came from, here “week”)
values_to (the name of the variable to store the value in)
help(pivot_longer)
# pivot is pretty good at just figuring stuff out for us, at least in this simple case!
multicat_long <- multicat %>%
  pivot_longer(cols = starts_with('Sales_'), # tidyselect functions help us pick columns based on name patterns
               names_to = 'Category', values_to = 'Sales')
multicat_long
## # A tibble: 8 × 3
##   Date       Category       Sales
##   <date>     <chr>          <dbl>
## 1 2000-01-01 Sales_Goods        4
## 2 2000-01-01 Sales_Services     4
## 3 2000-02-01 Sales_Goods        5
## 4 2000-02-01 Sales_Services     9
## 5 2000-03-01 Sales_Goods        1
## 6 2000-03-01 Sales_Services     2
## 7 2000-04-01 Sales_Goods        2
## 8 2000-04-01 Sales_Services     3
pivot_wider(), and then combine multiple individuals with bind_rows()
pivot_wider() needs:
data (first argument, the data we’re working with)
id_cols (the columns that give us the key - what should it be here?)
names_from (the column containing what will be the new variable names)
values_from (the column containing the new values)
help(pivot_wider)
## # A tibble: 1 × 4
##   Person        Income Deductible AGI  
##   <chr>         <chr>  <chr>      <chr>
## 1 James Acaster 112341 24000      88341
(Note that the variables are all stored as character variables, not numbers. That’s because the “Person” row is a character, which forced the rest to be too. We’ll go through how to fix that later.)
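A minimal sketch of a pivot_wider() call that gives a one-row table like the one above. The long-format taxdata built here is a hypothetical reconstruction from the printed result, not the original data set.

```r
library(tibble)
library(tidyr)

# Hypothetical long-format data, rebuilt from the printed output above
taxdata <- tibble(
  TaxFormRow = c('Person', 'Income', 'Deductible', 'AGI'),
  Value = c('James Acaster', '112341', '24000', '88341')
)

# Both columns are used up by names_from and values_from, so no key
# columns are left over and everything collapses into a single row
tax_wide <- taxdata %>%
  pivot_wider(names_from = TaxFormRow, values_from = Value)
tax_wide
```

Every new column comes out as character because Value had to be character to hold the name.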
That was person_year_data. And now for
person_data:
join family of functions
will do this (see help(join)). The different varieties just
determine what to do with rows you don’t find a match for.
left_join() keeps non-matching rows from the first dataset
but not the second, right_join() from the second not the
first, full_join() from both, inner_join()
from neither, and anti_join() keeps ONLY the rows of the first dataset that have no match
##    Person Year Income      Birthplace
## 1  Ramesh 2014  81314         Crawley
## 2  Ramesh 2015  82155         Crawley
## 3 Whitney 2014 131292 Washington D.C.
## 4 Whitney 2015 141262 Washington D.C.
## 5   David 2014 102452            <NA>
## 6   David 2015 105133            <NA>
##    Person Year Income      Birthplace
## 1  Ramesh 2014  81314         Crawley
## 2  Ramesh 2015  82155         Crawley
## 3 Whitney 2014 131292 Washington D.C.
## 4 Whitney 2015 141262 Washington D.C.
by is the exact observation level in at least one of the two data sets
by variables in both, that’s a problem! It will create all the potential matches, which may not be what you want:
a <- tibble(Name = c('A','A','B','C'), Year = c(2014, 2015, 2014, 2014), Value = 1:4)
b <- tibble(Name = c('A','A','B','C','C'), Characteristic = c('Up','Down','Up','Left','Right'))
a %>% left_join(b, by = 'Name')
## # A tibble: 7 × 4
##   Name   Year Value Characteristic
##   <chr> <dbl> <int> <chr>         
## 1 A      2014     1 Up            
## 2 A      2014     1 Down          
## 3 A      2015     2 Up            
## 4 A      2015     2 Down          
## 5 B      2014     3 Up            
## 6 C      2014     4 Left          
## 7 C      2014     4 Right
Name is the key for data set a, then
a %>% select(Name) %>% duplicated() %>% max()
will return 1 (that is, TRUE), showing us we’re wrong
Join data1 with data2, maintaining both categories
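One way this exercise could be approached is sketched here. The contents of data1 and data2 are hypothetical: they are rebuilt from the printed result of the exercise, and the sketch assumes data1 is already in long form.

```r
library(dplyr)

# Hypothetical inputs, rebuilt from the printed result of this exercise
data1 <- tibble(
  Month = rep(1:4, each = 2),
  MonthName = rep(c('Jan', 'Feb', 'Mar', 'Apr'), each = 2),
  Department = rep(c('Sales', 'RD'), times = 4),
  value = c(6, 4, 8, 1, 1, 2, 2, 4)
)
data2 <- tibble(Department = 'Sales', Director = 'Lavy')

# left_join() keeps the RD rows even though data2 has no match for them,
# so their Director comes back as NA
joined <- data1 %>% left_join(data2, by = 'Department')
joined
```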
## # A tibble: 8 × 5
##   Month MonthName Department value Director
##   <int> <chr>     <chr>      <dbl> <chr>   
## 1     1 Jan       Sales          6 Lavy    
## 2     1 Jan       RD             4 <NA>    
## 3     2 Feb       Sales          8 Lavy    
## 4     2 Feb       RD             1 <NA>    
## 5     3 Mar       Sales          1 Lavy    
## 6     3 Mar       RD             2 <NA>    
## 7     4 Apr       Sales          2 Lavy    
## 8     4 Apr       RD             4 <NA>
filter(), select(), arrange(), mutate(), group_by(), and summarize().
pull() (which we covered), case_when(), rename(), and slice()
filter() limits the data to the observations that fulfill a certain logical condition. It picks rows.
Income > 100000 is TRUE for everyone with income above 100000, and FALSE otherwise.
filter(data, Income > 100000) would return just the rows of data that have Income > 100000
##    Person Year Income      Birthplace
## 1 Whitney 2014 131292 Washington D.C.
## 2 Whitney 2015 141262 Washington D.C.
## 3   David 2014 102452            <NA>
## 4   David 2015 105133            <NA>
TRUE, which turns into 1 if you do a calculation with it. If false, it returns FALSE, which turns into 0. (Tip: ifelse() is rarely what you want, and ifelse(condition, TRUE, FALSE) is redundant.)
Handy tools for constructing logical conditions:
a > b, a >= b, a < b,
a <= b, a == b, or a != b to
compare two numbers and check if a is above,
above-or-equal, below, below-or-equal, equal (note == to
check equality, not =), or not equal
a %in% c(b, c, d, e, f) is SUPER HANDY and checks
whether a is any of the values b, c, d, e, or
f. Works for text too!
Reverse conditions with !, or chain together with
& (and) or | (or)
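The tools above can be tried out in a short sketch. The tibble name people is hypothetical; its values are the ones shown in the person_year_data printouts on these slides.

```r
library(dplyr)

# Hypothetical name, values taken from the person_year_data printouts
people <- tibble(
  Person = c('Ramesh', 'Ramesh', 'Whitney', 'Whitney', 'David', 'David'),
  Year = c(2014, 2015, 2014, 2015, 2014, 2015),
  Income = c(81314, 82155, 131292, 141262, 102452, 105133)
)

# A comparison gives one TRUE/FALSE per row; summing counts the TRUEs
n_above_100k <- sum(people$Income > 100000)

# %in% checks membership in a set of values, and works for text
ramesh_or_david <- people %>% filter(Person %in% c('Ramesh', 'David'))

# ! reverses a condition, & requires both sides to be TRUE
whitney_2015 <- people %>%
  filter(!(Person %in% c('Ramesh', 'David')) & Year == 2015)

# | requires at least one side to be TRUE
either <- people %>% filter(Income > 140000 | Year == 2014)
```

Note that the ! needs parentheses around the whole %in% condition it reverses.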
Like filter(), slice() picks rows. The difference is that filter() picks them by logical condition, while slice() does it by row index number
slice(1:5) picks the first five rows, slice(-103) drops the 103rd row, etc.
select() gives you back just a subset of the columns. It picks columns
Use - to not pick certain columns
If our data has the columns “Person”, “Year”, and “Income”, then all of these do the same thing:
no_income <- person_year_data %>% select(Person, Year)
no_income <- person_year_data %>% select('Person','Year')
no_income <- person_year_data %>% select(1:2)
no_income <- person_year_data %>% select(-Income)
print(no_income)
##    Person Year
## 1  Ramesh 2014
## 2  Ramesh 2015
## 3 Whitney 2014
## 4 Whitney 2015
## 5   David 2014
## 6   David 2015
arrange() sorts the data. That’s it! Give it the column names and it will sort the data by those columns.
slice(). arrange(-Income) %>% slice(1:5) gives you the five rows with the highest income, for example
##    Person Year Income
## 1   David 2014 102452
## 2   David 2015 105133
## 3  Ramesh 2014  81314
## 4  Ramesh 2015  82155
## 5 Whitney 2014 131292
## 6 Whitney 2015 141262
Load the construction data and:
select() the Year, Month, and West columns
filter() to just the months of July, August, and September (use %in%)
arrange() by West in descending order (hint: -West)
slice() to pick the top two rows
construction %>%
  select(Year, Month, West) %>%
  filter(Month %in% c('July','August','September')) %>%
  arrange(-West) %>%
  slice(1:2)
## # A tibble: 2 × 3
##    Year Month      West
##   <dbl> <chr>     <dbl>
## 1  2018 July        310
## 2  2018 September   296
mutate() assigns columns/variables, i.e. you can create variables with it (note also its sibling transmute(), which does the same thing and then drops any variables you don’t explicitly specify in the function)
You can assign multiple variables in the same mutate() call, separated by commas (,)
##    Person Year Income NextYear Above100k
## 1  Ramesh 2014  81314     2015     FALSE
## 2  Ramesh 2015  82155     2016     FALSE
## 3 Whitney 2014 131292     2015      TRUE
## 4 Whitney 2015 141262     2016      TRUE
## 5   David 2014 102452     2015      TRUE
## 6   David 2015 105133     2016      TRUE
case_when(), which is sort of like ifelse() except it can cleanly handle way more than one condition
case_when() with a series of if ~ then conditions, separated by commas, and it will go through the ifs one by one for each observation until it finds a fitting one.
if be TRUE to give a value for anyone who hasn’t been caught yet
person_year_data %>%
  mutate(IncomeBracket = case_when(
    Income <= 50000 ~ 'Under 50k',
    Income > 50000 & Income <= 100000 ~ '50-100k',
    Income > 100000 & Income < 120000 ~ '100-120k',
    TRUE ~ 'Above 120k'
  ))
##    Person Year Income IncomeBracket
## 1  Ramesh 2014  81314       50-100k
## 2  Ramesh 2015  82155       50-100k
## 3 Whitney 2014 131292    Above 120k
## 4 Whitney 2015 141262    Above 120k
## 5   David 2014 102452      100-120k
## 6   David 2015 105133      100-120k
group_by() turns the dataset into a grouped data set, splitting each combination of the grouping variables
mutate() or (up next) summarize() or (if you want to get fancy) group_map() then process the data separately by each group
## # A tibble: 6 × 4
## # Groups:   Person [3]
##   Person   Year Income Income_Relative_to_Mean
##   <chr>   <dbl>  <dbl>                   <dbl>
## 1 Ramesh   2014  81314                   -420.
## 2 Ramesh   2015  82155                    420.
## 3 Whitney  2014 131292                  -4985 
## 4 Whitney  2015 141262                   4985 
## 5 David    2014 102452                  -1340.
## 6 David    2015 105133                   1340.
group_by() it, or ungroup() it, or summarize() it (which removes one of the grouping variables)
group_by() helps us move information from one row to another in a key variable - otherwise a difficult move!
summarize()
n() gives the number of rows in the group - handy! And row_number() gives the row number within its group of that observation
summarize() changes the observation level to a broader level
group_by()
person_year_data %>%
  group_by(Person) %>%
  summarize(Mean_Income = mean(Income),
            Years_Tracked = n())
## # A tibble: 3 × 3
##   Person  Mean_Income Years_Tracked
##   <chr>         <dbl>         <int>
## 1 David       103792.             2
## 2 Ramesh       81734.             2
## 3 Whitney     136277              2
group_by() %>% summarize() is the solution to many, many, many of your problems in data viz. If you’re having a hard time doing something, ask yourself if group_by() %>% summarize() will do it for you
summarize() needs to be given calculations to perform that return a single value. summarize(mean_x = x) won’t take the mean of x, it’ll just return the original data
rename() for renaming your variables
##      Name Year Annual Income
## 1  Ramesh 2014         81314
## 2  Ramesh 2015         82155
## 3 Whitney 2014        131292
## 4 Whitney 2015        141262
## 5   David 2014        102452
## 6   David 2015        105133
Using the same construction data from before (may have to reload it if you overwrote it before)
mutate() a new variable avg which adds Northeast, Midwest, South, and West, and then divides by 4
rename() that variable to Price Average (note the space in the name - backticks!)
mutate() a variable firsthalf that’s “First Half” for the first six months of the year and “Second Half” for the second half (hint: case_when() and row_number())
group_by() the firsthalf variable, and then summarize() to get the mean of Price Average by groups
construction %>%
  mutate(avg = (Northeast + Midwest + South + West)/4) %>%
  rename(`Price Average` = avg) %>%
  mutate(firsthalf = case_when(row_number() <= 6 ~ "First Half", 
                               row_number() > 6 ~ "Second Half")) %>%
  group_by(firsthalf) %>%
  summarize(`Price Average` = mean(`Price Average`))
## # A tibble: 2 × 2
##   firsthalf   `Price Average`
##   <chr>                 <dbl>
## 1 First Half             311.
## 2 Second Half            298.
As I mentioned, students tend to forget that group_by() followed by summarize() or mutate() solves so many problems. Like what?
group_by(X) %>% summarize(n = n()))
arrange(date) %>% group_by(X) %>% mutate(change = Y - lag(Y)))
mutate()s and our summarize()s
Three common variable types that need to be manipulated regularly in our data visualization tasks:
tibble(), or is. and then the type, or doing str(data)
as. and then the type
taxdata %>%
  pivot_wider(names_from = 'TaxFormRow',
              values_from = 'Value') %>%
  mutate(Person = as.factor(Person),
         Income = as.numeric(Income),
         Deductible = as.numeric(Deductible),
         AGI = as.numeric(AGI))
## # A tibble: 1 × 4
##   Person        Income Deductible   AGI
##   <fct>          <dbl>      <dbl> <dbl>
## 1 James Acaster 112341      24000 88341
factor() function lets you specify these labels, and also specify the levels they go in - factors can be ordered!
levels will determine the order in which things are graphed
tibble(Income = c('50k-100k','Less than 50k', '50k-100k', '100k+', '100k+')) %>%
  mutate(Income = factor(Income, levels = c('Less than 50k','50k-100k','100k+'))) %>%
  arrange(Income)
## # A tibble: 5 × 1
##   Income       
##   <fct>        
## 1 Less than 50k
## 2 50k-100k     
## 3 50k-100k     
## 4 100k+        
## 5 100k+
reorder(x, Y) will reorder the factor x according to the values of Y. Very handy for, say, ordering a set of bars on a bar graph by the height of the bars
ymd('2020-01-01'), ym('202001'), mdy('01-01-2020').
paste0() (we’ll get there with strings) to make a date out of multiple columns: mutate(date = mdy(paste0(MONTH, '-', DAY, '-', YEAR)))
year(date), quarter(date), month(date), day(date), etc.
month.abb[month(date)] or month.name[month(date)]
ymd('2020-01-01') + years(1), ymd('2020-01-01') + months(3), ymd('2020-01-01') - days(7)
floor_date(). floor_date(ymd('2021-05-15'), 'month') gives the first of May, 2021
week() of the year, or with date-time objects
'' or ""
paste0() to stick stuff together! paste0('h','ello') is 'hello', and paste('h','ello', sep = '_') is 'h_ello'
'\n' can be used for line breaks
str_ so you can just type stringr::str_ and let RStudio’s autocomplete give some suggestions.
The next few slides we’re going to skip past, but they’re here to come back to in case you need:
str_sub('hello',2,4) is 'ell', word('hello hi',2) is 'hi'
str_split() and separate()
str_trim()/str_squish()
str_replace('MSFT stock','MSFT','Microsoft') is 'Microsoft stock'
str_sub(string, start, end) will do this. str_sub('hello', 2, 4) is 'ell'
str_sub('hello', -1) is 'o'
word()s from them
word('Sales Department', 1) will give back just 'Sales'
str_split() will do this. str_split('a,b', ',')[[1]] is c('a','b')
separate() from tidyr. Make sure you list enough new into columns to get everything!
tibble(category = c('Sales,Marketing','H&R,Marketing')) %>%
  separate(category, into = c('Category1', 'Category2'), ',')
## # A tibble: 2 × 2
##   Category1 Category2
##   <chr>     <chr>    
## 1 Sales     Marketing
## 2 H&R       Marketing
str_trim() removes beginning/end whitespace, str_squish() removes additional whitespace from the middle too. str_trim(' hi  hello ') is 'hi  hello'.
str_replace() and str_replace_all() are often handy for eliminating (or fixing) unwanted characters
## [1] "My name is Brian"
tibble(number = c('1,000', '2,003,124')) %>%
  mutate(number = number %>% str_replace_all(',', '') %>% as.numeric())
## # A tibble: 2 × 1
##    number
##     <dbl>
## 1    1000
## 2 2003124
str_to_upper() and str_to_lower() can make all-caps or all-lowercase
str_to_title() is very nice for making something title-case, Like This
str_to_sentence() can easily convert something into sentence case, Like this.
str_replace_all(',','') - ',' is a regular expression saying “look for a comma”
[0-9] to look for a digit, [a-zA-Z] for letters, * to repeat until you see the next thing… hard to condense here. Read the guide.
separate() won’t do it here, not easily!
'\\([A-Z].*\\)'
'\\([A-Z].*\\)' says “look for a (” (note the \\ to treat the usually-special ( character as an actual character), then “look for a capital letter [A-Z]”, then “keep matching any characters .*”, then “look for a )”
tibble(name = c('Amazon (AMZN) Holdings','Cargill Corp. (cool place!)')) %>%
  mutate(publicly_listed = str_detect(name, '\\([A-Z].*\\)'),
         name = str_replace_all(name, '\\([A-Z].*\\)', ''))
## # A tibble: 2 × 2
##   name                        publicly_listed
##   <chr>                       <lgl>          
## 1 Amazon  Holdings            TRUE           
## 2 Cargill Corp. (cool place!) FALSE
data("fips_to_names", package = 'SafeGraphR')
# Always check if the data is what you think! Do they all end in "County"?
# word(n) pulls the nth "word", -1 starts from the end
fips_to_names %>% pull(countyname) %>% word(-1) %>% table()
## .
##                          1                         10 
##                          5                          4 
##                         11                         12 
##                          4                          3 
##                         13                         14 
##                          3                          3 
##                         15                         16 
##                          3                          3 
##                         17                         18 
##                          3                          3 
##                         19                          2 
##                          2                          5 
##                         20                         21 
##                          1                          1 
##                         22                         23 
##                          1                          1 
##                          3                          4 
##                          5                          5 
##                          5                          6 
##                          5                          5 
##                          7                          8 
##                          4                          4 
##                          9                    Abitibi 
##                          4                          1 
##              Abitibi-Ouest                      Acton 
##                          1                          1 
##                  Addington          Alberni-Clayoquot 
##                          1                          1 
##                     Albert                     Algoma 
##                          1                          1 
##                  Annapolis                 Antigonish 
##                          1                          1 
##            Antoine-Labelle                 Appalaches 
##                          1                          1 
##                       Area                 Argenteuil 
##                         10                          1 
##                 Arthabaska                    Avignon 
##                          1                          1 
##                     Baffin                    Basques 
##                          1                          1 
##                        Bay            Beauce-Sartigan 
##                          1                          1 
##      Beauharnois-Salaberry                  Bécancour 
##                          1                          1 
##                Bellechasse                 Blainville 
##                          1                          1 
##                Bonaventure                    Borough 
##                          1                         17 
##                   Boundary                      Brant 
##                          1                          1 
##                     Breton           Brome-Missisquoi 
##                          1                          1 
##                      Bruce            Bulkley-Nechako 
##                          1                          1 
##                    Capital                    Cariboo 
##                          1                          1 
##                   Carleton                 Charlevoix 
##                          1                          1 
##             Charlevoix-Est                  Charlotte 
##                          1                          2 
##               Chatham-Kent                       city 
##                          1                         40 
##                       City                      Coast 
##                          1                          2 
##                  Coaticook                   Cochrane 
##                          1                          1 
##                 Colchester    Collines-de-l'Outaouais 
##                          1                          1 
##                   Columbia           Columbia-Shuswap 
##                          1                          1 
##            Côte-de-Beaupré              Côte-de-Gaspé 
##                          1                          1 
##                     County                 Cumberland 
##                       3007                          1 
##                   D'Autray             Deux-Montagnes 
##                          1                          1 
##                      Digby             Domaine-du-Roy 
##                          1                          1 
##                   Drummond                   Dufferin 
##                          1                          1 
##                     Durham                     Edward 
##                          1                          1 
##                      Elgin                      Essex 
##                          1                          1 
##                  Etchemins               Francheville 
##                          1                          1 
##                  Frontenac                   Gatineau 
##                          1                          1 
##                     George                  Glengarry 
##                          1                          1 
##                 Gloucester     Golfe-du-Saint-Laurent 
##                          1                          1 
##                     Granit                  Grenville 
##                          1                          1 
##                       Grey                Guysborough 
##                          1                          1 
##          Haldimand-Norfolk                 Haliburton 
##                          1                          1 
##                    Halifax                     Halton 
##                          1                          1 
##                   Hamilton                      Hants 
##                          1                          1 
##                   Hastings             Haut-Richelieu 
##                          1                          1 
##        Haut-Saint-François         Haut-Saint-Laurent 
##                          1                          1 
##            Haute-Côte-Nord             Haute-Gaspésie 
##                          1                          1 
##              Haute-Yamaska                      Huron 
##                          1                          1 
##       Îles-de-la-Madeleine                  Inverness 
##                          1                          1 
##            Jacques-Cartier     Jardins-de-Napierville 
##                          1                          1 
##                       John                   Joliette 
##                          1                          1 
##                 Kamouraska                   Keewatin 
##                          1                          1 
##                     Kenora                       Kent 
##                          1                          1 
##                      Kings                  Kitikmeot 
##                          3                          1 
##            Kitimat-Stikine                   Kootenay 
##                          1                          2 
##               L'Assomption                   L'Érable 
##                          1                          1 
##            L'Île-d'Orléans                    L'Islet 
##                          1                          1 
##         Lac-Saint-Jean-Est                      Lakes 
##                          1                          1 
##                    Lambton                     Lanark 
##                          1                          1 
##                Laurentides                      Laval 
##                          1                          1 
##                      Lévis                  Longueuil 
##                          1                          1 
##                 Lotbinière                  Lunenburg 
##                          1                          1 
##                  Madawaska                Manicouagan 
##                          1                          1 
##                 Manitoulin      Marguerite-D'Youville 
##                          1                          1 
##          Maria-Chapdelaine                 Maskinongé 
##                          1                          1 
##                Maskoutains                     Matane 
##                          1                          1 
##                  Matapédia                  Matawinie 
##                          1                          1 
##                    Mékinac               Memphrémagog 
##                          1                          1 
##                  Middlesex                    Mirabel 
##                          1                          1 
##                      Mitis                   Montcalm 
##                          1                          1 
##                  Montmagny                   Montréal 
##                          1                          1 
##                    Moulins               Municipality 
##                          1                          2 
##                    Muskoka                    Nanaimo 
##                          1                          1 
##                    Niagara            Nicolet-Yamaska 
##                          1                          1 
##                  Nipissing             Nord-du-Québec 
##                          1                          1 
##             Northumberland            Nouvelle-Beauce 
##                          2                          1 
##                   Okanagan       Okanagan-Similkameen 
##                          2                          1 
##                     Ottawa                     Oxford 
##                          1                          1 
##                   Papineau                     Parish 
##                          1                         64 
##             Pays-d'en-Haut                       Peel 
##                          1                          1 
##                      Perth               Peterborough 
##                          1                          1 
##                     Pictou                    Pontiac 
##                          1                          1 
##                   Portneuf                     Prince 
##                          1                          1 
##                     Québec                     Queens 
##                          1                          3 
##                    Renfrew                Restigouche 
##                          1                          1 
##                   Richmond          Rimouski-Neigette 
##                          1                          1 
##                      River            Rivière-du-Loup 
##                          3                          1 
##            Rivière-du-Nord              Robert-Cliche 
##                          1                          1 
##               Rocher-Percé                    Rockies 
##                          1                          1 
##                 Roussillon                   Rouville 
##                          1                          1 
##              Rouyn-Noranda                    Russell 
##                          1                          1 
##      Saguenay-et-son-Fjord                     Saurel 
##                          1                          1 
## Sept-Rivières--Caniapiscau                 Shawinigan 
##                          1                          1 
##                  Shelburne                 Sherbrooke 
##                          1                          1 
##                     Simcoe                      Sound 
##                          1                          1 
##                    Sources          Squamish-Lillooet 
##                          1                          1 
##                    Stikine                 Strathcona 
##                          1                          1 
##                    Sudbury                    Sunbury 
##                          2                          1 
##              Témiscamingue                Témiscouata 
##                          1                          1 
##            Thompson-Nicola                Timiskaming 
##                          1                          1 
##                    Toronto                      Tuque 
##                          1                          1 
##         Val-Saint-François             Vallée-de-l'Or 
##                          1                          1 
##      Vallée-de-la-Gatineau        Vallée-du-Richelieu 
##                          1                          1 
##                     Valley                  Vancouver 
##                          3                          1 
##        Vaudreuil-Soulanges                   Victoria 
##                          1                          2 
##                 Waddington                   Waterloo 
##                          1                          1 
##                 Wellington                Westmorland 
##                          1                          1 
##                   Yarmouth                       York 
##                          1                          2 
##                      Yukon 
##                          1
| to mean “or”:
Load the gss_cat data. Then:
ym() to create a birthdate variable equal to June 1 of the year in the year column, minus the number of years in the age column (-years())
floor_date() to set it back from June 1 of that year to January 1
race factor to Black, White, Other, in that order
table(gss_cat$rincome) to look at the values income takes
(page 1!)
Then, if you’re going into the string stuff, create
income_num equal to the first number:
word() to get the first word,
str_sub() or str_replace_all() to get rid of the “$”, and
as.numeric() to make the result a number
(you’ll get some NAs for the non-numbers, and let’s ignore the “Lt $1000” case we miss for now)
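The three string steps can be sketched on a handful of values of the kind rincome takes. This runs on a small stand-alone character vector (an assumption made to keep the example short), not on gss_cat itself; inside mutate() you would wrap the factor in as.character() first.

```r
library(stringr)

# A few of the values rincome takes in gss_cat
rincome <- c('$25000 or more', '$1000 to 2999', 'Lt $1000', 'Refused')

# Step 1: keep only the first word of each value
first_word <- word(rincome, 1)

# Step 2: drop the dollar sign ($ is special in regular expressions, so escape it)
no_dollar <- str_replace_all(first_word, '\\$', '')

# Step 3: convert to a number; anything that isn't a number becomes NA
income_num <- suppressWarnings(as.numeric(no_dollar))
income_num
```

The “Lt $1000” value comes out as NA because its first word is “Lt”, which is the case the exercise says to ignore for now.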
We’ve already gone through a lot! Just remember: if you’re doing the same thing a bunch of times in a row, please don’t copy/paste or actually type it all out. Come back to these slides. That includes:
summarize/mutate calculation to a lot of different variables (use across())
mutate_at() or mutate_if(). As of dplyr 1.0.0, these have been superseded by across()
across() lets you use all the variable-selection tricks available in select(), like starts_with() or a:z or 1:5, but then lets you apply functions to each of them in mutate() or summarize()
rowwise() and c_across() let you do stuff like “add up a bunch of columns”
starts_with('price_growth') is the same here as 4:5 or c(price_growth_since_march_4, price_growth_daily)
stockgrowth <- tibble(ticker = c('AMZN','AMZN', 'AMZN', 'WMT', 'WMT','WMT'),
       date = as.Date(rep(c('2020-03-04','2020-03-05','2020-03-06'), 2)),
       stock_price = c(103,103.4,107,85.2, 86.3, 85.6)) %>%
  arrange(ticker, date) %>% group_by(ticker) %>%
  mutate(price_growth_since_march_4 = stock_price/first(stock_price) - 1,
         price_growth_daily = stock_price/lag(stock_price, 1) - 1) 
stockgrowth %>%
  mutate(across(starts_with('price_growth'), function(x) x*10000)) # Convert to basis points
## # A tibble: 6 × 5
## # Groups:   ticker [2]
##   ticker date       stock_price price_growth_since_march_4 price_growth_daily
##   <chr>  <date>           <dbl>                      <dbl>              <dbl>
## 1 AMZN   2020-03-04       103                          0                 NA  
## 2 AMZN   2020-03-05       103.                        38.8               38.8
## 3 AMZN   2020-03-06       107                        388.               348. 
## 4 WMT    2020-03-04        85.2                        0                 NA  
## 5 WMT    2020-03-05        86.3                      129.               129. 
## 6 WMT    2020-03-06        85.6                       46.9              -81.1
The same basis-point conversion from before:
do_bps <- function(p) {
  bps <- p*10000
  return(bps)
}
stockgrowth %>%
  mutate(across(starts_with('price_growth'), do_bps)) # Convert to basis points
## # A tibble: 6 × 5
## # Groups:   ticker [2]
##   ticker date       stock_price price_growth_since_march_4 price_growth_daily
##   <chr>  <date>           <dbl>                      <dbl>              <dbl>
## 1 AMZN   2020-03-04       103                          0                 NA  
## 2 AMZN   2020-03-05       103.                        38.8               38.8
## 3 AMZN   2020-03-06       107                        388.               348. 
## 4 WMT    2020-03-04        85.2                        0                 NA  
## 5 WMT    2020-03-05        86.3                      129.               129. 
## 6 WMT    2020-03-06        85.6                       46.9              -81.1
map() functions in purrr
map() usually generates a list(), map_dbl() a numeric vector, map_chr() a character vector, map_df() a tibble()…
list, data.frame/tibble (which are technically lists), or vector, and then applies a function to each of the elements
##      Person        Year      Income 
## "character"   "numeric"   "numeric"
summary_profile() function you’ve made, and want to check each state’s data to see if its data looks right. You could do
for() loop
import() for any filename will import it (watch your working directory!)
import_list(), given a list of filenames or a single Excel workbook with multiple sheets, will import them all to a list (access each one with [[1]], [[2]], etc., or if they’re row-bindable, just stack them all on top of each other into one data set with rbind = TRUE)
export() to save when you’re done
import() can get a file that’s currently inside a .zip to save space, or export() to .zip
.Rdata: very compact, and flexible (can contain more than just one data set), but specific to R and can’t be used most other places
.csv: the most general and can be used by anything, but large filesizes and you lose some data detail (like, factors don’t carry over)
.parquet: very compact and also fairly general, but also new and not everyone will know what to do with it
vtable() in vtable can generate a documentation file for sharing
x and a and returns x+a
mtcars data.
mutate(across()) to add 1 to all columns ending in ‘p’, using the function you wrote
a as an argument, change “add 1” to “add a” and returns the resulting data
map_df to stack together data sets with a values from 1 to 5
export() from rio
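One possible way through this last exercise, as a sketch. The names add_a, add_a_to_p, mtcars_plus1, and stacked are made up for illustration, and the final export() step is left out so that nothing gets written to disk.

```r
library(dplyr)
library(purrr)

# A function that takes x and a and returns x + a
add_a <- function(x, a) {
  x + a
}

# Add 1 to every mtcars column ending in 'p' (disp and hp)
mtcars_plus1 <- mtcars %>%
  mutate(across(ends_with('p'), function(x) add_a(x, 1)))

# Same idea, but a is now an argument and the resulting data is returned
add_a_to_p <- function(a) {
  mtcars %>%
    mutate(across(ends_with('p'), function(x) add_a(x, a)))
}

# Run it for a = 1 through 5 and stack the five data sets
stacked <- map_df(1:5, add_a_to_p)
```

From here, something like export(stacked, 'stacked.csv') from rio would save the result; the filename is just a placeholder.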