This week:

scales::percent(mean(mtcars$am, na.rm = TRUE), accuracy = .1)

can be rewritten pipe-style as

mtcars %>% pull(am) %>% mean(na.rm = TRUE) %>% scales::percent(accuracy = .1)

pull() is a dplyr function that says "give me back this one variable instead of a data set" but in a pipe-friendly way, so mtcars %>% pull(am) is the same as mtcars$am or mtcars[['am']]
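A quick sanity check of that equivalence (mtcars ships with R; dplyr provides pull()):

```r
library(dplyr)

# pull() returns the bare vector, just like $ or [[ ]]
identical(mtcars %>% pull(am), mtcars$am)      # TRUE
identical(mtcars %>% pull(am), mtcars[['am']]) # TRUE
```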
I help a lot of people with their data-wrangling problems. Their issues are almost always a failure to do one of these four things, much more so than trouble with coding or anything like that:

- View() the data to look at it
- sumtable() or vtable(lush = TRUE) in vtable, for example
- unique() or summary() on individual variables
- sum(data$variable == value) to count how often a variable takes a given value
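For example, in base R alone (sumtable() and vtable() are from the vtable package and give nicer output):

```r
# Quick checks before wrangling a data set
data(mtcars)
str(mtcars)          # structure: variable types and first few values
unique(mtcars$cyl)   # what values does cyl take? 6, 4, 8
summary(mtcars$mpg)  # distribution of a single variable
sum(mtcars$cyl == 4) # how many rows have cyl == 4? 11
```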
For the purposes of this class, we will work with tidy data.

The variables in tidy data come in two types: keys, which identify what each observation is, and values, which describe the observations. Which are they in this data?
There are many more elements to this, but I will cover two tools: pivot (briefly) and join (in more detail).

Pivot we will only review in a flash here. Just remember that it exists, and come back later or watch the video when you actually need it.

The two functions are pivot_longer() and pivot_wider(). Here we want wide-to-long, so we use pivot_longer(), which needs:

- data (the data set you're working with, also the first argument, so we can pipe to it)
- cols (the columns to pivot) - it will assume anything not named here are the keys
- names_to (the name of the variable to store which column a given row came from, here "week")
- values_to (the name of the variable to store the value in)

See help(pivot_longer) for more.

# pivot is pretty good at just figuring stuff out for us, at least in this simple case!
multicat_long <- multicat %>%
pivot_longer(cols = starts_with('Sales_'), # tidyselect functions help us pick columns based on name patterns
names_to = 'Category', values_to = 'Sales')
multicat_long
## # A tibble: 8 × 3
## Date Category Sales
## <date> <chr> <dbl>
## 1 2000-01-01 Sales_Goods 4
## 2 2000-01-01 Sales_Services 4
## 3 2000-02-01 Sales_Goods 5
## 4 2000-02-01 Sales_Services 9
## 5 2000-03-01 Sales_Goods 1
## 6 2000-03-01 Sales_Services 2
## 7 2000-04-01 Sales_Goods 2
## 8 2000-04-01 Sales_Services 3
For long-to-wide we use pivot_wider(), and we can then combine multiple individuals with bind_rows().

pivot_wider() needs:

- data (first argument, the data we're working with)
- id_cols (the columns that give us the key - what should it be here?)
- names_from (the column containing what will be the new variable names)
- values_from (the column containing the new values)

See help(pivot_wider) for more.
## # A tibble: 1 × 4
## Person Income Deductible AGI
## <chr> <chr> <chr> <chr>
## 1 James Acaster 112341 24000 88341
(note that the variables are all stored as character variables, not numbers - that's because the "person" row is a character, which forced the rest to be too. We'll go through how to fix that later)
That was person_year_data. And now for person_data:

The join family of functions will do this (see help(join)). The different varieties just determine what to do with rows you don't find a match for:

- left_join() keeps non-matching rows from the first dataset but not the second
- right_join() keeps them from the second but not the first
- full_join() keeps them from both
- inner_join() keeps them from neither
- anti_join() JUST keeps non-matches

## Person Year Income Birthplace
## 1 Ramesh 2014 81314 Crawley
## 2 Ramesh 2015 82155 Crawley
## 3 Whitney 2014 131292 Washington D.C.
## 4 Whitney 2015 141262 Washington D.C.
## 5 David 2014 102452 <NA>
## 6 David 2015 105133 <NA>
## Person Year Income Birthplace
## 1 Ramesh 2014 81314 Crawley
## 2 Ramesh 2015 82155 Crawley
## 3 Whitney 2014 131292 Washington D.C.
## 4 Whitney 2015 141262 Washington D.C.
The by variables should be the exact observation level in at least one of the two data sets. If there are duplicate combinations of the by variables in both, that's a problem! It will create all the potential matches, which may not be what you want:

a <- tibble(Name = c('A','A','B','C'), Year = c(2014, 2015, 2014, 2014), Value = 1:4)
b <- tibble(Name = c('A','A','B','C','C'), Characteristic = c('Up','Down','Up','Left','Right'))
a %>% left_join(b, by = 'Name')
## # A tibble: 7 × 4
## Name Year Value Characteristic
## <chr> <dbl> <int> <chr>
## 1 A 2014 1 Up
## 2 A 2014 1 Down
## 3 A 2015 2 Up
## 4 A 2015 2 Down
## 5 B 2014 3 Up
## 6 C 2014 4 Left
## 7 C 2014 4 Right
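It's worth checking whether a proposed key really identifies rows uniquely before joining. A sketch, re-creating the a tibble so it runs on its own:

```r
library(dplyr)

a <- tibble(Name = c('A','A','B','C'),
            Year = c(2014, 2015, 2014, 2014),
            Value = 1:4)

# Is Name alone the key? If any Name value is duplicated, it isn't.
a %>% pull(Name) %>% duplicated() %>% any()          # TRUE - not a key
# Name and Year together do uniquely identify rows:
a %>% select(Name, Year) %>% duplicated() %>% any()  # FALSE - a valid key
```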
If we think Person is the key for data set a, but a %>% select(Person) %>% duplicated() %>% max() returns TRUE, that shows us we're wrong.

Now: join data1 with data2, maintaining both categories:
## # A tibble: 8 × 5
## Month MonthName Department value Director
## <int> <chr> <chr> <dbl> <chr>
## 1 1 Jan Sales 6 Lavy
## 2 1 Jan RD 4 <NA>
## 3 2 Feb Sales 8 Lavy
## 4 2 Feb RD 1 <NA>
## 5 3 Mar Sales 1 Lavy
## 6 3 Mar RD 2 <NA>
## 7 4 Apr Sales 2 Lavy
## 8 4 Apr RD 4 <NA>
The main dplyr verbs are filter(), select(), arrange(), mutate(), group_by(), and summarize(). We'll also cover pull() (which we already did), case_when(), rename(), and slice().

filter() limits the data to the observations that fulfill a certain logical condition. It picks rows.

Income > 100000 is TRUE for everyone with income above 100000, and FALSE otherwise. So filter(data, Income > 100000) would return just the rows of data that have Income > 100000:
## Person Year Income Birthplace
## 1 Whitney 2014 131292 Washington D.C.
## 2 Whitney 2015 141262 Washington D.C.
## 3 David 2014 102452 <NA>
## 4 David 2015 105133 <NA>
A logical condition, when true, returns TRUE, which turns into 1 if you do a calculation with it. If false, it returns FALSE, which turns into 0. (Tip: ifelse() is rarely what you want, and ifelse(condition, TRUE, FALSE) is redundant.)

Handy tools for constructing logical conditions:
- a > b, a >= b, a < b, a <= b, a == b, or a != b to compare two numbers and check if a is above, above-or-equal, below, below-or-equal, equal (note == to check equality, not =), or not equal
- a %in% c(b, c, d, e, f) is SUPER HANDY and checks whether a is any of the values b, c, d, e, or f. Works for text too!
- Reverse conditions with !, or chain together with & (and) or | (or)
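Putting those tools together on a made-up tibble:

```r
library(dplyr)

df <- tibble(Person = c('Ramesh','Whitney','David'),
             Income = c(81314, 131292, 102452))

df %>% filter(Income > 100000)                         # Whitney and David
df %>% filter(Person %in% c('Ramesh','Whitney'))       # %in% works for text
df %>% filter(!(Income > 100000) | Person == 'David')  # ! and | chained: Ramesh and David
```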
Like filter(), slice() picks rows. The difference is that filter() picks them by logical condition, while slice() does it by row index number. slice(1:5) picks the first five rows, slice(-103) drops the 103rd row, etc.

select() gives you back just a subset of the columns. It picks columns. Use - to not pick certain columns.

If our data has the columns "Person", "Year", and "Income", then all of these do the same thing:
no_income <- person_year_data %>% select(Person, Year)
no_income <- person_year_data %>% select('Person','Year')
no_income <- person_year_data %>% select(1:2)
no_income <- person_year_data %>% select(-Income)
print(no_income)
## Person Year
## 1 Ramesh 2014
## 2 Ramesh 2015
## 3 Whitney 2014
## 4 Whitney 2015
## 5 David 2014
## 6 David 2015
arrange() sorts the data. That's it! Give it the column names and it will sort the data by those columns. It combines usefully with slice(): arrange(-Income) %>% slice(1:5) gives you the five rows with the highest income, for example.

## Person Year Income
## 1 David 2014 102452
## 2 David 2015 105133
## 3 Ramesh 2014 81314
## 4 Ramesh 2015 82155
## 5 Whitney 2014 131292
## 6 Whitney 2015 141262
Load the construction data and:

- select() the Year, Month, and West columns
- filter() to just the months of July, August, and September (use %in%)
- arrange() by West in descending order (hint: -West)
- slice() to pick the top two rows

construction %>%
select(Year, Month, West) %>%
filter(Month %in% c('July','August','September')) %>%
arrange(-West) %>%
slice(1:2)
## # A tibble: 2 × 3
## Year Month West
## <dbl> <chr> <dbl>
## 1 2018 July 310
## 2 2018 September 296
mutate() assigns columns/variables, i.e. you can create variables with it (note also its sibling transmute(), which does the same thing and then drops any variables you don't explicitly specify in the function). You can create several variables in a single mutate() call, separated by commas (,):

## Person Year Income NextYear Above100k
## 1 Ramesh 2014 81314 2015 FALSE
## 2 Ramesh 2015 82155 2016 FALSE
## 3 Whitney 2014 131292 2015 TRUE
## 4 Whitney 2015 141262 2016 TRUE
## 5 David 2014 102452 2015 TRUE
## 6 David 2015 105133 2016 TRUE
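A table like the one above could be produced by a mutate() call along these lines (a reconstruction; person_year_data is rebuilt from the values shown):

```r
library(dplyr)

person_year_data <- tibble(
  Person = rep(c('Ramesh','Whitney','David'), each = 2),
  Year   = rep(c(2014, 2015), 3),
  Income = c(81314, 82155, 131292, 141262, 102452, 105133))

person_year_data %>%
  mutate(NextYear = Year + 1,           # a calculation on an existing column
         Above100k = Income > 100000)   # a logical condition as a new variable
```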
A common mutate() companion is case_when(), which is sort of like ifelse() except it can cleanly handle way more than one condition. You give case_when() a series of if ~ then conditions, separated by commas, and it will go through the ifs one by one for each observation until it finds a fitting one. Have the last if be TRUE to give a value to anyone who hasn't been caught yet:

person_year_data %>%
for anyone who hasn’t been caught yetperson_year_data %>%
mutate(IncomeBracket = case_when(
Income <= 50000 ~ 'Under 50k',
Income > 50000 & Income <= 100000 ~ '50-100k',
Income > 100000 & Income < 120000 ~ '100-120k',
TRUE ~ 'Above 120k'
))
## Person Year Income IncomeBracket
## 1 Ramesh 2014 81314 50-100k
## 2 Ramesh 2015 82155 50-100k
## 3 Whitney 2014 131292 Above 120k
## 4 Whitney 2015 141262 Above 120k
## 5 David 2014 102452 100-120k
## 6 David 2015 105133 100-120k
group_by() turns the dataset into a grouped data set, splitting it by each combination of the grouping variables. mutate() or (up next) summarize() or (if you want to get fancy) group_map() then process the data separately by each group.
## # A tibble: 6 × 4
## # Groups: Person [3]
## Person Year Income Income_Relative_to_Mean
## <chr> <dbl> <dbl> <dbl>
## 1 Ramesh 2014 81314 -420.
## 2 Ramesh 2015 82155 420.
## 3 Whitney 2014 131292 -4985
## 4 Whitney 2015 141262 4985
## 5 David 2014 102452 -1340.
## 6 David 2015 105133 1340.
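The within-person demeaning shown above can be reproduced like so (a reconstruction; person_year_data rebuilt from the values shown):

```r
library(dplyr)

person_year_data <- tibble(
  Person = rep(c('Ramesh','Whitney','David'), each = 2),
  Year   = rep(c(2014, 2015), 3),
  Income = c(81314, 82155, 131292, 141262, 102452, 105133))

person_year_data %>%
  group_by(Person) %>%
  # mean(Income) here is each person's own mean, not the overall mean
  mutate(Income_Relative_to_Mean = Income - mean(Income))
```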
The data remains grouped until you group_by() it differently, or ungroup() it, or summarize() it (which removes one of the grouping variables).

group_by() helps us move information from one row to another in a key variable - otherwise a difficult move! Inside a grouped mutate() or summarize(), n() gives the number of rows in the group - handy! And row_number() gives the row number of that observation within its group.

summarize() changes the observation level to a broader level, set by group_by():
person_year_data %>%
group_by(Person) %>%
summarize(Mean_Income = mean(Income),
Years_Tracked = n())
## # A tibble: 3 × 3
## Person Mean_Income Years_Tracked
## <chr> <dbl> <int>
## 1 David 103792. 2
## 2 Ramesh 81734. 2
## 3 Whitney 136277 2
group_by() %>% summarize() is the solution to many, many, many of your problems in data viz. If you're having a hard time doing something, ask yourself if group_by() %>% summarize() will do it for you.

Note that summarize() needs to be given calculations to perform that return a single value. summarize(mean_x = x) won't take the mean of x, it'll just return the original data.

Use rename()
for renaming your variables:

## Name Year Annual Income
## 1 Ramesh 2014 81314
## 2 Ramesh 2015 82155
## 3 Whitney 2014 131292
## 4 Whitney 2015 141262
## 5 David 2014 102452
## 6 David 2015 105133
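The renamed output above could come from a call like this (a reconstruction; person_year_data is rebuilt from the values shown):

```r
library(dplyr)

person_year_data <- tibble(
  Person = rep(c('Ramesh','Whitney','David'), each = 2),
  Year   = rep(c(2014, 2015), 3),
  Income = c(81314, 82155, 131292, 141262, 102452, 105133))

person_year_data %>%
  rename(Name = Person,              # new name = old name
         `Annual Income` = Income)   # backticks allow spaces in names
```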
Using the same construction data from before (you may have to reload it if you overwrote it):

- mutate() a new variable avg which adds Northeast, Midwest, South, and West, and then divides by 4
- rename() that variable to Price Average (note the space in the name - backticks!)
- mutate() a variable firsthalf that's "First half" for the first six months of the year and "Second half" for the second half (hint: case_when() and row_number())
- group_by() the firsthalf variable, and then summarize() to get the mean of Price Average by groups

construction %>%
by groupsconstruction %>%
mutate(avg = (Northeast + Midwest + South + West)/4) %>%
rename(`Price Average` = avg) %>%
mutate(firsthalf = case_when(row_number() <= 6 ~ "First Half",
row_number() > 6 ~ "Second Half")) %>%
group_by(firsthalf) %>%
summarize(`Price Average` = mean(`Price Average`))
## # A tibble: 2 × 2
## firsthalf `Price Average`
## <chr> <dbl>
## 1 First Half 311.
## 2 Second Half 298.
As I mentioned, students tend to forget that group_by() followed by summarize() or mutate() solves so many problems. Like what?

- Counts within a group (group_by(X) %>% summarize(n = n()))
- Changes over time (arrange(date) %>% group_by(X) %>% mutate(change = Y - lag(Y)))
- And plenty more of our mutate()s and our summarize()s
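A sketch of that change-over-time pattern on made-up data (lag() is dplyr's shift-back-one-row function):

```r
library(dplyr)

prices <- tibble(firm  = c('A','A','A','B','B','B'),
                 date  = rep(as.Date(c('2020-01-01','2020-02-01','2020-03-01')), 2),
                 price = c(10, 12, 11, 5, 6, 8))

prices %>%
  arrange(date) %>%
  group_by(firm) %>%
  # lag(price) is the previous row's price *within the same firm*
  mutate(change = price - lag(price))
```

Each firm's first row gets an NA change, since there is no previous price to compare against.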
Three common variable types need to be manipulated regularly in application to our data visualization tasks.

You can check a variable's type in the printout of a tibble(), with is. and then the type (is.numeric(), is.character(), ...), or by doing str(data). Convert with as. and then the type (as.numeric(), as.factor(), ...):

taxdata %>%
pivot_wider(names_from = 'TaxFormRow',
values_from = 'Value') %>%
mutate(Person = as.factor(Person),
Income = as.numeric(Income),
Deductible = as.numeric(Deductible),
AGI = as.numeric(AGI))
## # A tibble: 1 × 4
## Person Income Deductible AGI
## <fct> <dbl> <dbl> <dbl>
## 1 James Acaster 112341 24000 88341
The factor() function lets you specify these labels, and also specify the order the levels go in - factors can be ordered! The levels will determine the order in which things are graphed:

tibble(Income = c('50k-100k','Less than 50k', '50k-100k', '100k+', '100k+')) %>%
mutate(Income = factor(Income, levels = c('Less than 50k','50k-100k','100k+'))) %>%
arrange(Income)
## # A tibble: 5 × 1
## Income
## <fct>
## 1 Less than 50k
## 2 50k-100k
## 3 50k-100k
## 4 100k+
## 5 100k+
reorder(x, Y) will reorder the factor x according to the values of Y. Very handy for, say, ordering a set of bars on a bar graph by the height of the bars.

For dates, lubridate is your friend:

- Parse dates from text with ymd('2020-01-01'), ym('202001'), mdy('01-01-2020'), etc.
- Use paste0() (we'll get there with strings) to make a date out of multiple columns: mutate(date = mdy(paste0(MONTH, DAY, YEAR)))
- Pull pieces out with year(date), quarter(date), month(date), day(date), etc.
- Get month names with month.abb[month(date)] or month.name[month(date)]
- Do date arithmetic: ymd('2020-01-01') + years(1), ymd('2020-01-01') + months(3), ymd('2020-01-01') - days(7)
- Round dates down with floor_date(): floor_date(ymd('2021-05-15'), 'month') gives the first of May, 2021
- Get the week() of the year, or work with date-time objects
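Those date pieces, as one runnable sketch (all of these functions come from lubridate):

```r
library(lubridate)

d <- ymd('2020-01-01')
year(d)                                  # 2020
month.abb[month(d)]                      # "Jan"
d + months(3)                            # 2020-04-01
d - days(7)                              # 2019-12-25
floor_date(ymd('2021-05-15'), 'month')   # 2021-05-01
week(ymd('2020-02-01'))                  # week 5 of the year
```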
Create strings with '' or "". Use paste0() to stick stuff together: paste0('h','ello') is 'hello' (for a separator, use paste() instead: paste('h','ello', sep = '_') is 'h_ello'). '\n' can be used for line breaks.

The stringr functions all start with str_, so you can just type stringr::str_ and let RStudio's autocomplete give some suggestions.

The next few slides we're going to skip past, but they're here to come back to in case you need:
- Substrings: str_sub('hello', 2, 4) is 'ell'; word('hello hi', 2) is 'hi'
- Splitting: str_split() and separate()
- Cleaning whitespace: str_trim() / str_squish()
- Replacing: str_replace('MSFT stock', 'MSFT', 'Microsoft') is 'Microsoft stock'

To pull out part of a string, str_sub(string, start, end) will do this: str_sub('hello', 2, 4) is 'ell'. Negative indices count from the end: str_sub('hello', -1) is 'o'.

To get words from a string, use word(): word('Sales Department', 1) will give back just 'Sales'.

To split a string, str_split() will do this: str_split('a,b', ',')[[1]] is c('a','b'). Or use separate() from tidyr - make sure you list enough new "into" columns to get everything!

tibble(category = c('Sales,Marketing','H&R,Marketing')) %>%
separate(category, into = c('Category1', 'Category2'), ',')
## # A tibble: 2 × 2
## Category1 Category2
## <chr> <chr>
## 1 Sales Marketing
## 2 H&R Marketing
str_trim() removes beginning/end whitespace; str_squish() removes additional whitespace from the middle too. str_trim(' hi hello ') is 'hi hello'.

str_replace() and str_replace_all() are often handy for eliminating (or fixing) unwanted characters:

## [1] "My name is Brian"
tibble(number = c('1,000', '2,003,124')) %>%
mutate(number = number %>% str_replace_all(',', '') %>% as.numeric())
## # A tibble: 2 × 1
## number
## <dbl>
## 1 1000
## 2 2003124
str_to_upper() and str_to_lower() can make all-caps or all-lowercase. str_to_title() is very nice for making something title-case, Like This. str_to_sentence() can easily convert something into sentence case, Like this.

In str_replace_all(',', ''), the ',' is a regular expression saying "look for a comma". Regular expressions can do much more: [0-9] to look for a digit, [a-zA-Z] for letters, * to repeat until you see the next thing... hard to condense here. Read the guide.

For example, say we want to detect and remove stock tickers like "(AMZN)". separate() won't do it here, not easily! But the regular expression '\\([A-Z].*\\)' will. It says "look for a (" (note the \\ to treat the usually-special ( character as an actual character), then "look for a capital letter [A-Z]", then "keep taking more characters .*", then "look for a )".

tibble(name = c('Amazon (AMZN) Holdings','Cargill Corp. (cool place!)')) %>%
mutate(publicly_listed = str_detect(name, '\\([A-Z].*\\)'),
name = str_replace_all(name, '\\([A-Z].*\\)', ''))
## # A tibble: 2 × 2
## name publicly_listed
## <chr> <lgl>
## 1 Amazon Holdings TRUE
## 2 Cargill Corp. (cool place!) FALSE
data("fips_to_names", package = 'SafeGraphR')
# Always check if the data is what you think! Do they all end in "County"?
# word(n) pulls the nth "word", -1 starts from the end
fips_to_names %>% pull(countyname) %>% word(-1) %>% table()
## .
## 1 10
## 5 4
## (long output truncated - a few of the key rows:)
## County Cumberland
## 3007 1
## Chatham-Kent city
## 1 40
## Papineau Parish
## 1 64
## Bonaventure Borough
## 1 17
## (...plus many one-off last words from Canadian place names: Abitibi, Montréal, Yukon, ...)

So no - they don't all end in "County": there are also boroughs, parishes, cities, and Canadian census divisions in this data.
(In regular expressions you can also use | to mean "or".)

Load the gss_cat data. Then:

- Use ym() to create a birthdate variable equal to June 1 of the year in the year column, minus the number of years in the age column (-years())
- Use floor_date() to set it back from June 1 of that year to January 1
- Reorder the race factor to Black, White, Other, in that order
- Use table(gss_cat$rincome) to look at the values income takes (page 1!)

Then, if you're going into the string stuff, create income_num equal to the first number:

- word() to get the first word
- str_sub() or str_replace_all() to get rid of the "$", and
- as.numeric() to make the result a number

(you'll get some NAs for the non-numbers, and let's ignore the "Lt $1000" case we miss for now)
We’ve already gone through a lot! Just remember: if you’re doing the same thing a bunch of times in a row, please don’t copy/paste or actually type it all out. Come back to these slides. That includes:
summarize/mutate
calculation to a lot of
different variables (use across()
)mutate_at()
or mutate_if()
. As of
dplyr 1.0.0, these have been deprecated in favor of
across()
across()
lets you use all the variable-selection tricks
available in select()
, like starts_with()
or
a:z
or 1:5
, but then lets you apply functions
to each of them in mutate()
or
summarize()
rowwise()
and c_across()
lets
you do stuff like “add up a bunch of columns”starts_with('price_growth')
is the same here as
4:5
or
c(price_growth_since_march_4, price_growth_daily)
stockgrowth <- tibble(ticker = c('AMZN','AMZN', 'AMZN', 'WMT', 'WMT','WMT'),
date = as.Date(rep(c('2020-03-04','2020-03-05','2020-03-06'), 2)),
stock_price = c(103,103.4,107,85.2, 86.3, 85.6)) %>%
arrange(ticker, date) %>% group_by(ticker) %>%
mutate(price_growth_since_march_4 = stock_price/first(stock_price) - 1,
price_growth_daily = stock_price/lag(stock_price, 1) - 1)
stockgrowth %>%
mutate(across(starts_with('price_growth'), function(x) x*10000)) # Convert to basis points
## # A tibble: 6 × 5
## # Groups: ticker [2]
## ticker date stock_price price_growth_since_march_4 price_growth_daily
## <chr> <date> <dbl> <dbl> <dbl>
## 1 AMZN 2020-03-04 103 0 NA
## 2 AMZN 2020-03-05 103. 38.8 38.8
## 3 AMZN 2020-03-06 107 388. 348.
## 4 WMT 2020-03-04 85.2 0 NA
## 5 WMT 2020-03-05 86.3 129. 129.
## 6 WMT 2020-03-06 85.6 46.9 -81.1
The same basis-point conversion from before:
do_bps <- function(p) {
bps <- p*10000
return(bps)
}
stockgrowth %>%
mutate(across(starts_with('price_growth'), do_bps)) # Convert to basis points
## # A tibble: 6 × 5
## # Groups: ticker [2]
## ticker date stock_price price_growth_since_march_4 price_growth_daily
## <chr> <date> <dbl> <dbl> <dbl>
## 1 AMZN 2020-03-04 103 0 NA
## 2 AMZN 2020-03-05 103. 38.8 38.8
## 3 AMZN 2020-03-06 107 388. 348.
## 4 WMT 2020-03-04 85.2 0 NA
## 5 WMT 2020-03-05 86.3 129. 129.
## 6 WMT 2020-03-06 85.6 46.9 -81.1
Next up: the map() functions in purrr. map() usually generates a list(), map_dbl() a numeric vector, map_chr() a character vector, map_df() a tibble()... Each takes a list, a data.frame/tibble (which are technically lists), or a vector, and then applies a function to each of the elements:

## Person Year Income
## "character" "numeric" "numeric"
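The class-per-column output above can be reproduced with map_chr() (person_year_data is rebuilt here from the values shown on the slides):

```r
library(dplyr)
library(purrr)

person_year_data <- tibble(
  Person = rep(c('Ramesh','Whitney','David'), each = 2),
  Year   = rep(c(2014, 2015), 3),
  Income = c(81314, 82155, 131292, 141262, 102452, 105133))

# A data frame is a list of columns, so map over the columns:
map_chr(person_year_data, class)
```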
Imagine you have a summary_profile() function you've made, and want to check each state's data to see if it looks right. You could do that with map() - no need for a for() loop.
loopimport()
for any filename will import it (watch your
working directory!)import_list()
, given a list of filenames or a single
Excel workbook with multiple sheets, will import them all to a list
(access each one with [[1]]
, [[2]]
, etc., or
if they’re row-bindable, just stack them all on top of each other into
one data set with rbind = TRUE
)export()
to save when you’re doneimport()
can get a file that’s currently inside a
.zip
to save space, or export()
to
.zip
.Rdata
: very compact, and flexible (can contain more
than just one data set), but specific to R and can’t be used most other
places.csv
: the most general and can be used by anything, but
large filesizes and you lose some data detail (like, factors don’t carry
over).parquet
: very compact and also fairly general, but
also new and not everyone will know what to do with itvtable()
in
vtable can generate a documentation file for
sharingx
One last exercise:

- Write a function that takes x and a and returns x + a
- Load the mtcars data
- Use mutate(across()) to add 1 to all columns ending in 'p', using the function you wrote
- Then, taking a as an argument, change "add 1" to "add a" and return the resulting data
- Use map_df to stack together data sets with a values from 1 to 5
- Save the result with export() from rio
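One way the pieces could fit together (a sketch, not an official answer key; the helper names add_a and add_a_to_p are my own, and the export() call is commented out so nothing is written to disk):

```r
library(dplyr)
library(purrr)

# A function that takes x and a and returns x + a
add_a <- function(x, a) x + a

# Add a to every column ending in 'p' (disp and hp in mtcars),
# and tag which a was used
add_a_to_p <- function(a) {
  mtcars %>%
    mutate(across(ends_with('p'), ~ add_a(.x, a)),
           a = a)
}

# Stack the a = 1 through a = 5 versions on top of each other
stacked <- map_df(1:5, add_a_to_p)
# rio::export(stacked, 'stacked.csv')  # save when done
```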