# dplyr variants
```r
library(pmdplyr)
```
The `pmdplyr` package is an extension to `dplyr` designed for cleaning and managing panel and hierarchical data. It contains variations on the `dplyr::mutate()` and `dplyr::join()` functions that address common panel data needs, as well as functions for managing and cleaning panel data. Unlike other panel data packages, functions in `pmdplyr` are all designed to work even if there is more than one observation per individual per period. This comes in handy if each individual is observed multiple times per period - for example, multiple classes per student per term; or if you have hierarchical data - for example, multiple companies per country.
There are three vignettes in total describing the contents of `pmdplyr`:

- "pmdplyr"/"Get Started", which describes the `pibble` panel data object type, and the `pmdplyr` tools for creating well-behaved ID and time variables, `id_variable()` and `time_variable()`.
- The current vignette, "dplyr variants", which describes `pmdplyr` variations on the `dplyr` functions `mutate()` (`mutate_cascade()` and `mutate_subset()`), `_join()` (`inexact_join()` and `safe_join()`), and `lag()` (`tlag()`).
- "Panel Tools", which describes novel tools that `pmdplyr` provides for cleaning and manipulating panel data (`panel_fill()`, `panel_locf()`, `fixed_check()`, `fixed_force()`, `between_i()`, `within_i()`, `mode_order()`).
## inexact_join()

`pmdplyr` offers a set of wrappers for the `dplyr::join()` functions. The set of `inexact_join()` functions maps directly onto the set of `dplyr::join()` functions:
`inexact_inner_join()`, `inexact_left_join()`, `inexact_right_join()`, `inexact_full_join()`, `inexact_semi_join()`, `inexact_anti_join()`, and `inexact_nest_join()`.
Here we will focus specifically on `inexact_left_join()`. For the differences between the functions, see `dplyr::join()`.

The `join()` functions take two data sets and join them based on matching values of a set of shared variables.
```r
left_df <- data.frame(
  i = c(1, 1, 1, 2, 2, 2),
  t = c(1, 2, 3, 1, 2, 3),
  v1 = 1:6
)
right_df <- data.frame(
  i = c(1, 1, 1, 2, 2, 2),
  t = c(0, 2, 4, 0, 2, 4),
  v2 = 7:12
)

# It automatically detects that i and t are the shared variables
# and finds two combinations of those in left_df that are also
# in right_df: i = 1, t = 2, and i = 2, t = 2. So it brings the
# v2 values it can match up into the joined data.
# Other observations don't find a match.
left_join(left_df, right_df)
#>   i t v1 v2
#> 1 1 1  1 NA
#> 2 1 2  2  8
#> 3 1 3  3 NA
#> 4 2 1  4 NA
#> 5 2 2  5 11
#> 6 2 3  6 NA
```
However, it is common (especially in a panel data context) to want to join two data frames where one of the variables does not line up exactly. For example, maybe we want those `t = 1` values in `left_df` to pick up the `t = 0` values in `right_df`.

We can do this in a few different ways with an `inexact_join()`:
```r
inexact_left_join(x, y,
  by = NULL, copy = FALSE,
  suffix = c(".x", ".y"), ...,
  var = NULL, jvar = NULL,
  method, exact = TRUE
)
```
The first arguments, `x`, `y`, `by`, `copy`, `suffix`, and `...`, are standard arguments to be passed to `left_join()`. `x` and `y` are our left-hand and right-hand data sets, respectively. See `dplyr::left_join()` for the rest.

We've added on here `var`, `jvar`, `method`, and `exact`.
`var` is the variable in the left-hand data set that you would like to match inexactly on, and `jvar` is the variable(s) in the right-hand data set that you would like to match inexactly on. It's important that the names of these variables aren't shared, because the resulting data set will show how `var` and `jvar` line up. So let's prepare our data by renaming `t` in `right_df` to something else, so it's not `t` in both data sets.
```r
right_df <- right_df %>%
  rename(t_right = t)
```
`method` determines how `var` and `jvar` will be matched up.
`method = "last"` matches `var` to the closest value of `jvar` that is lower, so those `t = 1` observations will get matched to `t_right = 0`, and `t = 3` will get matched to `t_right = 2` (meaning that `t_right = 2` will get matched to both `t = 2` and `t = 3`):

```r
inexact_left_join(left_df, right_df,
  var = t, jvar = t_right,
  method = "last"
)
#>   i t v1 t_right v2
#> 1 1 1  1       0  7
#> 2 1 2  2       2  8
#> 3 1 3  3       2  8
#> 4 2 1  4       0 10
#> 5 2 2  5       2 11
#> 6 2 3  6       2 11
```
`method = "next"` matches `var` to the closest value of `jvar` that is higher, so now `t = 1` will get matched to `t_right = 2`, and `t = 3` will get matched to `t_right = 4`:

```r
inexact_left_join(left_df, right_df,
  var = t, jvar = t_right,
  method = "next"
)
#>   i t v1 t_right v2
#> 1 1 1  1       2  8
#> 2 1 2  2       2  8
#> 3 1 3  3       4  9
#> 4 2 1  4       2 11
#> 5 2 2  5       2 11
#> 6 2 3  6       4 12
```
`method = "closest"` will match `var` to the closest value of `jvar` in either direction. If there's a tie, it will pick the lower value of `jvar`. So now `t = 1` will pick `t_right = 0` (out of a tie between `0` and `2`), and `t = 3` will match to `t_right = 2`:

```r
inexact_left_join(left_df, right_df,
  var = t, jvar = t_right,
  method = "closest"
)
#>   i t v1 t_right v2
#> 1 1 1  1       0  7
#> 2 1 2  2       2  8
#> 3 1 3  3       2  8
#> 4 2 1  4       0 10
#> 5 2 2  5       2 11
#> 6 2 3  6       2 11
```
`method = "between"` is for matching `var` to a set of two `jvar`s that define the beginning and end of a range. Make sure that the ranges are non-overlapping within the joining variables, or else you will get strange results (specifically, it should join to the earliest-starting range). So now, given the way we define `t_bottom` and `t_top` below, `t = 1` should go in the range `t_bottom = 0, t_top = 2`, and `t = 2` and `t = 3` should both go in the range `t_bottom = 2, t_top = 4`.

```r
right_df <- right_df %>%
  rename(t_bottom = t_right) %>%
  mutate(t_top = t_bottom + 2)

inexact_left_join(left_df, right_df,
  var = t, jvar = c(t_bottom, t_top),
  method = "between"
)
#>   i t v1 t_bottom v2 t_top
#> 1 1 1  1        0  7     2
#> 2 1 2  2        2  8     4
#> 3 1 3  3        2  8     4
#> 4 2 1  4        0 10     2
#> 5 2 2  5        2 11     4
#> 6 2 3  6        2 11     4
```
So that leaves us with `exact`. `exact` determines whether or not an exact match is an acceptable match, and interprets `"last"` as "this value or earlier" and `"next"` as "this value or later". Generally, for joining purposes, you'll want this to be `TRUE`. But perhaps you don't! Maybe you want "earlier" or "later" only, to get something like "the most recent previous value" for `method = "last"`. In that case, set this to `FALSE`.
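For instance, here is a minimal sketch of that "most recent previous value" use, with made-up data (`ex_left`, `ex_right`, and `t_match` are hypothetical and not part of the running example above):

```r
# Hypothetical data for illustration only
ex_left <- data.frame(i = c(1, 1), t = c(2, 3), v1 = 1:2)
ex_right <- data.frame(i = c(1, 1), t_match = c(0, 2), v2 = c(10, 20))

# With exact = FALSE, an exact match doesn't count: t = 2 skips the
# t_match = 2 row and picks up the most recent strictly earlier value,
# t_match = 0. t = 3 matches t_match = 2 as usual.
inexact_left_join(ex_left, ex_right,
  var = t, jvar = t_match,
  method = "last", exact = FALSE
)
```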
In the case of `method = "between"`, it's especially important to keep track of `exact`, because it's common for one range to start at the exact endpoint of another. If the end of one range is the exact start of another, `exact = c(TRUE, FALSE)` or `exact = c(FALSE, TRUE)` is recommended to avoid overlaps. Defaults to `exact = c(TRUE, FALSE)`.
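As a sketch of why this matters in our running example: the range ending at `t_top = 2` and the range starting at `t_bottom = 2` share an endpoint. With `exact = c(TRUE, FALSE)`, each range includes its start but excludes its end, so `t = 2` falls unambiguously into the range starting at `t_bottom = 2`:

```r
# Same result as the example above, but with the endpoint handling
# spelled out: range starts are inclusive, range ends are exclusive
inexact_left_join(left_df, right_df,
  var = t, jvar = c(t_bottom, t_top),
  method = "between", exact = c(TRUE, FALSE)
)
```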
## safe_join()

When joining two data sets `x` and `y` on a set of shared variables `by`, there are four ways in which they can be matched:

- One-to-many: `by` uniquely identifies rows in `x` but not `y`, so each observation in `x` will be matched to several in `y`.
- Many-to-one: `by` uniquely identifies rows in `y` but not `x`, so each observation in `y` will be matched to several in `x`.
- One-to-one: `by` uniquely identifies rows in both `x` and `y`, so each observation in `x` will be matched to exactly one in `y`.
- Many-to-many: `by` does not uniquely identify rows in either `x` or `y`.
Unfortunately, when you perform a `join()` or `inexact_join()`, it doesn't tell you which of those you've just done! This can be especially problematic if you've accidentally done a many-to-many join, since many-to-many joins often lead to unexpected results.
`safe_join()` is a wrapper for all `join()` and `inexact_join()` functions which tells you whether you are, in fact, doing the join you expect to be doing, and returns an error if you're not.
```r
safe_join(x, y,
  expect = NULL, join = NULL,
  ...
)
```
`x`, `y`, and `...` are the standard `join()`/`inexact_join()` arguments that you would normally use. See `help(join, package = "dplyr")` or the inexact_join() section above to see what arguments might go in `...` to pass through to those functions, such as `suffix` or `var`.
`expect` is a character variable where you specify the type of join you think you're about to do. You can specify this either as one-to-many / many-to-one / one-to-one directly, or you can specify which of the two data sets (`x` or `y`) you think should be uniquely identified by the joining variables:
- `expect = "1:1"`, `expect = c("x", "y")`, or `expect = "xy"` indicates that you expect to join one-to-one.
- `expect = "m:1"` or `expect = "y"` indicates that you expect to join many-to-one.
- `expect = "1:m"` or `expect = "x"` indicates that you expect to join one-to-many.
- `expect = "no m:m"` indicates that you don't care whether you're one-to-one, one-to-many, or many-to-one, as long as you're not many-to-many.

There is no `expect` option that allows you to run a many-to-many join. `safe_join()` will return an error if your data do not match your `expect` selection.
If your data do match your `expect` option, then it will look to your `join`. `join` is the function for the `join` or `inexact_join` you'd like to run, for example `join = inexact_left_join`.
If run without a `join` specified, `safe_join()` will return `TRUE` if you're good to go. If run with a `join` specified, `safe_join()` will instead pass your data on to the function and actually run the join for you.
There is little reason to run any `join()` or `inexact_join()` without going through `safe_join()`. It will help you avoid some nasty surprises!
```r
# left is panel data and i does not uniquely identify observations
left <- data.frame(
  i = c(1, 1, 2, 2),
  t = c(1, 2, 1, 2),
  a = 1:4
)
# right is individual-level data uniquely identified by i
right <- data.frame(
  i = c(1, 2),
  b = 1:2
)

# I think that I can do a one-to-one merge on i,
# forgetting that left is identified by i and t together.
# So, this produces an error
try(
  safe_join(left, right, expect = "1:1", join = left_join)
)
#> Error in safe_join(left, right, expect = "1:1", join = left_join) :
#>   The left-hand data set x is not uniquely identified by the joining variable i.

# If I realize I'm doing a many-to-one merge, that is correct,
# so safe_join will return TRUE if we don't specify a join,
# or perform the join for us if we do
safe_join(left, right, expect = "m:1")
#> [1] TRUE
safe_join(left, right, expect = "m:1", join = left_join)
#>   i t a b
#> 1 1 1 1 1
#> 2 1 2 2 1
#> 3 2 1 3 2
#> 4 2 2 4 2
```
## Mutate Variations

`pmdplyr` adds several new versions of `dplyr::mutate()` that help with some common panel-data manipulation needs.
### mutate_subset()

`mutate_subset()` is a function that performs a `dplyr::summarize()` command on a subset (`.filter`) of your data, and then takes the result and applies it to all observations in your data (or all observations within group, if grouped).
The most common use of this is to partially widen your data. Panel data can be stored in "wide" format, where there is one row per individual and, for each variable, one column per time period, or in the more common (and assumed in `pmdplyr`) "long" format, where there is one (or possibly more than one) row per individual/time period.
The benefit of wide data is that it makes it very easy to compare variables across wide stretches of time. How much has asset `1` increased in value from the beginning of the sample? Easy in wide data, a little trickier in long (although it could be done with a `tlag()`; see the tlag() section).
If you only have a few such comparisons to make, `mutate_subset()` lets you make them without fully widening the data. Just make a "value at the beginning of the sample" variable, if that's all you need, without having to bother fully widening.
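For example, here is a minimal sketch of that idea with hypothetical asset data (`asset_df`, `first_value`, and `growth` are made up for illustration):

```r
asset_df <- pibble(
  asset = c(1, 1, 1, 2, 2, 2),
  t = c(1, 2, 3, 1, 2, 3),
  value = c(100, 110, 130, 50, 45, 60),
  .i = asset, .t = t
)

asset_df %>%
  # Within each asset, summarize value over only the t == 1 rows,
  # giving every row that asset's beginning-of-sample value
  mutate_subset(first_value = mean(value), .filter = t == 1) %>%
  # Then compare each period's value to that starting value
  mutate(growth = value / first_value - 1)
```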
Another common use is to make specific comparisons within groups. If I want to know how your earnings compare to the average earnings in your state, I can just do a `within_i()` calculation (see the Additional Calculations section). But what if I want to know how your earnings compare to the average earnings of college graduates in your state? That's harder. But `mutate_subset()` makes it easy.
```r
mutate_subset(.df, ...,
  .filter,
  .group_i = TRUE,
  .i = NULL, .t = NULL, .d = NA,
  .uniqcheck = FALSE,
  .setpanel = TRUE
)
```
where `.df` is the data set being mutated and `...` is a set of name-value pairs of expressions in the style of `dplyr::mutate()`. Note that, since the idea here is to get a summary measure from a filtered group, expressions should be written such that they would be valid arguments in `dplyr::summarize()`.
`.filter` is a logical condition that describes the observations that you want to perform the `...` calculations on.
Let’s perform the analysis we described above, comparing an individual’s earnings to the average earnings of college graduates in their state:
```r
df <- pibble(
  state = c("CA", "CA", "CA", "NV", "NV", "NV"),
  college = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  earn = c(1, 2, 3, 2, 3, 2),
  .i = state
)

df %>%
  # Calculate average earnings of college grads
  mutate_subset(college_earnings = mean(earn), .filter = college == TRUE) %>%
  # And compare to our own earnings
  mutate(earnings_vs_college = earn - college_earnings)
#> # A tibble: 6 x 5
#>   state college  earn college_earnings earnings_vs_college
#>   <chr> <lgl>   <dbl>            <dbl>               <dbl>
#> 1 CA    TRUE        1              1.5                -0.5
#> 2 CA    TRUE        2              1.5                 0.5
#> 3 CA    FALSE       3              1.5                 1.5
#> 4 NV    TRUE        2              2                   0
#> 5 NV    FALSE      3              2                   1
#> 6 NV    FALSE      2              2                   0
```
The rest of the options include `.group_i` (by default, if `.i` can be found, analysis will be performed within-individual; set `.group_i = FALSE` to ignore this), and standard arguments related to declaring the panel structure of the data (`.i`, `.t`, `.d`, `.uniqcheck`; see the "Get Started" vignette). The default `.d = NA` will become `.d = 1` if either `.i` or `.t` is declared. `.setpanel` ensures that if you declare the panel structure in the `mutate_subset()` function, it will be maintained in the object you get back.
### mutate_cascade()

`mutate_cascade()` performs `dplyr::mutate()` on a data set one time period at a time, in order, allowing the mutate from each time period to finish before moving on to the next one. The changes "cascade down" through time. This can be handy if your `mutate()` command makes reference to earlier time periods, usually with `tlag()` (see below), and you want changes in one period to be passed down to the next.
In effect, you can think of `mutate_cascade()` as behaving much like `cumsum()`, `cumprod()`, `cummax()`, or `cummin()`, except that it (1) respects the panel structure of the data, (2) works when you have multiple observations per `.i`/`.t`, (3) is much more flexible, and (4) is much slower.
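To make the analogy concrete, here is a sketch that reproduces `cumsum()` with `mutate_cascade()`, assuming a simple gap-free panel (the variable `cum_x` is made up for illustration):

```r
pibble(
  t = 1:5,
  x = c(1, 2, 3, 4, 5),
  .t = t
) %>%
  # Initialize the running total with each period's own value
  mutate(cum_x = x) %>%
  # Each period then adds the previous period's completed total.
  # The first period is skipped (.skip = TRUE is the default),
  # leaving cum_x = x there, so the result matches cumsum(x).
  mutate_cascade(cum_x = x + tlag(cum_x))
```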
As of this writing, `mutate_cascade()` is pretty darn slow (after all, if you have T time periods, you're running T separate `mutate()` commands in a loop!), so be careful in using it.
```r
mutate_cascade(.df, ...,
  .skip = TRUE,
  .backwards = FALSE,
  .group_i = TRUE,
  .i = NULL, .t = NULL, .d = NA,
  .uniqcheck = FALSE,
  .setpanel = TRUE
)
```
where `.df` is the data set being mutated, and `...` is the list of expressions to be passed to `dplyr::mutate()`.
`.skip` instructs `mutate_cascade()` to skip over the first time period within each group (or the last time period if `.backwards = TRUE`). This should usually be set to `TRUE`, since most usages of `mutate_cascade()` involve a `tlag()`, and the `tlag()` of something in the first time period is usually `NA`. If that first period isn't skipped, its `NA` gets baked in: the `tlag()` in period 2 becomes `NA` as well, and the `NA` will cascade down to make your whole data set `NA`. If you set `.skip = FALSE`, you should have some way in `...` of handling `NA`s (which you may want to do anyway, in case there are `NA`s elsewhere in your data).
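One way to do that, as a sketch: wrap the `tlag()` in `dplyr::coalesce()` so a missing lag becomes `0` instead of propagating (the variable `total` is made up for illustration):

```r
pibble(
  t = 1:3,
  x = c(1, 2, 3),
  .t = t
) %>%
  mutate(total = x) %>%
  # coalesce() turns the period-1 NA lag into 0, so the NA
  # never cascades into later periods
  mutate_cascade(
    total = x + coalesce(tlag(total), 0),
    .skip = FALSE
  )
```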
`.backwards`, unsurprisingly, tells `mutate_cascade()` to start with the last time period and work backwards.
Let's do a very simple example and use `mutate_cascade()` to build a present discounted value. We have an asset with a `payout` each period, and we have a discount factor of `.95`. We can build a present discounted value `PDV` by taking the `PDV` in the next period, multiplying it by `.95`, and adding on the current `payout`. But we need to calculate `PDV` one period at a time, so that we can use each period's calculation to calculate the previous one.
```r
df <- pibble(
  t = c(1, 2, 3, 4, 5),
  payout = c(3, 4, 2, 2, 4),
  .t = t
) %>%
  mutate(PDV = payout) %>%
  mutate_cascade(
    PDV = payout + .95 * tlag(PDV, .n = -1),
    .backwards = TRUE
  )

df
#> # A tibble: 5 x 3
#>       t payout   PDV
#>   <dbl>  <dbl> <dbl>
#> 1     1      3 13.6
#> 2     2      4 11.1
#> 3     3      2  7.51
#> 4     4      2  5.8
#> 5     5      4  4
```
As expected, the `PDV` in period `5` is just the payout: `4`. In period `4` it's `2 + .95*4 = 5.8`. Then in period `3` it's `2 + .95*5.8 = 7.51`, and so on.
The rest of the options include `.group_i` (by default, if `.i` can be found, analysis will be performed within-individual; set `.group_i = FALSE` to ignore this), and standard arguments related to declaring the panel structure of the data (`.i`, `.t`, `.d`, `.uniqcheck`; see the "Get Started" vignette). The default `.d = NA` will become `.d = 1` if either `.i` or `.t` is declared. `.setpanel` ensures that if you declare the panel structure in the `mutate_cascade()` function, it will be maintained in the object you get back.
## tlag()

`tlag()` is a function that lags a variable in time. It respects the panel structure of the data, works with multiple observations per combination of `.i`/`.t`, and, unlike `plm::lag()`, doesn't run into masking problems by sharing a name with `dplyr::lag()`. Do remember that `dplyr::lag()` does not lag data in time; it lags data in the order of the data set.
```r
tlag(.var,
  .df = get(".", envir = parent.frame()),
  .n = 1,
  .default = NA,
  .quick = FALSE,
  .resolve = "error",
  .group_i = TRUE,
  .i = NULL, .t = NULL, .d = NA,
  .uniqcheck = FALSE
)
```
where `.var` is the variable being lagged, and `.df` is the data set that variable lives in. If the data set is being passed in via `%>% mutate()` or similar, then `.df` will automatically pick it up and you don't need to specify it.
`.n` is the number of periods to lag. Negative values of `.n` imply a lead instead of a lag (as in the `mutate_cascade()` example in the Mutate Variations section). There's not a separate `tlead()` function.
`.default` is the value to use if a lag does not exist. By default, this is `NA`. So if you have data in periods `1` and `3` but not `2`, then the `tlag()` in the third period will produce `NA`.
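A quick sketch of that gap behavior:

```r
# Periods 1 and 3 exist but period 2 doesn't, so the one-period lag
# in period 3 has nothing to pull from and returns .default (NA)
pibble(
  t = c(1, 3),
  x = c(10, 20),
  .t = t
) %>%
  mutate(x_lag = tlag(x))
```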
`.quick` is a setting you can use if your data is very nicely structured, with rows uniquely identified by `.i`/`.t` and either no gaps between time periods or `.d = 0`. `tlag()` will run more quickly with `.quick = TRUE`, but will produce incorrect results if these conditions are not met.
```r
df <- pibble(
  i = c(1, 1, 1, 2, 2, 2),
  t = c(1, 2, 3, 1, 2, 3),
  x = 1:6,
  .i = i, .t = t
) %>%
  # A lag and a lead, filling in the lead with 0 instead of NA
  mutate(
    x_lag = tlag(x),
    x_lead = tlag(x, .n = -1, .default = 0),
    # Our data satisfies the .quick conditions so we can
    # do that for a little extra speed
    x_quicklag = tlag(x, .quick = TRUE)
  )

df
#> # A pibble: 6 x 6
#> # Individual-level identifier (.i): i [2]
#> # Time variable (.t): t [3]
#> # Gap (.d): 1
#>       i     t     x x_lag x_lead x_quicklag
#>   <dbl> <dbl> <int> <int>  <dbl>      <int>
#> 1     1     1     1    NA      2         NA
#> 2     1     2     2     1      3          1
#> 3     1     3     3     2      0          2
#> 4     2     1     4    NA      5         NA
#> 5     2     2     5     4      6          4
#> 6     2     3     6     5      0          5
```
If `.var` is not constant within combinations of `.i` and `.t`, we have a problem! Which value do we choose to use for the purpose of filling in other observations? `.resolve` makes this choice. By default, there will be an "error" if values of `.var` are inconsistent within `.i` and `.t`. Or, set `.resolve` to a summary function like `.resolve = mean` or `.resolve = function(x) mean(x, na.rm = TRUE)` to resolve inconsistencies before filling in.
```r
df <- pibble(
  i = c(1, 1, 1, 2, 2, 2),
  t = c(1, 1, 2, 1, 1, 2),
  x = 1:6,
  .i = i, .t = t
) %>%
  mutate(x_lag = tlag(x, .resolve = mean))

df
#> # A pibble: 6 x 4
#> # Individual-level identifier (.i): i [2]
#> # Time variable (.t): t [2]
#> # Gap (.d): 1
#>       i     t     x x_lag
#>   <dbl> <dbl> <int> <dbl>
#> 1     1     1     1    NA
#> 2     1     1     2    NA
#> 3     1     2     3   1.5
#> 4     2     1     4    NA
#> 5     2     1     5    NA
#> 6     2     2     6   4.5
```
The rest of the options include `.group_i` (by default, if `.i` can be found, lags will be performed within-individual; set `.group_i = FALSE` to ignore this), and standard arguments related to declaring the panel structure of the data (`.i`, `.t`, `.d`, `.uniqcheck`; see the "Get Started" vignette). The default `.d = NA` will become `.d = 1` if either `.i` or `.t` is declared.