These functions are modifications of the standard dplyr join functions that allow a variable of an ordered type (like date or numeric) in x to be matched in inexact ways to variables in y.
inexact_inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...,
  var = NULL, jvar = NULL, method, exact = TRUE)

inexact_left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...,
  var = NULL, jvar = NULL, method, exact = TRUE)

inexact_right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...,
  var = NULL, jvar = NULL, method, exact = TRUE)

inexact_full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...,
  var = NULL, jvar = NULL, method, exact = TRUE)

inexact_semi_join(x, y, by = NULL, copy = FALSE, ...,
  var = NULL, jvar = NULL, method, exact = TRUE)

inexact_nest_join(x, y, by = NULL, copy = FALSE, keep = FALSE, name = NULL, ...,
  var = NULL, jvar = NULL, method, exact = TRUE)

inexact_anti_join(x, y, by = NULL, copy = FALSE, ...,
  var = NULL, jvar = NULL, method, exact = TRUE)
Argument | Description |
---|---|
x, y, by, copy, suffix, keep, name, ... | Arguments to be passed to the relevant |
var | Quoted or unquoted variable from the |
jvar | Quoted or unquoted variable(s) from the |
method | The approach to be taken in performing the indirect matching. |
exact | A logical, where |
This allows matching, for example, if one data set contains data from multiple days in the week, while the other data set is weekly. Another example might be matching an observation in one data set to the *most recent* previous observation in the other.
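As a rough illustration of the daily-to-weekly case, the following base R sketch attaches each daily observation to the week it falls in. The data frames daily and weekly and their columns are hypothetical, and findInterval() is used here only to show the idea; it is not presented as how these join functions work internally.

```r
# Hypothetical data: daily observations and a weekly series keyed by week start
daily <- data.frame(day = as.Date("2021-03-01") + c(0, 2, 6, 7, 9))
weekly <- data.frame(
  week_start = as.Date(c("2021-03-01", "2021-03-08")),
  rate = c(1.5, 1.7)
)

# For each day, findInterval() gives the index of the latest week_start
# that is less than or equal to it (0 if there is none).
# weekly$week_start must be sorted in ascending order for this to work.
idx <- findInterval(as.numeric(daily$day), as.numeric(weekly$week_start))
idx[idx == 0] <- NA

# Carry the weekly values over to the daily rows
daily$week_start <- weekly$week_start[idx]
daily$rate <- weekly$rate[idx]
daily
```

Here each day picks up the rate of the most recent week start on or before it, which is the kind of match method = "last" describes.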
The available methods for matching are:
method = "last" matches var to the closest value of jvar that is *lower*.
method = "next" matches var to the closest value of jvar that is *higher*.
method = "closest" matches var to the closest value of jvar, above or below. If equidistant between two values, picks the lower of the two.
method = "between" requires two variables in jvar which constitute the beginning and end of a range, and matches var to the range it is in. Make sure that the ranges are non-overlapping within the joining variables, or else you will get strange results (specifically, it should join to the earliest-starting range). If the end of one range is the exact start of another, exact = c(TRUE,FALSE) or exact = c(FALSE,TRUE) is recommended to avoid overlaps. Defaults to exact = c(TRUE,FALSE).
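To make the four matching rules concrete, here is a minimal base R sketch of the logic they describe. The helpers match_inexact() and match_between() are hypothetical names written for this illustration and are not part of the package. The sketch assumes that exact = TRUE means an exact match counts as a match, which is how the examples on this page use it.

```r
# Hypothetical helper: for each value in v, pick a value of j under the
# "last", "next", or "closest" rule. Returns NA where nothing qualifies.
match_inexact <- function(v, j, method = c("last", "next", "closest"),
                          exact = TRUE) {
  method <- match.arg(method)
  j <- sort(j)
  sapply(v, function(vi) {
    cand <- switch(method,
      "last" = if (exact) j[j <= vi] else j[j < vi],
      "next" = if (exact) j[j >= vi] else j[j > vi],
      "closest" = j
    )
    if (length(cand) == 0) return(NA_real_)
    # which.min() returns the first minimum; because j is sorted ascending,
    # a tie under "closest" resolves to the lower value
    cand[which.min(abs(cand - vi))]
  })
}

# Hypothetical helper: index of the first range containing each value of v.
# exact[1] controls whether the start is inclusive, exact[2] the end.
match_between <- function(v, start, end, exact = c(TRUE, FALSE)) {
  sapply(v, function(vi) {
    above <- if (exact[1]) vi >= start else vi > start
    below <- if (exact[2]) vi <= end else vi < end
    hit <- which(above & below)
    if (length(hit) == 0) NA_integer_ else hit[1]
  })
}

jvar <- c(2006, 2008, 2010)
var <- c(2005, 2006, 2007, 2009, 2011)

match_inexact(var, jvar, "last")
match_inexact(var, jvar, "next")
match_inexact(var, jvar, "closest")
match_inexact(var, jvar, "last", exact = FALSE)

# Ranges [2006, 2008), [2008, 2010), [2010, 2012): 2008 falls only in the second
match_between(c(2005, 2008, 2011, 2012), start = jvar, end = jvar + 2)
```

With exact = c(TRUE, FALSE), a value sitting on a shared boundary such as 2008 belongs to exactly one range, which is why that setting avoids overlaps when ranges are back to back.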
Note that if, given the method, var finds no proper match, it will be merged with any is.na(jvar[1]) values.
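The consequence of that note can be sketched in base R. The example assumes that an unmatched var ends up with a missing matching key, and uses match(), which treats NA as equal to NA, to stand in for the join step. The data and the placeholder value 0.999 are hypothetical.

```r
# Hypothetical y with one row whose joining value is missing
y <- data.frame(
  unemp_year = c(2006, 2008, NA),
  unemp = c(0.017, 0.036, 0.999)
)

# Suppose the first observation found no match (key is NA)
# and the second matched 2006
matched_key <- c(NA, 2006)

# match() pairs NA with NA, so the unmatched observation
# picks up the row of y where unemp_year is missing
rows <- match(matched_key, y$unemp_year)
y$unemp[rows]

# Dropping rows with a missing joining value beforehand avoids this
y_clean <- y[!is.na(y$unemp_year), ]
match(matched_key, y_clean$unemp_year)
```

If y may contain missing values in jvar, filtering them out before the join is one way to keep unmatched rows from picking up unintended data.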
library(pmdplyr) # provides the inexact joins and the Scorecard data
data(Scorecard)

# We also have this data on the December unemployment rate for US college grads nationally
# but only every other year
unemp_data <- data.frame(
  unemp_year = c(2006, 2008, 2010, 2012, 2014, 2016, 2018),
  unemp = c(.017, .036, .048, .040, .028, .025, .020)
)

# I want to match the most recent unemployment data I have to each college.
# Each result is stored under a new name: joining onto a Scorecard that already
# contains unemp_year stops with "The variable names in jvar should not be in x"
Scorecard_last <- Scorecard %>%
  inexact_left_join(unemp_data,
    method = "last",
    var = year,
    jvar = unemp_year
  )

# Or perhaps I want to find the most recent lagged value (i.e. no exact matches, only recent ones)
Scorecard_lagged <- Scorecard %>%
  inexact_left_join(unemp_data,
    method = "last",
    var = year,
    jvar = unemp_year,
    exact = FALSE
  )

# Another way to do the same thing would be to specify the range of unemp_years I want exactly
unemp_data$unemp_year2 <- unemp_data$unemp_year + 2
Scorecard_between <- Scorecard %>%
  inexact_left_join(unemp_data,
    method = "between",
    var = year,
    jvar = c(unemp_year, unemp_year2)
  )