Join two data frames inexactly — inexact

These functions modify the standard dplyr join functions so that a variable of an ordered type (such as a date or numeric) in x can be matched inexactly to variables in y.

inexact_inner_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  var = NULL,
  jvar = NULL,
  method,
  exact = TRUE
)

inexact_left_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  var = NULL,
  jvar = NULL,
  method,
  exact = TRUE
)

inexact_right_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  var = NULL,
  jvar = NULL,
  method,
  exact = TRUE
)

inexact_full_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  var = NULL,
  jvar = NULL,
  method,
  exact = TRUE
)

inexact_semi_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  ...,
  var = NULL,
  jvar = NULL,
  method,
  exact = TRUE
)

inexact_nest_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  keep = FALSE,
  name = NULL,
  ...,
  var = NULL,
  jvar = NULL,
  method,
  exact = TRUE
)

inexact_anti_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  ...,
  var = NULL,
  jvar = NULL,
  method,
  exact = TRUE
)

Arguments

x, y, by, copy, suffix, keep, name, ...	Arguments to be passed to the relevant `join` function.
var	Quoted or unquoted variable from the `x` data frame which is to be inexactly matched.
jvar	Quoted or unquoted variable(s) from the `y` data frame which are to be inexactly matched. These cannot share a name with any variable in `x`, including `var`.
method	The approach to be taken in performing the inexact matching: one of "last", "next", "closest", or "between" (see Details).
exact	A logical, where `TRUE` indicates that exact matches are acceptable. For example, if `method = 'last'`, `x` contains `var = 2`, and `y` contains `jvar = 1` and `jvar = 2`, then `exact = TRUE` will match with the `jvar = 2` observation, and `exact = FALSE` will match with the `jvar = 1` observation. If `jvar` contains two variables and you want them treated differently, set `exact` to `c(TRUE,FALSE)` or `c(FALSE,TRUE)`.
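
The `exact` example above can be tried on a toy pair of data frames. This is a minimal sketch: the data frames and the column names `day`, `obs_day`, and `val` are made up for illustration, and the expected matches in the comments simply restate the description of `exact` rather than recorded output.

```r
library(pmdplyr)

# Hypothetical toy data: one row in x, two candidate rows in y
x <- data.frame(day = 2)
y <- data.frame(obs_day = c(1, 2), val = c("first", "second"))

# exact = TRUE (the default): day = 2 is allowed to match obs_day = 2 itself
with_exact <- inexact_left_join(x, y,
  method = "last",
  var = day,
  jvar = obs_day
)

# exact = FALSE: only strictly earlier values qualify,
# so day = 2 should match obs_day = 1 instead
without_exact <- inexact_left_join(x, y,
  method = "last",
  var = day,
  jvar = obs_day,
  exact = FALSE
)
```

Comparing `with_exact$val` and `without_exact$val` shows which row of `y` each call picked up.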

Details

These joins are useful when, for example, one data set contains observations from multiple days in a week while the other data set is weekly. Another example is matching an observation in one data set to the *most recent* previous observation in the other.

The available methods for matching are:

method = "last" matches var to the closest value of jvar that is *lower* (or equal, if exact = TRUE).
method = "next" matches var to the closest value of jvar that is *higher* (or equal, if exact = TRUE).
method = "closest" matches var to the closest value of jvar, above or below. If var is equidistant between two values, the lower of the two is chosen.
method = "between" requires two variables in jvar, which give the beginning and end of a range, and matches var to the range it falls in. Make sure the ranges do not overlap within the joining variables; otherwise the results may be unexpected (specifically, var should join to the earliest-starting range). If the end of one range is exactly the start of another, exact = c(TRUE,FALSE) or exact = c(FALSE,TRUE) is recommended to avoid overlaps. For this method, exact defaults to c(TRUE,FALSE).
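
The four rules can be sketched in base R for a single value of var. The helpers `match_rule()` and `match_between()` below are hypothetical illustrations of the rules as described here, not pmdplyr's internal code, and they ignore grouping by the `by` variables.

```r
# Hypothetical helpers mimicking the documented matching rules for one value.
# Illustration only; this is not how pmdplyr implements the joins.

# "last", "next", and "closest" for a single value v against candidates j
match_rule <- function(v, j, method, exact = TRUE) {
  j <- sort(unique(j))
  below <- if (exact) j[j <= v] else j[j < v]
  above <- if (exact) j[j >= v] else j[j > v]
  switch(method,
    "last" = if (length(below)) max(below) else NA_real_,
    "next" = if (length(above)) min(above) else NA_real_,
    "closest" = {
      cand <- if (exact) j else j[j != v]
      # which.min() returns the first minimum, so ties go to the lower value
      if (length(cand)) cand[which.min(abs(cand - v))] else NA_real_
    },
    stop("unknown method")
  )
}

# "between": index of the earliest-starting range that contains v
match_between <- function(v, start, end, exact = c(TRUE, FALSE)) {
  lower_ok <- if (exact[1]) start <= v else start < v
  upper_ok <- if (exact[2]) v <= end else v < end
  hit <- which(lower_ok & upper_ok)
  if (length(hit)) hit[which.min(start[hit])] else NA_integer_
}

years <- c(2006, 2008, 2010, 2012, 2014, 2016, 2018)

match_rule(2011, years, "last")                  # 2010
match_rule(2011, years, "next")                  # 2012
match_rule(2011, years, "closest")               # 2010 (tie goes to the lower value)
match_rule(2010, years, "last", exact = FALSE)   # 2008
match_rule(2005, years, "last")                  # NA, nothing at or below 2005

# Ranges that share endpoints: [2006, 2008), [2008, 2010), ...
idx_half_open <- match_between(2010, years, years + 2)
years[idx_half_open]                             # 2010

# With both ends inclusive, 2010 sits in two ranges and the earlier one is picked
idx_closed <- match_between(2010, years, years + 2, exact = c(TRUE, TRUE))
years[idx_closed]                                # 2008
```

The last two calls show why exact = c(TRUE,FALSE) is the safer choice when one range ends exactly where the next begins.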

Note that if var has no valid match under the chosen method, it will be merged with any observations for which is.na(jvar[1]) is true.

Examples


library(pmdplyr)
library(magrittr) # for %>%

data(Scorecard)
# We also have this data on the December unemployment rate for US college grads nationally,
# but only every other year
unemp_data <- data.frame(
  unemp_year = c(2006, 2008, 2010, 2012, 2014, 2016, 2018),
  unemp = c(.017, .036, .048, .040, .028, .025, .020)
)
# I want to match the most recent unemployment data I have to each college.
# Each result is stored under a new name so that Scorecard itself never gains
# an unemp_year column; jvar names must not already be in x.
Scorecard_last <- Scorecard %>%
  inexact_left_join(unemp_data,
    method = "last",
    var = year,
    jvar = unemp_year
  )
#> Joining, by = "unemp_year"

# Or perhaps I want to find the most recent lagged value (i.e. no exact matches, only earlier ones)
Scorecard_lag <- Scorecard %>%
  inexact_left_join(unemp_data,
    method = "last",
    var = year,
    jvar = unemp_year,
    exact = FALSE
  )

# Another way to match the most recent value is to specify the range of years
# that each unemployment figure covers
unemp_data$unemp_year2 <- unemp_data$unemp_year + 2
Scorecard_between <- Scorecard %>%
  inexact_left_join(unemp_data,
    method = "between",
    var = year,
    jvar = c(unemp_year, unemp_year2)
  )