Join two data frames safely — safe

This function is a wrapper for the standard dplyr join functions and the pmdplyr inexact_join functions.

safe_join(x, y, expect = NULL, join = NULL, ...)

Arguments

x, y	The left and right data sets to join.
expect	The match you expect to perform: `"1:m"` (or `"x"`), `"m:1"` (or `"y"`), or `"1:1"` (or `c("x", "y")` or `"xy"`). You can give either the kind of match (one-to-many, many-to-one, or one-to-one) or the data set(s) you expect to be uniquely identified by the joining variables (`"x"`, `"y"`, or `c("x", "y")`/`"xy"`). Alternatively, set `expect = "no m:m"` if you don't care which join you're doing as long as it isn't many-to-many.
join	A `join` or `inexact_join` function to run if `safe_join` determines your join is safe. If `join` is left at its default, `safe_join` simply returns `TRUE` instead of running the join.
...	Other arguments to be passed to the function specified in `join`. If performing an `inexact_join`, put the `var` and `jvar` arguments in as quoted variables.
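
The equivalent spellings of `expect` can be read as a mapping from each accepted value to the data set(s) that the joining variables must uniquely identify. The base R sketch below illustrates that mapping only. `expect_to_unique` is a hypothetical helper, not part of pmdplyr, and it does not claim to mirror how `safe_join` parses the argument internally.

```r
# Hypothetical helper (not part of pmdplyr): translate an `expect` value
# into the data set(s) that must be uniquely identified by the join keys.
expect_to_unique <- function(expect) {
  # Collapse vector forms so that c("x", "y") and c("y", "x") both read "xy"
  key <- paste(sort(expect), collapse = "")
  switch(key,
    "1:m" = ,
    "x" = "x",
    "m:1" = ,
    "y" = "y",
    "1:1" = ,
    "xy" = c("x", "y"),
    # "either" is this sketch's own label for "at least one of x and y"
    "no m:m" = "either",
    stop("Unrecognized value for expect: ", key)
  )
}

expect_to_unique("m:1")
#> [1] "y"
expect_to_unique(c("x", "y"))
#> [1] "x" "y"
```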

Details

When performing a join, we generally expect that one or both of the joined data sets is uniquely identified by the set of joining variables.

If this is not true, the results of the join will often not be what you expect. Unfortunately, the join functions do not warn you that you may have just done something strange.
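
As a small illustration of what "something strange" can look like, the sketch below joins two data sets that both repeat the key. It uses base R's `merge()` as a dependency-free stand-in for a dplyr join, and the data frames `x` and `y` are made up for this example.

```r
# Neither data set is uniquely identified by i: i = 1 appears twice in each
x <- data.frame(i = c(1, 1, 2), a = 1:3)
y <- data.frame(i = c(1, 1, 2), b = 4:6)

# merge() returns every pairing of matching rows, so the two i = 1 rows
# in x each match both i = 1 rows in y
joined <- merge(x, y, by = "i")
nrow(joined)
#> [1] 5
```

The result has five rows even though each input has three, and `merge()` completes without any message.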

This issue is especially likely to arise with panel data, where you may have multiple different data sets at different observation levels.

safe_join forces you to specify which of your data sets you think are uniquely identified by the joining variables. If you are wrong, it will return an error. If you are right, it will pass you on to your preferred join function, given in join. If join is not specified, it will just return TRUE.
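
The check-then-join idea described here can be sketched in base R. This is a minimal illustration under stated assumptions, not pmdplyr's implementation: `is_unique_by` and `checked_merge` are hypothetical helpers, `unique_in` is a hypothetical argument naming the data set(s) expected to be unique, and base `merge()` stands in for the join function.

```r
# Hypothetical helper: TRUE if the columns in `by` uniquely identify rows of df
is_unique_by <- function(df, by) {
  anyDuplicated(df[, by, drop = FALSE]) == 0
}

# Hypothetical stand-in for the check: stop if an expectation fails,
# otherwise run the join (base merge() here)
checked_merge <- function(x, y, by, unique_in) {
  if ("x" %in% unique_in && !is_unique_by(x, by)) {
    stop("x is not uniquely identified by ", paste(by, collapse = ", "))
  }
  if ("y" %in% unique_in && !is_unique_by(y, by)) {
    stop("y is not uniquely identified by ", paste(by, collapse = ", "))
  }
  merge(x, y, by = by)
}

left <- data.frame(i = c(1, 1, 2, 2), t = c(1, 2, 1, 2), a = 1:4)
right <- data.frame(i = c(1, 2), b = 1:2)

# Many-to-one: only y needs to be unique, so the join runs
ok <- checked_merge(left, right, by = "i", unique_in = "y")

# One-to-one: x is not unique by i alone, so the check stops with an error
msg <- tryCatch(
  checked_merge(left, right, by = "i", unique_in = c("x", "y")),
  error = function(e) conditionMessage(e)
)
```

Running the check before the join means a wrong expectation fails loudly instead of silently producing extra rows.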

Examples

library(dplyr)
library(pmdplyr)

# left is panel data and i does not uniquely identify observations
left <- data.frame(
  i = c(1, 1, 2, 2),
  t = c(1, 2, 1, 2),
  a = 1:4
)
# right is individual-level data uniquely identified by i
right <- data.frame(
  i = c(1, 2),
  b = 1:2
)

# I think that I can do a one-to-one merge on i
# Forgetting that left is identified by i and t together
# So, this produces an error
if (FALSE) {
  safe_join(left, right, expect = "1:1", join = left_join)
}

# If I realize I'm doing a many-to-one merge, that is correct,
# so safe_join will perform it for us
safe_join(left, right, expect = "m:1", join = left_join)
#> Joining, by = "i"
#>   i t a b
#> 1 1 1 1 1
#> 2 1 2 2 1
#> 3 2 1 3 2
#> 4 2 2 4 2