This function is a wrapper for the standard dplyr `join` functions and the pmdplyr `inexact_join` functions.
safe_join(x, y, expect = NULL, join = NULL, ...)
Argument | Description |
---|---|
x, y | The left and right data sets to join. |
expect | Either |
join | A |
... | Other arguments to be passed to the function specified in `join`. |
When performing a join, we generally expect that one or both of the joined data sets is uniquely identified by the set of joining variables.
If this is not true, the results of the join will often not be what you expect. Unfortunately, `join` does not warn you that you may have just done something strange.
This issue is especially likely to arise with panel data, where you may have multiple different data sets at different observation levels.
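As a minimal sketch of the problem described above, the following uses base R `merge()` as a stand-in for a dplyr join. The data frames `sales` and `visits` are hypothetical and are not part of pmdplyr. Neither table is uniquely identified by `i`, so the join silently multiplies rows:

```r
# Hypothetical toy data: neither table is uniquely identified by i
sales <- data.frame(i = c(1, 1, 2), sale = c(10, 20, 30))
visits <- data.frame(i = c(1, 1, 2), visit = c("a", "b", "c"))

# Base R merge() stands in for a dplyr join here; it gives no warning
joined <- merge(sales, visits, by = "i")

nrow(sales)   # 3
nrow(joined)  # 5: each i = 1 row in sales matches both i = 1 rows in visits
```

The result has more rows than either input, and nothing in the output signals that this happened.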
`safe_join` forces you to specify which of your data sets you think are uniquely identified by the joining variables. If you are wrong, it will return an error. If you are right, it will pass you on to your preferred `join` function, given in `join`. If `join` is not specified, it will just return `TRUE`.
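The kind of uniqueness check described here can be sketched in base R. This is only an illustration of the idea, not how pmdplyr implements it; the helper `is_unique_by` is a hypothetical name introduced for this example:

```r
# Hypothetical helper, not part of pmdplyr: TRUE if the key columns
# uniquely identify the rows of df
is_unique_by <- function(df, keys) {
  anyDuplicated(df[, keys, drop = FALSE]) == 0
}

left <- data.frame(i = c(1, 1, 2, 2), t = c(1, 2, 1, 2), a = 1:4)
right <- data.frame(i = c(1, 2), b = 1:2)

is_unique_by(left, "i")          # FALSE: i repeats across time periods
is_unique_by(left, c("i", "t"))  # TRUE
is_unique_by(right, "i")         # TRUE

# A many-to-one expectation only needs the right-hand table to be unique
stopifnot(is_unique_by(right, "i"))
```

Under this reading, a one-to-one expectation on `i` would fail because `left` is not unique by `i`, while a many-to-one expectation would pass.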
```r
library(dplyr)    # provides left_join
library(pmdplyr)  # provides safe_join

# left is panel data and i does not uniquely identify observations
left <- data.frame(
  i = c(1, 1, 2, 2),
  t = c(1, 2, 1, 2),
  a = 1:4
)
# right is individual-level data uniquely identified by i
right <- data.frame(
  i = c(1, 2),
  b = 1:2
)

# I think that I can do a one-to-one merge on i
# Forgetting that left is identified by i and t together
# So, this produces an error
if (FALSE) {
  safe_join(left, right, expect = "1:1", join = left_join)
}

# If I realize I'm doing a many-to-one merge, that is correct,
# so safe_join will perform it for us
safe_join(left, right, expect = "m:1", join = left_join)
#>   i t a b
#> 1 1 1 1 1
#> 2 1 2 2 1
#> 3 2 1 3 2
#> 4 2 2 4 2
```