Time-lag a variable — tlag • pmdplyr

This function retrieves the time-lagged values of a variable, using the time variable supplied as .t in the call or declared earlier with as_pibble(). tlag() is highly unusual among time-lag functions in that it can be used even when observations are not uniquely identified by .t (and .i, if defined).

tlag(
  .var,
  .df = get(".", envir = parent.frame()),
  .n = 1,
  .default = NA,
  .quick = FALSE,
  .resolve = "error",
  .group_i = TRUE,
  .i = NULL,
  .t = NULL,
  .d = NA,
  .uniqcheck = FALSE
)

Arguments

.var	Unquoted variable from `.df` to be lagged.
.df	Data frame, pibble, or tibble (usually the object that contains `.var`) that contains the panel structure variables either listed in `.i` and `.t`, or earlier declared with `as_pibble()`. If `tlag` is called inside of a `dplyr` verb, this can be omitted and the data will be picked up automatically.
.n	Number of periods to lag by. 1 by default. Note that this is automatically scaled by `.d`: if `.d = 2` and `.n = 1`, then the lag of `.t = 3` will be `.t = 1`. Negative values are allowed and are equivalent to calling `tlead()` with the corresponding positive value. Note that `.n` is ignored if `.d = 0`.
.default	Fill-in value used when the lagged observation is not present. Defaults to NA.
.quick	If `.i` and `.t` uniquely identify observations in your data, and either `.d = 0` or there are no time gaps for any individuals (perhaps use `panel_fill()` first), set `.quick = TRUE` to improve speed. `tlag()` will not check whether these conditions hold (except unique identification, which will be checked if `.uniqcheck = TRUE` or if `.i` or `.t` are specified in-function), so make sure they do or you will get strange results.
.resolve	If there is more than one observation per individual/period, and the value of `.var` is identical for all of them, that's no problem. But what should `tlag()` do if they're not identical? Set `.resolve = 'error'` (or, really, any string) to throw an error in this circumstance. Or, set `.resolve` to a function (ideally, a vectorized one) that can be used within `dplyr::summarize()` to select a single value per individual/period. For example, `.resolve = mean` to get the mean value of all observations present for that individual/period.
.group_i	By default, if `.i` is specified or found in the data, `tlag()` will group the data by `.i`, ignoring any grouping already implemented. Set `.group_i = FALSE` to avoid this.
.i	Quoted or unquoted variable(s) that identify the individual cases. Note that setting any one of `.i`, `.t`, or `.d` will override all three already applied to the data, and will return data that is `as_pibble()`d with all three, unless `.setpanel=FALSE`.
.t	Quoted or unquoted variable indicating the time. `pmdplyr` accepts two kinds of time variables: numeric variables where a fixed distance `.d` will take you from one observation to the next, or, if `.d=0`, any standard variable type with an order. Consider using the `time_variable()` function to create the necessary variable if your data uses a `Date` variable for time.
.d	Number indicating the gap in `.t` between one period and the next. For example, if `.t` indicates a single day but data is collected once a week, you might set `.d=7`. To ignore gap length and assume that "one period ago" is always the most recent prior observation in the data, set `.d = 0`. The default `.d = NA` here will become `.d = 1` if either `.i` or `.t` is declared.
.uniqcheck	Logical parameter. Set to TRUE to always check whether `.i` and `.t` uniquely identify observations in the data. By default this is set to FALSE and the check is only performed once per session, and only if at least one of `.i`, `.t`, or `.d` is set.
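
To make the interaction between `.n` and `.d` concrete, here is a minimal base-R sketch of a gap-respecting lag on a single individual. It is only an illustration of the semantics described above, not how `tlag()` is implemented; the helper `gap_lag()` and the toy vectors are hypothetical.

```r
# Toy series for one individual: period 2 is missing.
t <- c(1, 3, 4)
x <- c(10, 30, 40)

# Hypothetical helper: look up the value observed at t - n * d,
# returning NA when that period is not in the data.
gap_lag <- function(x, t, n = 1, d = 1) x[match(t - n * d, t)]

gap_lag(x, t)          # NA NA 30  (t = 3 looks for t = 2, which is absent)
gap_lag(x, t, d = 2)   # NA 10 NA  (the lag of t = 3 is t = 1)
gap_lag(x, t, n = -1)  # NA 40 NA  (a negative n acts as a lead)

# A row-based lag simply takes the previous row and ignores the gap.
row_lag <- c(NA, head(x, -1))
row_lag                # NA 10 30
```

For sorted, uniquely identified data, the row-based result corresponds to what the `.d = 0` description asks for: the most recent prior observation, whatever the gap.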

Examples


library(pmdplyr)
data(Scorecard)

# The Scorecard data is uniquely identified by unitid and year.
# However, there are sometimes gaps between years.
# In cases like this, dplyr::lag() will still use the row before,
# whereas tlag() will respect the gap and give an NA, much like plm::lag()
# (although tlag is slower than either, sorry)
Scorecard <- Scorecard %>%
  dplyr::mutate(pmdplyr_tlag = tlag(earnings_med,
    .i = unitid,
    .t = year
  ))
Scorecard <- Scorecard %>%
  dplyr::arrange(year) %>%
  dplyr::group_by(unitid) %>%
  dplyr::mutate(dplyr_lag = dplyr::lag(earnings_med)) %>%
  dplyr::ungroup()

# more NAs in the pmdplyr version - observations with a gap and thus no real lag present in data
sum(is.na(Scorecard$pmdplyr_tlag))
#> [1] 26987
sum(is.na(Scorecard$dplyr_lag))
#> [1] 16950

# If we want to ignore gaps, or have .d = 0, and .i and .t uniquely identify observations,
# we can use the .quick option to match dplyr::lag()
Scorecard <- Scorecard %>%
  dplyr::mutate(pmdplyr_quick_tlag = tlag(earnings_med,
    .i = unitid,
    .t = year,
    .d = 0,
    .quick = TRUE
  ))
sum(Scorecard$dplyr_lag != Scorecard$pmdplyr_quick_tlag, na.rm = TRUE)
#> [1] 0

# Where tlag shines is when you have multiple observations per .i/.t
# If the value of .var is constant within .i/.t, it will work just as you expect.
# If it's not, it will throw an error, or you can set
# .resolve to tell tlag how to select a single value from the many
# Maybe we want to get the lagged average earnings within degree award type
Scorecard <- Scorecard %>%
  dplyr::mutate(
    last_year_earnings_by_category =
      tlag(earnings_med,
        .i = pred_degree_awarded_ipeds, .t = year,
        .resolve = function(x) mean(x, na.rm = TRUE)
      )
  )
# Or maybe I want the lagged earnings across all types - .i isn't necessary!
Scorecard <- Scorecard %>%
  dplyr::mutate(last_year_earnings_all = tlag(earnings_med,
    .t = "year",
    .resolve = function(x) mean(x, na.rm = TRUE)
  ))
# Curious why the first nonmissing obs show up in 2012?
# It's because there's no 2008 or 2010 in the data, so when 2009 or 2011 look back
# a year, they find nothing!
# We could get around this by setting .d = 0 to ignore gap length
# Note this can be a little slow.
Scorecard <- Scorecard %>%
  dplyr::mutate(last_year_earnings_all = tlag(earnings_med,
    .t = year, .d = 0,
    .resolve = function(x) mean(x, na.rm = TRUE)
  ))
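
The idea behind `.resolve` can also be sketched in base R: first collapse the multiple observations in each period to one value, then look up the previous period's value for every row. This is a simplified illustration with made-up data and `.d = 1`, not the package's internal code.

```r
# Two observations per period, with differing values within each period.
t <- c(1, 1, 2, 2)
x <- c(10, 20, 30, 50)

# Step 1: resolve each period to a single value (here, the mean).
per_period <- tapply(x, t, mean)          # period 1 -> 15, period 2 -> 40
periods <- as.numeric(names(per_period))

# Step 2: for each row, fetch the resolved value from one period earlier.
lagged <- as.vector(per_period)[match(t - 1, periods)]
lagged                                    # NA NA 15 15
```

Both rows in period 2 receive the same lagged value, because the lag is defined per period rather than per row.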