This function looks for a list of values (by default, just NA) in a variable .var and overwrites those values with the most recent (or, with .backwards = TRUE, the next available) values that are not on that list ("last observation carried forward").

panel_locf(
  .var,
  .df = get(".", envir = parent.frame()),
  .fill = NA,
  .backwards = FALSE,
  .resolve = "error",
  .group_i = TRUE,
  .i = NULL,
  .t = NULL,
  .d = 1,
  .uniqcheck = FALSE
)

Arguments

.var

Vector to be modified.

.df

Data frame, pibble, or tibble (usually the one containing .var) that contains the panel structure variables either listed in .i and .t, or earlier declared with as_pibble(). If panel_locf() is called inside of a dplyr verb, this can be omitted and the data will be picked up automatically.

.fill

Vector of values to be overwritten. Just NA by default.
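
For illustration, here is a minimal sketch of overwriting zeros as well as NAs; the data frame and its variables (df, id, t, x) are made up for this example, and it assumes pmdplyr and dplyr are attached:

library(pmdplyr)
library(dplyr)

# Hypothetical toy panel: a zero to treat as missing, plus a true NA
df <- tibble::tibble(
  id = c(1, 1, 1),
  t = 1:3,
  x = c(12, 0, NA)
)

# With .fill = c(NA, 0), both the 0 at t = 2 and the NA at t = 3
# should be replaced with 12, carried forward from t = 1
df %>% mutate(x_filled = panel_locf(x, .fill = c(NA, 0), .i = id, .t = t))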

.backwards

By default, values to be overwritten are copied from the most recent period with an available (non-.fill) value. Set .backwards = TRUE to instead copy values from the closest *following* period.
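
A sketch of the default versus backwards behavior, using the same kind of made-up setup as the sketch above (pmdplyr and dplyr attached):

# Hypothetical toy panel: one individual, four periods, one missing value
df <- tibble::tibble(
  id = c(1, 1, 1, 1),
  t = 1:4,
  x = c(10, 20, NA, 40)
)

# Forward fill (default): the NA at t = 3 should become 20, from t = 2
df %>% mutate(x_fwd = panel_locf(x, .i = id, .t = t))

# Backward fill: the NA at t = 3 should instead become 40, from t = 4
df %>% mutate(x_back = panel_locf(x, .i = id, .t = t, .backwards = TRUE))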

.resolve

If there is more than one observation per individual/period, and the value of .var is identical for all of them, that's no problem. But what should panel_locf() do if they're not identical? Set .resolve = 'error' (or, really, any string) to throw an error in this circumstance. Or, set .resolve to a function that can be used within dplyr::summarize() to select a single value per individual/period, for example .resolve = function(x) mean(x) to get the mean of all observations present for that individual/period. .resolve will also be used to fill in values if some values in a given individual/period are to be overwritten and others aren't. Using a function will be quicker than .resolve = 'error', so if you're certain there's no issue, you can speed up execution by setting, say, .resolve = dplyr::first.
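
A hedged sketch of how a summarizing .resolve might behave when .i and .t do not uniquely identify rows (made-up data, pmdplyr and dplyr attached):

# Hypothetical data: id 1 appears twice at t = 2 with different values of x
df <- tibble::tibble(
  id = c(1, 1, 1, 1),
  t = c(1, 2, 2, 3),
  x = c(5, 6, 8, NA)
)

# With the default .resolve = "error" this would stop, since the two t = 2
# values disagree; a summarizing function resolves them to a single value
df %>% mutate(x_filled = panel_locf(x, .i = id, .t = t, .resolve = mean))
# The NA at t = 3 should be filled with 7, the mean of the two t = 2 values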

.group_i

By default, if .i is specified or found in the data, panel_locf() will group the data by .i, ignoring any grouping already implemented. Set .group_i = FALSE to avoid this.

.i

Quoted or unquoted variables that identify the individual cases. Note that setting any one of .i, .t, or .d will override all three already applied to the data, and will return data that is as_pibble()d with all three, unless .setpanel=FALSE.

.t

Quoted or unquoted variable indicating the time. pmdplyr accepts two kinds of time variables: numeric variables where a fixed distance .d will take you from one observation to the next, or, if .d=0, any standard variable type with an order. Consider using the time_variable() function to create the necessary variable if your data uses a Date variable for time.
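
As a sketch of the .d = 0 case, a Date variable can serve directly as .t, since only its ordering is used (made-up data, pmdplyr and dplyr attached):

# Hypothetical data with irregularly spaced dates
df <- tibble::tibble(
  id = c(1, 1, 1),
  dt = as.Date(c("2020-01-01", "2020-02-15", "2020-04-01")),
  x = c(1, NA, 3)
)

# .d = 0 ignores gap lengths, so the NA should be filled from the most
# recent prior date, however far back it is
df %>% mutate(x_filled = panel_locf(x, .i = id, .t = dt, .d = 0))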

.d

Number indicating the gap in .t between one period and the next. For example, if .t indicates a single day but data is collected once a week, you might set .d=7. To ignore gap length and assume that "one period ago" is always the most recent prior observation in the data, set .d=0. By default, .d=1.
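
A sketch of the weekly-data case described above (made-up data, pmdplyr and dplyr attached):

# Data recorded every 7 days; with .d = 7, the observation 7 days earlier
# counts as "one period ago"
df <- tibble::tibble(
  id = c(1, 1, 1),
  day = c(1, 8, 15),
  x = c(100, NA, 120)
)

# The NA on day 8 should be filled with 100, carried forward from day 1
df %>% mutate(x_filled = panel_locf(x, .i = id, .t = day, .d = 7))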

.uniqcheck

Logical parameter. Set to TRUE to always check whether .i and .t uniquely identify observations in the data. By default this is set to FALSE and the check is only performed once per session, and only if at least one of .i, .t, or .d is set.

Details

panel_locf() is unusual among last-observation-carried-forward functions (like zoo::na.locf()) in that it is usable even if observations are not uniquely identified by .t (and .i, if defined).
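
A minimal sketch of that point (made-up data, pmdplyr and dplyr attached): here .i and .t do not uniquely identify rows, but the duplicated values agree, so no .resolve function is needed.

# Two rows share id = 1, t = 1 with the same value of x
df <- tibble::tibble(
  id = c(1, 1, 1),
  t = c(1, 1, 2),
  x = c(3, 3, NA)
)

# The NA at t = 2 should be filled with 3
df %>% mutate(x_filled = panel_locf(x, .i = id, .t = t))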

Examples

# The SPrail data has some missing price values.
# Let's fill them in!
# Note .d=0 tells it to ignore how big the gaps are
# between one period and the next, just look for the most recent insert_date
# .resolve tells it what value to pick if there are multiple
# observed prices for that route/insert_date
# (.resolve is not necessary if .i and .t uniquely identify obs,
# or if .var is either NA or constant within them)
# Also note - this will fill in using CURRENT-period
# data first (if available) before looking for lagged data.
data(SPrail)
sum(is.na(SPrail$price))
#> [1] 249
SPrail <- SPrail %>%
  dplyr::mutate(price = panel_locf(price,
    .i = c(origin, destination), .t = insert_date, .d = 0,
    .resolve = function(x) mean(x, na.rm = TRUE)
  ))

# The spec is a little easier with data like Scorecard where
# .i and .t uniquely identify observations
# so .resolve isn't needed.
data(Scorecard)
sum(is.na(Scorecard$earnings_med))
#> [1] 15706
Scorecard <- Scorecard %>%
  # Let's speed this up by just doing four-year colleges in Colorado
  dplyr::filter(
    pred_degree_awarded_ipeds == 3,
    state_abbr == "CO"
  ) %>%
  # Now let's fill in NAs and also in case there are any erroneous 0s
  dplyr::mutate(earnings_med = panel_locf(earnings_med,
    .fill = c(NA, 0),
    .i = unitid, .t = year
  ))

# Note that there are still some missings - these are missings that come before the first
# non-missing value in that unitid, so there's nothing to pull from.
sum(is.na(Scorecard$earnings_med))
#> [1] 17