R/panel_consistency.R
panel_locf.Rd
This function looks for a list of values (usually, just NA) in a variable .var and overwrites those values with the most recent (or next-coming) values that are not from that list ("last observation carried forward").
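The carry-forward idea described above can be sketched in base R. This is a minimal illustration for a single, already-sorted vector, not the implementation behind panel_locf(); the helper name locf_sketch and the toy prices vector are hypothetical and exist only for this example.

```r
# Hypothetical helper (not part of any package): replace every value found in
# `fill` with the most recent earlier value that is not in `fill`.
locf_sketch <- function(x, fill = NA) {
  # TRUE where the value is "good", i.e. not in the list of values to overwrite
  keep <- !(x %in% fill)
  # For each position, the index of the latest good value at or before it
  # (0 while no good value has been seen yet)
  idx <- cummax(seq_along(x) * keep)
  # Nothing to carry forward yet: leave those positions missing
  idx[idx == 0] <- NA
  x[idx]
}

prices <- c(NA, 10, NA, 0, 12, NA)

# Overwrite only NA
forward <- locf_sketch(prices)

# Treat 0 as a value to overwrite as well
forward_no_zero <- locf_sketch(prices, fill = c(NA, 0))

# "Next-coming" values: reverse, carry forward, reverse back
backward <- rev(locf_sketch(rev(prices)))
```

The leading NA survives the forward fill because there is no earlier value to pull from; the backward variant leaves the trailing NA for the same reason.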
panel_locf(
  .var,
  .df = get(".", envir = parent.frame()),
  .fill = NA,
  .backwards = FALSE,
  .resolve = "error",
  .group_i = TRUE,
  .i = NULL,
  .t = NULL,
  .d = 1,
  .uniqcheck = FALSE
)
Argument | Description |
---|---|
.var | Vector to be modified. |
.df | Data frame, pibble, or tibble (usually the one containing |
.fill | Vector of values to be overwritten. Just |
.backwards | By default, values of newly-created observations are copied from the most recently available period. Set |
.resolve | If there is more than one observation per individual/period, and the value of |
.group_i | By default, if |
.i | Quoted or unquoted variables that identify the individual cases. Note that setting any one of |
.t | Quoted or unquoted variable indicating the time. |
.d | Number indicating the gap in |
.uniqcheck | Logical parameter. Set to TRUE to always check whether |
panel_locf() is unusual among last-observation-carried-forward functions (like zoo::na.locf()) in that it is usable even if observations are not uniquely identified by .t (and .i, if defined).
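To make the non-unique case concrete, here is a base R sketch of one way to fill values when several rows share the same individual and period: first collapse each individual/period cell to a single value, then carry that value forward within each individual. It is an illustration under stated assumptions, not the package's internal logic. The data frame df, its columns, and the helpers resolve and locf are all hypothetical, and the mean-based resolve mirrors the kind of function that could be passed as .resolve.

```r
# Toy panel (hypothetical): individual 1 has two rows in periods 1 and 3
df <- data.frame(
  id    = c(1, 1, 1, 1, 1, 2, 2),
  t     = c(1, 1, 2, 3, 3, 1, 2),
  price = c(10, 20, NA, NA, 30, NA, 5)
)

# Collapse several values in one id/t cell to a single value
resolve <- function(x) if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)

# Carry the last non-missing value forward along a sorted vector
locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))
  idx[idx == 0] <- NA
  x[idx]
}

# Step 1: every row gets the resolved value of its own id/t cell
cell_value <- ave(df$price, df$id, df$t, FUN = resolve)

# Step 2: within each id, in time order, carry the cell values forward
ord <- order(df$id, df$t)
carried <- numeric(nrow(df))
carried[ord] <- ave(cell_value[ord], df$id[ord], FUN = locf)

# Step 3: overwrite only the rows that were missing to begin with
df$price_filled <- ifelse(is.na(df$price), carried, df$price)
```

In this sketch the missing row in period 3 takes the value 30 from its own period rather than the lagged 15, and individual 2's first period stays missing because values are never carried across individuals.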
# panel_locf(), SPrail, and Scorecard come from the pmdplyr package
library(pmdplyr)

# The SPrail data has some missing price values.
# Let's fill them in!
# Note .d=0 tells it to ignore how big the gaps are
# between one period and the next, just look for the most recent insert_date
# .resolve tells it what value to pick if there are multiple
# observed prices for that route/insert_date
# (.resolve is not necessary if .i and .t uniquely identify obs,
# or if .var is either NA or constant within them)
# Also note - this will fill in using CURRENT-period
# data first (if available) before looking for lagged data.
data(SPrail)
sum(is.na(SPrail$price))
#> [1] 249
SPrail <- SPrail %>%
  dplyr::mutate(price = panel_locf(price,
    .i = c(origin, destination), .t = insert_date, .d = 0,
    .resolve = function(x) mean(x, na.rm = TRUE)
  ))

# The spec is a little easier with data like Scorecard where
# .i and .t uniquely identify observations
# so .resolve isn't needed.
data(Scorecard)
sum(is.na(Scorecard$earnings_med))
#> [1] 15706
Scorecard <- Scorecard %>%
  # Let's speed this up by just doing four-year colleges in Colorado
  dplyr::filter(
    pred_degree_awarded_ipeds == 3,
    state_abbr == "CO"
  ) %>%
  # Now let's fill in NAs and also in case there are any erroneous 0s
  dplyr::mutate(earnings_med = panel_locf(earnings_med,
    .fill = c(NA, 0),
    .i = unitid, .t = year
  ))
# Note that there are still some missings - these are missings that come before the first
# non-missing value in that unitid, so there's nothing to pull from.
sum(is.na(Scorecard$earnings_med))
#> [1] 17