This function creates new observations to fill in any gaps in panel data. For example, if individual 1 has an observation in periods t = 1 and t = 3 but no others, this function will create an observation for t = 2. By default, the t = 2 observation will be identical to the t = 1 observation except for the time variable, but this can be adjusted. This function returns data sorted by .i and .t.

panel_fill(
  .df,
  .set_NA = FALSE,
  .min = NA,
  .max = NA,
  .backwards = FALSE,
  .group_i = TRUE,
  .flag = NA,
  .i = NULL,
  .t = NULL,
  .d = 1,
  .uniqcheck = FALSE,
  .setpanel = TRUE
)

Arguments

.df

Tibble or data frame that either has the .t and .d (and perhaps .i) attributes set by as_pibble(), or whose panel structure is declared in the function call.

.set_NA

Should values in newly-created observations be set to adjacent values or to NA? Set to TRUE to set all new values to NA except for .i and .t. To make only specific variables NA, list them as a character vector. Defaults to FALSE; all values are filled in using the most recently available data.

.min

Sets the first time period in the data for each individual to be .min, and fills in gaps between period .min and the actual start of the data. Copies data from the first period present in the data for each individual (if grouped). Handy for creating balanced panels.

.max

Sets the last time period in the data for each individual to be .max, and fills in gaps between the actual end of the data and period .max. Copies data from the last period present in the data for each individual (if grouped). Handy for creating balanced panels.

.backwards

By default, values of newly-created observations are copied from the most recently available period. Set .backwards = TRUE to instead copy values from the closest *following* period.

.group_i

By default, panel_fill() fills in gaps separately within each value of .i. If .i is not specified, it does not group by individual. If .i is in the data and you still don't want panel_fill() to run within .i, set .group_i = FALSE.

.flag

The name of a new variable indicating which observations are newly created by panel_fill().

.i

Quoted or unquoted variables that identify the individual cases. Note that setting any one of .i, .t, or .d will override all three already applied to the data, and will return data that is as_pibble()d with all three, unless .setpanel=FALSE.

.t

Quoted or unquoted variable indicating the time. pmdplyr accepts two kinds of time variables: numeric variables where a fixed distance .d will take you from one observation to the next, or, if .d=0, any standard variable type with an order. Consider using the time_variable() function to create the necessary variable if your data uses a Date variable for time.

.d

Number indicating the gap in .t between one period and the next. For example, if .t indicates a single day but data is collected once a week, you might set .d=7. To ignore gap length and assume that "one period ago" is always the most recent prior observation in the data, set .d=0. By default, .d=1.

.uniqcheck

Logical parameter. Set to TRUE to always check whether .i and .t uniquely identify observations in the data. By default this is set to FALSE and the check is only performed once per session, and only if at least one of .i, .t, or .d is set.

.setpanel

Logical parameter. TRUE by default, so if .i, .t, and/or .d are declared in the call, the function returns a pibble with that panel structure set.
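The fill-related arguments above can be compared on a small toy panel. This is a minimal sketch, not taken from the package: the data frame df and its columns id, t, and x are made up for illustration, and the comments describe what the argument descriptions above say should happen.

```r
library(pmdplyr)

# Hypothetical toy panel: individual 1 is missing t = 2; individual 2 is complete.
df <- data.frame(
  id = c(1, 1, 2, 2, 2),
  t  = c(1, 3, 1, 2, 3),
  x  = c(10, 30, 100, 200, 300)
)

# Default: the new (id = 1, t = 2) row copies x from t = 1; "new" flags it.
filled <- panel_fill(df, .i = id, .t = t, .flag = "new")

# Copy from the closest following period (t = 3) instead.
filled_back <- panel_fill(df, .i = id, .t = t, .backwards = TRUE)

# Leave x missing in the new row rather than copying a value.
filled_na <- panel_fill(df, .i = id, .t = t, .set_NA = "x")

# Extend every individual to cover periods 0 through 4 (a balanced panel).
balanced <- panel_fill(df, .i = id, .t = t, .min = 0, .max = 4)
```

Inspecting filled, filled_back, and filled_na side by side shows how the single new row differs under each setting, and table(balanced$id) should show the same number of rows for each individual.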

Details

Note that, in the case where there is more than one observation for a given individual/time period (or just time period if .group_i = FALSE), panel_fill() will create copies of *every observation* in the appropriate individual/time period for filling-in purposes. So if there are four t = 1 observations and nothing in t = 2, panel_fill() will create four new observations with t = 2, copying the original four in t = 1.
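The duplicate-observation behavior described above can be checked on a tiny example. This is a sketch under assumed data: the data frame dup and its columns are hypothetical, and the uniqueness check described under .uniqcheck may report that .i and .t do not uniquely identify the rows.

```r
library(pmdplyr)

# Hypothetical data: two observations share (id = 1, t = 1), and t = 2 is missing.
dup <- data.frame(
  id = c(1, 1, 1),
  t  = c(1, 1, 3),
  x  = c("a", "b", "c")
)

filled_dup <- panel_fill(dup, .i = id, .t = t)

# Count rows per period; per the description above, both t = 1 rows
# should be copied into t = 2.
table(filled_dup$t)
```

If this multiplication of rows is not what you want, it may be worth collapsing the data to one row per individual and period before filling.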

By default, the panel_fill() operation is grouped by .i, although it will return the data in the original grouping structure. To run the function ungrouped, or with the existing group structure, leave .i blank or, if .i is already in the data from as_pibble(), set .group_i = FALSE.

This function requires .t and .d to be declared in the function call or already established in the data by as_pibble(). It also requires a cardinal .t, so .d = 0 is not allowed.
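As a sketch of the second route, the panel structure can be declared once with as_pibble() and then picked up by panel_fill(). The data frame and its columns are hypothetical, and the example assumes biennial data so that .d = 2.

```r
library(pmdplyr)

# Hypothetical biennial data: periods are two years apart, and 2004 is missing.
df <- data.frame(
  id   = c(1, 1, 1),
  year = c(2000, 2002, 2006),
  x    = c(1, 2, 3)
)

# Declare the panel structure once, with a two-year gap between periods.
pb <- as_pibble(df, .i = id, .t = year, .d = 2)

# panel_fill() can then read .i, .t, and .d from the data.
filled <- panel_fill(pb)
filled$year
```

With .d = 2, only 2004 should be added; the odd years are not periods in this panel, so they are not treated as gaps.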

Examples

# Examples are too slow to run - this function is slow!
if (interactive()) {
  data(Scorecard)

  # Notice that, in the Scorecard data, the gap between one year and the next is not always constant
  table((Scorecard %>%
    dplyr::arrange(year) %>%
    dplyr::group_by(unitid) %>%
    dplyr::mutate(diff = year - dplyr::lag(year)))$diff)

  # And also that not all universities show up for the first or last times in the same year
  year_range <- Scorecard %>%
    dplyr::group_by(unitid) %>%
    dplyr::summarize(first_year = min(year), last_year = max(year))
  table(year_range$first_year)
  table(year_range$last_year)
  rm(year_range)

  # We can deal with the inconsistent-gaps problem by creating new obs to fill in
  # this version will fill in the new obs with the most recently observed data, and flag them
  Scorecard_filled <- panel_fill(Scorecard,
    .i = unitid,
    .t = year,
    .flag = "new"
  )

  # Or maybe we want those observations in there but don't want to treat them as real data
  # so instead of filling them in, just leave all the data in the new obs blank
  # (note this sets EVERYTHING not in .i or .t to NA - if you only want some variables NA,
  # make .set_NA a character vector of those variable names)
  Scorecard_filled <- panel_fill(Scorecard,
    .i = unitid,
    .t = year,
    .flag = "new",
    .set_NA = TRUE
  )

  # Perhaps we want a perfectly balanced panel. So let's set .min and .max to the start and end
  # of the data, and it will fill in everything.
  Scorecard_filled <- panel_fill(Scorecard,
    .i = unitid,
    .t = year,
    .flag = "new",
    .min = min(Scorecard$year),
    .max = max(Scorecard$year)
  )

  # how many obs of each college? Should be identical, and equal to the number of years there are
  table(table(Scorecard_filled$unitid))
  length(unique(Scorecard_filled$year))
}