This function creates new observations to fill in any gaps in panel data. For example, if individual 1 has observations in periods t = 1 and t = 3 but no others, this function will create an observation for t = 2. By default, the t = 2 observation will be identical to the t = 1 observation except for the time variable, but this can be adjusted. This function returns data sorted by `.i` and `.t`.
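As a minimal sketch of the behavior described above, the following uses a small made-up data frame (the names `df`, `i`, `t`, and `x` are hypothetical, not part of the package) in which individual 1 is missing t = 2. It assumes `pmdplyr` is installed.

```r
library(pmdplyr)

# Hypothetical toy panel: individual 1 is observed at t = 1 and t = 3 only
df <- data.frame(
  i = c(1, 1, 2, 2),
  t = c(1, 3, 1, 2),
  x = c(10, 30, 5, 6)
)

# Fill the gap; by default the new t = 2 row copies its values from t = 1
filled <- panel_fill(df, .i = i, .t = t)

# Inspect individual 1, now with a t = 2 row between the original two
filled[filled$i == 1, ]
```

Individual 2 has no internal gap, and no `.max` is given, so no rows are added for that individual here.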
```r
panel_fill(
  .df,
  .set_NA = FALSE,
  .min = NA,
  .max = NA,
  .backwards = FALSE,
  .group_i = TRUE,
  .flag = NA,
  .i = NULL,
  .t = NULL,
  .d = 1,
  .uniqcheck = FALSE,
  .setpanel = TRUE
)
```
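Argument | Description |
---|---|
.df | Tibble or data frame which either has the `.i` and `.t` attributes set by `as_pibble()`, or has its panel structure declared in the function call. |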
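.set_NA | Should values in newly-created observations be set to adjacent values or to NA? Set to `TRUE` to set every variable other than `.i` and `.t` to NA in new observations, or to a character vector of variable names to set only those variables to NA. |
.min | Sets the first time period in the data for each individual to be `.min`, filling in any periods between `.min` and that individual's first observation. |
.max | Sets the last time period in the data for each individual to be `.max`, filling in any periods between that individual's last observation and `.max`. |
.backwards | By default, values of newly-created observations are copied from the most recently available period. Set `.backwards = TRUE` to copy them from the next available period instead. |
.group_i | By default, `panel_fill()` fills in gaps within each value of `.i`. Set `.group_i = FALSE` to run ungrouped, or with only the existing group structure. |
.flag | The name of a new variable indicating which observations are newly created by `panel_fill()`. |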
.i | Quoted or unquoted variables that identify the individual cases. Note that setting any one of |
.t | Quoted or unquoted variable indicating the time. |
.d | Number indicating the gap in `.t` between one period and the next. |
.uniqcheck | Logical parameter. Set to |
.setpanel | Logical parameter. |
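The arguments above can be illustrated with a short sketch on made-up data (the data frame `df` and its columns are hypothetical; `pmdplyr` is assumed to be installed). Each call changes one option so its effect on the new rows is easy to see.

```r
library(pmdplyr)

# Hypothetical toy panel: individual 1 lacks t = 2, individual 2 starts at t = 2
df <- data.frame(
  i = c(1, 1, 2, 2),
  t = c(1, 3, 2, 3),
  x = c(10, 30, 5, 6)
)

# Mark newly created rows in a variable called "new"
flagged <- panel_fill(df, .i = i, .t = t, .flag = "new")

# Leave everything except .i and .t missing in the new rows
blank <- panel_fill(df, .i = i, .t = t, .set_NA = TRUE)

# Copy values from the following period rather than the preceding one
back <- panel_fill(df, .i = i, .t = t, .backwards = TRUE)

# Force every individual to run from t = 1 to t = 3 (a balanced panel)
balanced <- panel_fill(df, .i = i, .t = t, .min = 1, .max = 3)
table(balanced$i)
```

Passing a character vector such as `.set_NA = "x"` should blank out only the named variables, per the argument description.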
Note that, in the case where there is more than one observation for a given individual/time period (or just time period if `.group_i = FALSE`), `panel_fill()` will create copies of *every observation* in the appropriate individual/time period for filling-in purposes. So if there are four t = 1 observations and nothing in t = 2, `panel_fill()` will create four new observations with t = 2, copying the original four in t = 1.
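A small sketch of the duplicate case, using hypothetical data with two observations at t = 1 and none at t = 2 (`pmdplyr` is assumed to be installed). Because `.i` and `.t` do not uniquely identify rows here, the package may print a note about that.

```r
library(pmdplyr)

# Hypothetical data: two rows share (i = 1, t = 1); t = 2 is missing
dups <- data.frame(
  i = c(1, 1, 1),
  t = c(1, 1, 3),
  x = c("a", "b", "c")
)

filled <- panel_fill(dups, .i = i, .t = t)

# Both t = 1 rows are copied forward, so t = 2 also has two rows
filled[filled$t == 2, ]
```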
By default, the `panel_fill()` operation is grouped by `.i`, although it returns the data in its original grouping structure. To run the function ungrouped, or with only the existing group structure, leave `.i` blank or, if `.i` is already set in the data by `as_pibble()`, set `.group_i = FALSE`.
This function requires `.t` and `.d` to be declared in the function call or already established in the data by `as_pibble()`. It also requires a cardinal `.t`: `.d` must not be 0.
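The two ways of supplying `.t` and `.d` can be sketched as follows, assuming `pmdplyr` is installed. The data are hypothetical: one individual observed every two years, with 2002 missing.

```r
library(pmdplyr)

# Hypothetical biennial data with a gap at 2002
df <- data.frame(
  i = c(1, 1),
  t = c(2000, 2004),
  x = c(1.5, 2.5)
)

# Option 1: declare the panel structure in the call itself
filled_direct <- panel_fill(df, .i = i, .t = t, .d = 2)

# Option 2: establish the structure first with as_pibble(),
# then call panel_fill() without repeating .i and .t
pb <- as_pibble(df, .i = i, .t = t, .d = 2)
filled_pibble <- panel_fill(pb)

filled_direct$t
filled_pibble$t
```

With `.d = 2`, only 2002 counts as a missing period; 2001 and 2003 are not created.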
```r
# Examples are too slow to run - this function is slow!
if (interactive()) {
  data(Scorecard)

  # Notice that, in the Scorecard data, the gap between one year and the next is not always constant
  table((Scorecard %>%
    dplyr::arrange(year) %>%
    dplyr::group_by(unitid) %>%
    dplyr::mutate(diff = year - dplyr::lag(year)))$diff)

  # And also that not all universities show up for the first or last times in the same year
  year_range <- Scorecard %>%
    dplyr::group_by(unitid) %>%
    dplyr::summarize(first_year = min(year), last_year = max(year))
  table(year_range$first_year)
  table(year_range$last_year)
  rm(year_range)

  # We can deal with the inconsistent-gaps problem by creating new obs to fill in
  # this version will fill in the new obs with the most recently observed data, and flag them
  Scorecard_filled <- panel_fill(Scorecard,
    .i = unitid,
    .t = year,
    .flag = "new"
  )

  # Or maybe we want those observations in there but don't want to treat them as real data
  # so instead of filling them in, just leave all the data in the new obs blank
  # (note this sets EVERYTHING not in .i or .t to NA - if you only want some variables NA,
  # make .set_NA a character vector of those variable names)
  Scorecard_filled <- panel_fill(Scorecard,
    .i = unitid,
    .t = year,
    .flag = "new",
    .set_NA = TRUE
  )

  # Perhaps we want a perfectly balanced panel. So let's set .max and .min to the start and end
  # of the data, and it will fill in everything.
  Scorecard_filled <- panel_fill(Scorecard,
    .i = unitid,
    .t = year,
    .flag = "new",
    .min = min(Scorecard$year),
    .max = max(Scorecard$year)
  )

  # how many obs of each college? Should be identical, and equal to the number of years there are
  table(table(Scorecard_filled$unitid))
  length(unique(Scorecard_filled$year))
}
```