```r
library(pmdplyr)
```
The `pmdplyr` package is an extension to `dplyr` designed for cleaning and managing panel and hierarchical data. It contains variations on the `dplyr` `mutate()` and `*_join()` functions that address common panel data needs, as well as other functions for cleaning and managing panel data. Unlike other panel data packages, functions in `pmdplyr` are all designed to work even if there is more than one observation per individual per period. This comes in handy if each individual is observed multiple times per period - for example, multiple classes per student per term - or if you have hierarchical data - for example, multiple companies per country.
There are three vignettes in total describing the contents of `pmdplyr`:

- “pmdplyr”/“Get Started”, which describes the `pibble` panel data object type, and the `pmdplyr` tools for creating well-behaved ID and time variables, `id_variable()` and `time_variable()`.
- “dplyr variants”, which describes `pmdplyr` variations on the `dplyr` functions `mutate()` (`mutate_cascade()` and `mutate_subset()`), `_join` (`inexact_join` and `safe_join()`), and `lag` (`tlag()`).
- “Panel Tools” (the current vignette), which describes novel tools that `pmdplyr` provides for cleaning and manipulating panel data (`panel_fill()`, `panel_locf()`, `fixed_check()`, `fixed_force()`, `between_i()`, `within_i()`, `mode_order()`).
In a panel data context, missing data can be categorized into two kinds: explicit missing values for certain observations (`NA`s), or observations that are missing entirely - for example, if person 1 has observations in period 1 and period 3, but not period 2. You may wish to fill in either of these kinds of missing data using the data you do have.
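As a small illustration (a hypothetical data set, not from the package), the `pibble` below contains both kinds of missingness: person 1 skips period 2 entirely, while person 2 has a row for period 2 whose `x` value is an explicit `NA`.

```r
# Hypothetical example data: both kinds of missingness at once.
gap_df <- pibble(
  i = c(1, 1, 2, 2, 2),
  t = c(1, 3, 1, 2, 3),
  x = c(10, 30, 5, NA, 15),
  .i = i,
  .t = t
)
# Person 1 has no row at all for t = 2 (implicitly missing),
# while person 2's row at t = 2 has an explicit NA in x.
```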
`panel_fill()` will fill in gaps between time periods for individuals. For example, if person 1 has observations in period 1 and period 3, but not period 2, then `panel_fill()` will add an observation to the data for person 1 in time period 2. If there is more than one observation for person 1 in period 1, then all of them will be copied for period 2.
```r
panel_fill(.df,
  .set_NA = FALSE, .min = NA, .max = NA, .backwards = FALSE,
  .group_i = TRUE, .flag = NA,
  .i = NULL, .t = NULL, .d = 1,
  .uniqcheck = FALSE, .setpanel = TRUE
)
```
`panel_fill()` will give us some newly-created observations, and we need to decide what to fill them in with. By default, it will fill in values using what we see in the most recent non-missing observation. But we can set `.backwards = TRUE` to use the next non-missing observation instead, or use `.set_NA` to fill the new observations with `NA`.

`.set_NA` is a character vector of variable names that should be set to `NA` for newly-created observations, or set it to `TRUE` to set everything except `.i` and `.t` to `NA`. You can also create a new variable indicating which observations are newly-created with `.flag`.
```r
# Note the gap between periods 2 and 4 for person 1.
df <- pibble(
  i = c(1, 1, 1, 2, 2, 2),
  t = c(2, 4, 5, 1, 2, 3),
  x = 1:6,
  y = 7:12,
  .i = i,
  .t = t
)

panel_fill(df, .set_NA = "y", .flag = "new_obs")
#> # A pibble: 7 x 5
#> # Individual-level identifier (.i): i [2]
#> # Time variable (.t): t [5]
#> # Gap (.d): 1
#>       i     t     x     y new_obs
#>   <dbl> <dbl> <int> <int> <lgl>
#> 1     1     2     1     7 FALSE
#> 2     1     3     1    NA TRUE
#> 3     1     4     2     8 FALSE
#> 4     1     5     3     9 FALSE
#> 5     2     1     4    10 FALSE
#> 6     2     2     5    11 FALSE
#> 7     2     3     6    12 FALSE

panel_fill(df, .set_NA = "y", .backwards = TRUE)$x
#> [1] 1 2 2 3 4 5 6
```
By default, `panel_fill()` will only fill in gaps between existing observations. However, commonly we might want to create new observations outside of the existing range, perhaps to create a fully balanced panel for ourselves. `.min` and `.max` will ensure that each individual has observations at least as far back as `.min`, and at least as far out as `.max`. Set `.min = min(t)` and `.max = max(t)` (where `t` is your time variable) to ensure a fully balanced panel.

Data for the outside-the-observed-range values will be taken from the closest observed value.
```r
panel_fill(df, .min = min(df$t), .max = max(df$t))
#> # A pibble: 10 x 4
#> # Individual-level identifier (.i): i [2]
#> # Time variable (.t): t [5]
#> # Gap (.d): 1
#>        i     t     x     y
#>    <dbl> <dbl> <int> <int>
#>  1     1     1     1     7
#>  2     1     2     1     7
#>  3     1     3     1     7
#>  4     1     4     2     8
#>  5     1     5     3     9
#>  6     2     1     4    10
#>  7     2     2     5    11
#>  8     2     3     6    12
#>  9     2     4     6    12
#> 10     2     5     6    12
```
The rest of the options include `.group_i` (by default, if `.i` can be found, data will be filled within-individual; set `.group_i = FALSE` to ignore this), and standard arguments related to declaring the panel structure of the data (`.i`, `.t`, `.d`, `.uniqcheck`; see the `pibble` material in the “Get Started” vignette). `.setpanel` ensures that if you declare the panel structure in the `panel_fill()` call, it will be maintained in the object you get back.
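For instance, here is a minimal sketch of declaring the panel structure inside the call itself, on a plain data frame with no `pibble` declaration. This assumes, as with `pibble()`, that `.i` and `.t` accept unquoted column names; with the default `.setpanel = TRUE`, the returned object keeps that declaration.

```r
# Hypothetical: the same data as above, but stored as an ordinary
# data.frame with no declared panel structure.
plain_df <- data.frame(
  i = c(1, 1, 1, 2, 2, 2),
  t = c(2, 4, 5, 1, 2, 3),
  x = 1:6,
  y = 7:12
)

# Declare .i and .t on the fly; .setpanel = TRUE (the default) keeps the
# declaration in the object that comes back.
panel_fill(plain_df, .i = i, .t = t, .set_NA = TRUE)
```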
`panel_locf()` (“last observation carried forward”) will fill in explicit `NA` values using recently available data. It is very similar to `zoo::na.locf()`, except that it respects panel structure and is more flexible.
```r
panel_locf(.var,
  .df = get(".", envir = parent.frame()),
  .fill = NA, .backwards = FALSE, .resolve = "error",
  .group_i = TRUE,
  .i = NULL, .t = NULL, .d = 1,
  .uniqcheck = FALSE
)
```
where `.var` is the variable to be filled in, and `.df` is the data set that variable lives in. If the data set is being passed in via `%>% mutate()` or similar, then `.df` will automatically pick it up and you don’t need to specify it.
```r
df <- pibble(
  i = c(1, 1, 1, 2, 2, 2),
  t = c(1, 2, 3, 2, 3, 4),
  x = c(1, NA, 3, NA, -3, 4),
  .i = i,
  .t = t
)

# Notice that the fourth observation doesn't get filled in
# because it's the first observation for person 2, so nothing to fill in from
df %>% mutate(x_filled = panel_locf(x))
#> # A pibble: 6 x 4
#> # Individual-level identifier (.i): i [2]
#> # Time variable (.t): t [4]
#> # Gap (.d): 1
#>       i     t     x x_filled
#>   <dbl> <dbl> <dbl>    <dbl>
#> 1     1     1     1        1
#> 2     1     2    NA        1
#> 3     1     3     3        3
#> 4     2     2    NA       NA
#> 5     2     3    -3       -3
#> 6     2     4     4        4
```
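Outside of a `mutate()` pipeline, `.df` has to be supplied by hand. A minimal sketch, assuming `.var` can be passed as an ordinary vector pulled from the same data set:

```r
# Fill in x directly, pointing .df at the pibble that supplies the panel
# structure; the result should match the x_filled column above.
panel_locf(df$x, .df = df)
```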
You have a fair amount of control over how filling-in works. By default, data will be filled in using the most recent previous observation, but `.backwards = TRUE` will use the next upcoming observation instead. Also, by default, only `NA` values will be overwritten; `.fill` allows you to specify a vector of values (perhaps including `NA`) to be overwritten. This can be handy if you’re working with data that uses missingness indicators other than `NA`.
```r
df %>% mutate(
  x_filled = panel_locf(x, .backwards = TRUE),
  x_no_neg3 = panel_locf(x, .backwards = TRUE, .fill = c(NA, -3))
)
#> # A pibble: 6 x 5
#> # Individual-level identifier (.i): i [2]
#> # Time variable (.t): t [4]
#> # Gap (.d): 1
#>       i     t     x x_filled x_no_neg3
#>   <dbl> <dbl> <dbl>    <dbl>     <dbl>
#> 1     1     1     1        1         1
#> 2     1     2    NA        3         3
#> 3     1     3     3        3         3
#> 4     2     2    NA       -3         4
#> 5     2     3    -3       -3         4
#> 6     2     4     4        4         4
```
`panel_locf()` will work even if `.i` and `.t` don’t uniquely identify the observations. However, this presents a problem! If there are different values of `.var` for a given combination of `.i` and `.t`, then which value do we choose to use for the purpose of filling in other observations? `.resolve` makes this choice. By default, there will be an “error” if values of `.var` are inconsistent within `.i` and `.t`. Or, set `.resolve` to a summary function like `.resolve = mean` or `.resolve = function(x) mean(x, na.rm = TRUE)` to resolve inconsistencies before filling in. If you have some `.i`/`.t` combinations with both missing and non-missing values, the missing values will be filled in using the same function.
```r
inconsistent_df <- pibble(
  i = c(1, 1, 1, 2, 2, 2),
  t = c(1, 1, 2, 1, 2, 3),
  x = c(1, 2, NA, 1, 2, 3),
  .i = i,
  .t = t
)

inconsistent_df %>% mutate(
  x_filled = panel_locf(x, .resolve = mean)
)
#> # A pibble: 6 x 4
#> # Individual-level identifier (.i): i [2]
#> # Time variable (.t): t [3]
#> # Gap (.d): 1
#>       i     t     x x_filled
#>   <dbl> <dbl> <dbl>    <dbl>
#> 1     1     1     1      1
#> 2     1     1     2      2
#> 3     1     2    NA      1.5
#> 4     2     1     1      1
#> 5     2     2     2      2
#> 6     2     3     3      3
```
The rest of the options include `.group_i` (by default, if `.i` can be found, data will be filled within-individual; set `.group_i = FALSE` to ignore this, as sketched below), and standard arguments related to declaring the panel structure of the data (`.i`, `.t`, `.d`, `.uniqcheck`; see the `pibble` material in the “Get Started” vignette).
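For example, a sketch of turning the grouping off entirely (no output shown; this assumes you really do want values carried across individuals, ordered only by `.t`):

```r
# Ignore the individual identifier and carry values forward across the
# whole data set. Usually not what you want for true panel data, but
# occasionally useful for hierarchical data sharing one time dimension.
df %>% mutate(x_filled_all = panel_locf(x, .group_i = FALSE))
```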
In panel data, and especially hierarchical data, there are some variables that should be fixed within values of other variables. And if they’re not, you have a problem!
For example, consider the data set
```r
df <- data.frame(
  continent = c("Asia", "Europe", "Europe", "S America", "S America"),
  country = c("France", "France", "France", "Brazil", "Brazil"),
  year = c(2000, 2001, 2002, 2000, 2001)
)

df
#>   continent country year
#> 1      Asia  France 2000
#> 2    Europe  France 2001
#> 3    Europe  France 2002
#> 4 S America  Brazil 2000
#> 5 S America  Brazil 2001
```
The variable `continent` should never change within values of `country` - a country can’t change the continent it’s on! The fact that France changes continents from year to year in this data should be regarded as very fishy. It will be handy to spot these sorts of potential errors in your data set, and fix them if you think you know how.
`fixed_check()` will look in your data `.df` for inconsistencies in the value of some variables `.var` within values of other variables `.within`.
```r
fixed_check(.df,
  .var = NULL,
  .within = NULL
)
```
You should pick variables for `.var` that are supposed to be constant within combinations of `.within`.
If your data has problems and is inconsistent, `fixed_check()` will return a list of data sets, one for each `.var` variable, containing the subset of the data that gives you problems. For our `df` with the France problem, that’s all of the France observations!
```r
fixed_check(df, .var = continent, .within = country)$continent
#> # A tibble: 3 x 3
#>   continent country  year
#>   <chr>     <chr>   <dbl>
#> 1 Asia      France   2000
#> 2 Europe    France   2001
#> 3 Europe    France   2002
```
If your data is fine, and all `.var` variables are indeed constant within combinations of `.within`, then `fixed_check()` will return `TRUE`.
```r
consistent_df <- data.frame(
  state = c(1, 1, 1, 2, 2, 2),
  year = c(2000, 2001, 2001, 2000, 2000, 2001),
  treatment = c(F, T, T, T, T, F),
  outcome = c(4.4, 3.2, 3.4, 5.5, 5.6, 8)
)

# Since this policy treatment is administered on the state level,
# everyone in the same state/year should get the same treatment.
# And they do!
fixed_check(consistent_df, .var = treatment, .within = c(state, year))
#> [1] TRUE
```
Some handy `fixed_check()` tips:

- `fixed_check()` returns either a `logical` or a `list` depending on the outcome. If you want to just get `FALSE` instead of a list of data sets, do `is.logical(fixed_check())` instead of `fixed_check()`.
- Use `fixed_force()` (see below) to change the observations to be consistent, or to drop all inconsistent observations with `fixed_force(.resolve = "drop")`.
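For instance, a small sketch of that first tip, reusing the two data sets defined above (the return values follow from the `fixed_check()` output shown earlier):

```r
# Consistent data: fixed_check() returns TRUE, which is a logical.
is.logical(fixed_check(consistent_df, .var = treatment, .within = c(state, year)))
# Inconsistent data: fixed_check() returns a list of problem subsets,
# so is.logical() comes back FALSE.
is.logical(fixed_check(df, .var = continent, .within = country))
```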
`fixed_force()` will take a data set `.df`, find any inconsistencies in the variables `.var` within combinations of the variables `.within`, and will “fix” those inconsistencies, using the function `.resolve` to select the correct values. It can flag any changed values with a new variable named by `.flag`.
```r
fixed_force(.df,
  .var = NULL,
  .within = NULL,
  .resolve = mode_order,
  .flag = NA
)
```
The default resolution function is `mode_order()` (see the Additional Calculations section below), which calculates the mode, selecting the first-ordered value in the data if there are ties. The mode seems most relevant here, since the most likely (and responsible) use for `fixed_force()` is when you have data that is mostly correct but just has a few odd values that are likely miscodes. `mode_order()` is also not limited to numeric variables.
Continuing with our France-in-Asia data set,
```r
fixed_force(df, .var = continent, .within = country, .flag = "altered")
#> # A tibble: 5 x 4
#>   continent country  year altered
#>   <chr>     <chr>   <dbl> <lgl>
#> 1 Europe    France   2000 TRUE
#> 2 Europe    France   2001 FALSE
#> 3 Europe    France   2002 FALSE
#> 4 S America Brazil   2000 FALSE
#> 5 S America Brazil   2001 FALSE
```
Another option for `.resolve` is to set `.resolve = "drop"` (or any other character, really), and it will drop the inconsistent observations.
```r
fixed_force(df, .var = continent, .within = country, .resolve = "drop")
#> # A tibble: 2 x 3
#>   continent country  year
#>   <chr>     <chr>   <dbl>
#> 1 S America Brazil   2000
#> 2 S America Brazil   2001
```
`pmdplyr` contains several additional functions that produce calculations of interest.
`between_i()` performs the between transformation. In particular, it isolates the variation between `.i` groups in a variable `.var`, throwing out all variation within `.i` groups. The result is identical within combinations of `.i`.
The specific calculation that is performed is
\[between.i(x) = \bar{x}_i - \bar{x}\]
where \(\bar{x}_i\) is the mean of `x` within the `.i` groups, and \(\bar{x}\) is the grand mean of `x` over all observations.
Be aware that this is different from `plm::between()`, which returns \(\bar{x}_i\) and does not subtract out \(\bar{x}\).
The syntax for `between_i()` is:
```r
between_i(.var,
  .df = get(".", envir = parent.frame()),
  .fcn = function(x) mean(x, na.rm = TRUE),
  .i = NULL, .t = NULL,
  uniqcheck = FALSE
)
```
where `.var` is the variable on which the transformation is performed, and `.df` is the data set. If the data set is being passed in via `%>% mutate()` or similar, then `.df` will automatically pick it up and you don’t need to specify it. `.fcn` is the function applied to calculate the group and grand values, i.e. \(.fcn(x) = \bar{x}\). The standard definition of the between transformation is for this to be the mean, but it has been left flexible (see the sketch after the example below).
The rest of the options include standard arguments related to declaring the panel structure of the data (`.i`, `.t`, `.uniqcheck`; see the `pibble` material in the “Get Started” vignette). `.d` is omitted because it is irrelevant to the calculation.
An example of the between transformation follows:
```r
df <- pibble(
  i = c(1, 1, 2, 2),
  x = 1:4,
  .i = i
) %>%
  mutate(between_x = between_i(x))

# Notice that the grand mean is...
mean(df$x)
#> [1] 2.5

# And the mean within groups is...
df %>%
  group_by(i) %>%
  summarize(x = mean(x))
#> # A pibble: 2 x 2
#> # Individual-level identifier (.i): i [2]
#> # Gap (.d): 1
#>       i     x
#>   <dbl> <dbl>
#> 1     1   1.5
#> 2     2   3.5

# So the between calculation should be
# 1.5 - 2.5 = -1 and 3.5 - 2.5 = 1 for the different groups:
df$between_x
#> [1] -1 -1  1  1
```
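Since `.fcn` has been left flexible, here is a minimal sketch swapping in the median. With this tiny `df` the group and grand medians happen to equal the means, so the result should match `between_x`:

```r
# Use the median rather than the mean to summarize each group and the
# whole sample. For x = 1:4 the medians coincide with the means.
df %>% mutate(between_med_x = between_i(x, .fcn = median))
```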
`within_i()` performs the within transformation. In particular, it isolates the variation within `.i` groups in a variable `.var`, throwing out all variation between `.i` groups. The result averages out to 0 within combinations of `.i`.
The specific calculation that is performed is
\[within.i(x) = x_i - \bar{x}_i\]
where \(\bar{x}_i\) is the mean of `x` within the `.i` groups.
The syntax for `within_i()` is:
```r
within_i(.var,
  .df = get(".", envir = parent.frame()),
  .fcn = function(x) mean(x, na.rm = TRUE),
  .i = NULL, .t = NULL,
  uniqcheck = FALSE
)
```
where `.var` is the variable on which the transformation is performed, and `.df` is the data set. If the data set is being passed in via `%>% mutate()` or similar, then `.df` will automatically pick it up and you don’t need to specify it. `.fcn` is the function applied to calculate the group values, i.e. \(.fcn(x) = \bar{x}_i\). The standard definition of the within transformation is for this to be the mean, but it has been left flexible.
The rest of the options include standard arguments related to declaring the panel structure of the data (`.i`, `.t`, `.uniqcheck`; see the `pibble` material in the “Get Started” vignette). `.d` is omitted because it is irrelevant to the calculation.
An example of the within transformation follows:
```r
df <- pibble(
  i = c(1, 1, 2, 2),
  x = 1:4,
  .i = i
) %>%
  mutate(within_x = within_i(x))

# Notice that the mean within groups is...
df %>%
  group_by(i) %>%
  summarize(x = mean(x))
#> # A pibble: 2 x 2
#> # Individual-level identifier (.i): i [2]
#> # Gap (.d): 1
#>       i     x
#>   <dbl> <dbl>
#> 1     1   1.5
#> 2     2   3.5

# So the within calculation should be
# 1 - 1.5 = -.5 and 2 - 1.5 = .5 for individual 1,
# and 3 - 3.5 = -.5 and 4 - 3.5 = .5 for individual 2:
df$within_x
#> [1] -0.5  0.5 -0.5  0.5
```
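As a quick sanity check implied by the two formulas above, the within and between pieces plus the grand mean should reconstruct `x` itself. A sketch, reusing the `df` just defined:

```r
# within_i(x) + between_i(x) = (x - x_bar_i) + (x_bar_i - x_bar) = x - x_bar,
# so adding back the grand mean mean(x) should recover the original x.
df %>% mutate(recovered_x = within_i(x) + between_i(x) + mean(x))
```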
R does not have a base function for calculating the mode of a vector. But `fixed_force()` wanted one, and so here we are. This function has been exported for general use in case it comes in handy elsewhere.

In particular, `mode_order()` calculates the mode of a vector and, if there are ties between two different values, selects the one that comes earlier in the original vector order.
```r
# 2 appears twice while everything else appears once; 2 is the mode.
x <- c(1, 2, 2, NA, 5, 3, 4)
mode_order(x)
#> [1] 2

# 1 or 2 could be the mode.
# Ties are broken by order in the vector.
x <- c(2, 2, 1, 1)
mode_order(x)
#> [1] 2
```
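Because `mode_order()` behaves like an ordinary summary function, it can also be handed to arguments that expect one, such as `.resolve`. A hypothetical sketch (no output shown), reusing `inconsistent_df` from the `panel_locf()` discussion:

```r
# Resolve inconsistent i/t combinations by taking the first-ordered mode
# instead of the mean before filling in.
inconsistent_df %>% mutate(x_filled = panel_locf(x, .resolve = mode_order))
```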