library(pmdplyr)

The pmdplyr package is an extension to dplyr designed for cleaning and managing panel and hierarchical data. It contains variations on the dplyr::mutate() and dplyr::join() functions that address common panel data needs, and contains functions for managing and cleaning panel data.

Unlike other panel data packages, functions in pmdplyr are all designed to work even if there is more than one observation per individual per period. This comes in handy if each individual is observed multiple times per period - for example, multiple classes per student per term; or if you have hierarchical data - for example, multiple companies per country.

There are three vignettes in total describing the contents of pmdplyr:

  1. CURRENT VIGNETTE “pmdplyr”/“Get Started”, which describes the pibble panel data object type, and the pmdplyr tools for creating well-behaved ID and time variables id_variable() and time_variable().

  2. “dplyr variants”, which that describes pmdplyr variations on dplyr functions mutate() (mutate_cascade() and mutate_subset()), _join (inexact_join and safe_join()), and lag (tlag()).

  3. “Panel Tools”, which describes novel tools that pmdplyr provides for cleaning and manipulating panel data (panel_fill(), panel_locf(), fixed_check(), fixed_force(), between_i(), within_i(), mode_order()).


pibble

The pibble data type is a tibble data frame with additional attributes .i, which is a set of variables in the pibble that identifies individuals, .t, which is a variable in the pibble that indentifies the time index, and .d, which identifies the gap between periods of .t. If the gap between time periods doesn’t matter and any consecutive time periods should be treated as consecutive, set .d = 0.

pibble status will be maintained if the pibble is modified using functions from pmdplyr or dplyr (unless you use those functions in a way that drops or renames the .i or .t variables, in which case pibble status will be lost). Other data manipulation functions may remove pibble status, requiring it to be re-declared as a pibble.

Most functions in pmdplyr will allow you to declare .i and .t in the function itself. But if a pibble is passed in via a dplyr verb, there is no need.

pibble() and as_pibble()

pibbles can be declared in two main ways: raw, via pibble():

pibble(...,
  .i = NULL,
  .t = NULL,
  .d = 1,
  .uniqcheck = FALSE
)

or by transforming an existing data.frame, list, or tbl_df using as_pibble():

as_pibble(x,
  .i = NULL,
  .t = NULL,
  .d = 1,
  .uniqcheck = FALSE,
  ...
)

Both functions work exactly as tibble::tibble() and tibble::as_tibble() do, except that they also take the arguments .i, .t, and .d, with .i and .t accepting either unquoted or quoted variable names. If you’d like your pibble checked to see if .i and .t uniquely identify your observations, set .uniqcheck = TRUE. It will do this automatically the first time in each R session you create a pibble, but if you’d like it to keep doing it, use uniqcheck.

As a side bonus, you can check if the variables a, b, c uniquely identify the observations in data set d by running as_pibble(d, .i = c(a, b, c), .uniqcheck = TRUE). No warning? It’s uniquely identified!

# .d = 1 by default, so in this data,
# a = 1, b = 3 comes one period after a = 1, b = 2.
basic_pibble <- pibble(
  a = c(1, 1, 1, 2, 2, 2),
  b = c(1, 2, 3, 2, 3, 3),
  c = 1:6,
  .i = a,
  .t = b
)

data(SPrail)
# In SPrail, insert_date does not imply regular gaps between
# time periods, so we set .d = 0
declared_pibble <- as_pibble(SPrail,
  .i = c(origin, destination),
  .t = insert_date,
  .d = 0
)

panel_convert()

pmdplyr also has the function panel_convert() which allows you to convert between different popular R panel data objects, including pibble. This can come in handy for creating pibbles, or exporting your cleaned pibble to use with a package that does panel data analysis (which pmdplyr does not):

panel_convert(
  data,
  to,
  ...
)

Where data is a panel data object, either pibble, tsibble, pdata.frame, or panel_data, and to is the type of object you’d like returned, which you can refer to by object name, object class, or package name: get a pibble with "pmdplyr", "pibble", or "tbl_pb", a tsibble with "tsibble" or "tbl_ts", a pdata.frame with "plm" or "pdata.frame", or a panel_data with "panelr" or "panel_data". ... sends additional arguments to the functions used to declare those objects.

When using panel_convert, be aware that any grouping will be lost, and you must have the relevant package of your to option installed (tsibble, plm, or panelr). When your data object is a pdata.frame, it is recommended to also have sjlabelled installed.

All valid objects of the non-pibble types can be converted to pibbles, but the reverse is not true, since pibble does not enforce some strict requirements that other types do:

Feature/Requirement pibble tsibble pdata.frame panel_data
ID .i key index[1] id
Time .t index index[2] wave
Gap control .d regular No No
ID must exist No No Yes Yes
Time must exist No Yes Yes Yes[1]
Only one ID variable[2] No No Yes Yes
Unique identification No Yes No[3] No[3]

[1] pdata.frame does not require that time be provided, but if not provided will create it based on original ordering of the data. The pdata.frame option to set index equal to an integer for a balanced panel and have it figure out the rest by itself is not supported.

[2] Use id_variable() (described below) to generate a single ID variable from multiple if one is required.

[3] pdata.frame and panel_data do not require that ID and time uniquely identify the observations on declaring the data, but functions in these packages may not work correctly without unique identification.

In addition to the above, be aware that the different packages have different requirements on which variable classes can be Time variables. time_variable() (described below) can build an integer variable that will work in all packages.


ID and Time Variables

Any panel data dataset is defined by ID variable(s) (.i) that indicate the individual person/firm/country/etc. you’re talking about, and a time variable (.t) that tell you when the observation in question was recorded.

While pmdplyr allows multiple ID variables, many panel data packages only allow one. id_variable() allows you to turn multiple ID variables into a single variable.

Many panel data packages, including pmdplyr, prefer a time variable that is an integer, so that the difference between each period is 1 (or .d in pmdplyr’s case). The function time_variable() can help you generate an integer time variable from a Date-class variable, or from one or more variables containing, for example, year and month.

id_variable()

id_variable() syntax follows:

id_variable(...,
  .method = "number",
  .minwidth = FALSE
)

where ... is the set of identity variables that you want to combine into a single one (or, potentially, a single variable you’d like to encode numerically).

.method describes the way in which you’d like the variable encoded:

  • .method = number assigns consecutive numeric codes in the original order they appear in the data.
  • .method = random assigns numeric codes from 0 to 10*N (where N is the number of codes to be assigned) in a random order, so it will be more difficult to uncover the original identity variables.
  • .method = character preserves all original information and combines the variables together into a string, adding spacing to ensure uniqueness. Set .minwidth = TRUE to remove the spacing, although this may lead to non-uniqueness in some cases.
df <- data.frame(
  country = c(
    "US", "US", "US", "US",
    "ENG", "ENG", "ENG", "ENG"
  ),
  city = c(
    "NYC", "NYC", "Cambridge", "NYC",
    "Cambridge", "London", "Manchester", "Manchester"
  )
) %>%
  mutate(
    numeric_ID = id_variable(country, city),
    random_ID = id_variable(country, city, .method = "random"),
    char_ID = id_variable(country, city, .method = "character")
  )

df
#>   country       city numeric_ID random_ID           char_ID
#> 1      US        NYC          1         1 |US|.|NYC|.......
#> 2      US        NYC          1         1 |US|.|NYC|.......
#> 3      US  Cambridge          2        27 |US|.|Cambridge|.
#> 4      US        NYC          1         1 |US|.|NYC|.......
#> 5     ENG  Cambridge          3        42 |ENG||Cambridge|.
#> 6     ENG     London          4        32 |ENG||London|....
#> 7     ENG Manchester          5        30 |ENG||Manchester|
#> 8     ENG Manchester          5        30 |ENG||Manchester|

time_variable()

time_variable() syntax follows:

time_variable(...,
  .method = "present",
  .datepos = NA,
  .start = 1,
  .skip = NA,
  .breaks = NA,
  .turnover = NA,
  .turnover_start = NA
)

Where ... is the set of variables that you want to combine into a single, integer-class time variable. The rest of the options determine how the variable(s) will be read or transformed; the need for each varies depending on the structure of the original data and which .method is used.

.method can take the values:

  • .method = "present" will assume that, even if each individual may have some missing periods, each period is present in your data somewhere, and so simply numbers, in order, all the time periods observed in the data.
  • .method = "year" can be used with a single Date/POSIX/etc.-type variable (anything that allows lubridate::date()) and will extract the year from it. Or, use it with a character or numeric variable and indicate with .datepos the character/digit positions that hold the year in YY or YYYY format. If combined with .breaks or .skip, will instead set the earliest year in the data to 1 rather than returning the actual year.
  • .method = "month" can be used with a single Date/POSIX/etc.-type variable (anything that allows lubridate::date()). It will give the earliest-observed month in the data set a value of 1, and will increment from there. Or, use it with a character or numeric variable and indicate with .datepos the character/digit positions that hold the year and month in YYMM or YYYYMM format (note that if your variable is in MMYYYY format, for example, you can just give a .datepos argument like c(3:6,1:2)). Months turn over on the .start day of the month, which is by default 1.
  • .method = "week" can be used with a single Date/POSIX/etc.-type variable (anything that allows lubridate::date()). It will give the earliest-observed week in the data set a value of 1, and will increment from there. Weeks turn over on the .start day, which is by default 1 (Monday). Note that this method always starts weeks on the same day of the week, which is different from standard lubridate procedure of counting sets of 7 days starting from January 1.
  • .method = "day" can be used with a single Date/POSIX/etc.-type variable (anything that allows lubridate::date()). It will give the earliest-observed day in the data set a value of 1, and increment from there. Or, use it with a character or numeric variable and indicate with .datepos the character/digit positions that hold the year and month in YYMMDD or YYYYMMDD format. To skip certain days of the week, such as weekends, use the .skip option.
  • .method = "turnover" can be used when you have more than one variable in variable and they are all numeric nonnegative integers. Set the .turnover option to indicate the highest value each variable takes before it starts over, and set .turnover_start to indicate what value it takes when it starts over. Cannot be combined with .skip or .breaks. Doesn’t work with any variable for which the turnover values change, i.e. it doesn’t play well with days-in-month - if you’d like to do something like year-month-day-hour, I recommend running .method="day" once with just the year-month-day variable, and then taking the result and combining that with hour in .method = "turnover".
data(SPrail)

# Since we have a date variable, we can easily create integers that increment for each
# year, or for each month, etc.
# Likely we'd only really need one of these four, depending on our purposes
SPrail <- SPrail %>%
  dplyr::mutate(
    year_time_id = time_variable(insert_date, .method = "year"),
    month_time_id = time_variable(insert_date, .method = "month"),
    week_time_id = time_variable(insert_date, .method = "week"),
    day_time_id = time_variable(insert_date, .method = "day")
  )

# Let's see what we've got
SPrail %>%
  select(insert_date, ends_with("time_id")) %>%
  head()
#> # A tibble: 6 x 5
#>   insert_date         year_time_id month_time_id week_time_id day_time_id
#>   <dttm>                     <int>         <int>        <int>       <int>
#> 1 2019-04-12 20:17:04         2019             1            1           2
#> 2 2019-04-16 09:33:08         2019             1            2           6
#> 3 2019-05-08 09:04:07         2019             2            5          28
#> 4 2019-04-16 06:21:42         2019             1            2           6
#> 5 2019-05-02 07:03:34         2019             2            4          22
#> 6 2019-04-13 06:03:43         2019             1            1           3

# Perhaps I'd like quarterly data
# (although in this case there are only two months, not much variation there)
SPrail <- SPrail %>%
  dplyr::mutate(quarter_time_id = time_variable(insert_date,
    .method = "month",
    .breaks = c(1, 4, 7, 10)
  ))
# Should line up properly with month
SPrail %>%
  count(month_time_id, quarter_time_id)
#> # A tibble: 2 x 3
#>   month_time_id quarter_time_id     n
#>           <int>           <int> <int>
#> 1             1               1  1633
#> 2             2               1   367

# Maybe I'd like Monday to come immediately after Friday!
SPrail <- SPrail %>%
  dplyr::mutate(weekday_time_id = time_variable(insert_date,
    .method = "day",
    .skip = c(6, 7)
  ))

# Perhaps I'm interested in ANY time period in the data and just want to enumerate them in order
SPrail <- SPrail %>%
  dplyr::mutate(any_present_time_id = time_variable(insert_date,
    .method = "present"
  ))

# Note the weekday_time_id NAs - these are weekends! We told it to skip those.
head(SPrail %>% select(insert_date, day_time_id, weekday_time_id, any_present_time_id))
#> # A tibble: 6 x 4
#>   insert_date         day_time_id weekday_time_id any_present_time_id
#>   <dttm>                    <int>           <int>               <int>
#> 1 2019-04-12 20:17:04           2              NA                  96
#> 2 2019-04-16 09:33:08           6               4                 461
#> 3 2019-05-08 09:04:07          28              20                1899
#> 4 2019-04-16 06:21:42           6               4                 446
#> 5 2019-05-02 07:03:34          22              16                1670
#> 6 2019-04-13 06:03:43           3              NA                 130

# Maybe instead of being given a nice time variable, I was given it in string form
SPrail <- SPrail %>% dplyr::mutate(time_string = as.character(insert_date))
# As long as the character positions are consistent we can still use it
SPrail <- SPrail %>%
  dplyr::mutate(day_from_string_id = time_variable(time_string,
    .method = "day",
    .datepos = c(3, 4, 6, 7, 9, 10)
  ))
# Results are identical from using the actual Date variable
cor(SPrail$day_time_id, SPrail$day_from_string_id)
#> [1] 1