pmdplyr.Rmd
library(pmdplyr)
The `pmdplyr` package is an extension to `dplyr` designed for cleaning and managing panel and hierarchical data. It contains variations on `dplyr::mutate()` and the `dplyr` join functions that address common panel data needs, along with new functions for managing and cleaning panel data.
Unlike other panel data packages, the functions in `pmdplyr` are all designed to work even if there is more than one observation per individual per period. This comes in handy if each individual is observed multiple times per period (for example, multiple classes per student per term) or if you have hierarchical data (for example, multiple companies per country).
There are three vignettes in total describing the contents of `pmdplyr`:

- “pmdplyr”/“Get Started” (the current vignette), which describes the `pibble` panel data object type and the `pmdplyr` tools for creating well-behaved ID and time variables, `id_variable()` and `time_variable()`.
- “dplyr variants”, which describes the `pmdplyr` variations on the `dplyr` functions `mutate()` (`mutate_cascade()` and `mutate_subset()`), `_join` (`inexact_join()` and `safe_join()`), and `lag()` (`tlag()`).
- “Panel Tools”, which describes novel tools that `pmdplyr` provides for cleaning and manipulating panel data (`panel_fill()`, `panel_locf()`, `fixed_check()`, `fixed_force()`, `between_i()`, `within_i()`, `mode_order()`).
The `pibble` data type is a `tibble` data frame with three additional attributes: `.i`, the set of variables in the `pibble` that identifies individuals; `.t`, the variable in the `pibble` that identifies the time index; and `.d`, the gap between periods of `.t`. If the gap between time periods doesn’t matter and any two adjacent observed time periods should be treated as consecutive, set `.d = 0`.
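To make the role of `.d` concrete, here is a minimal base R sketch of what "the previous period" means under a fixed gap versus `.d = 0`. The helper `previous_period()` is hypothetical and written only for illustration; it is not part of `pmdplyr` and may not match how the package handles gaps internally.

```r
# Hypothetical helper (not a pmdplyr function): for each time value, return
# the time value that would count as "the previous period" under gap d.
previous_period <- function(t, d = 1) {
  if (d == 0) {
    # Gap ignored: the previous period is the next-lowest time value observed
    observed <- sort(unique(t))
    position <- match(t, observed) - 1
    position[position == 0] <- NA # the earliest period has no predecessor
    observed[position]
  } else {
    # Fixed gap: the previous period is exactly d earlier, if it is observed
    candidate <- t - d
    candidate[!(candidate %in% t)] <- NA
    candidate
  }
}

years <- c(2000, 2001, 2003)
previous_period(years, d = 1) # NA 2000 NA: 2002 is missing, so 2003 has no predecessor
previous_period(years, d = 0) # NA 2000 2001: 2001 counts as the period before 2003
```

With a fixed gap, a missing period breaks the chain; with a gap of zero, only the ordering of the observed periods matters.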
`pibble` status will be maintained if the `pibble` is modified using functions from `pmdplyr` or `dplyr` (unless you use those functions in a way that drops or renames the `.i` or `.t` variables, in which case `pibble` status will be lost). Other data manipulation functions may remove `pibble` status, in which case the data must be re-declared as a `pibble`.
Most functions in `pmdplyr` will allow you to declare `.i` and `.t` in the function itself. But if the data passed in is already a `pibble`, for example via a `dplyr` verb, there is no need.
`pibble`s can be declared in two main ways: raw, via `pibble()`:
pibble(..., .i = NULL, .t = NULL, .d = 1, .uniqcheck = FALSE )
or by transforming an existing `data.frame`, `list`, or `tbl_df` using `as_pibble()`:
as_pibble(x, .i = NULL, .t = NULL, .d = 1, .uniqcheck = FALSE, ... )
Both functions work exactly as `tibble::tibble()` and `tibble::as_tibble()` do, except that they also take the arguments `.i`, `.t`, and `.d`, with `.i` and `.t` accepting either unquoted or quoted variable names. The check for whether `.i` and `.t` uniquely identify your observations runs automatically the first time you create a `pibble` in each R session. To run it on later calls as well, set `.uniqcheck = TRUE`.
As a side bonus, you can check whether the variables `a`, `b`, `c` uniquely identify the observations in data set `d` by running `as_pibble(d, .i = c(a, b, c), .uniqcheck = TRUE)`. No warning? It’s uniquely identified!
```r
# .d = 1 by default, so in this data,
# a = 1, b = 3 comes one period after a = 1, b = 2.
basic_pibble <- pibble(
  a = c(1, 1, 1, 2, 2, 2),
  b = c(1, 2, 3, 2, 3, 3),
  c = 1:6,
  .i = a,
  .t = b
)

data(SPrail)
# In SPrail, insert_date does not imply regular gaps between
# time periods, so we set .d = 0
declared_pibble <- as_pibble(SPrail,
  .i = c(origin, destination),
  .t = insert_date,
  .d = 0
)
```
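The unique-identification check mentioned above can also be reproduced by hand, which may help clarify what `.uniqcheck` is looking for. The sketch below uses only base R; `uniquely_identifies()` is a hypothetical helper written for this illustration, not a `pmdplyr` function, and the data mirrors the columns of `basic_pibble`.

```r
# Hypothetical helper (not a pmdplyr function): TRUE when the named columns
# jointly identify every row of `data` exactly once.
uniquely_identifies <- function(data, vars) {
  anyDuplicated(data[, vars, drop = FALSE]) == 0
}

d <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),
  b = c(1, 2, 3, 2, 3, 3),
  c = 1:6
)

uniquely_identifies(d, c("a", "b"))      # FALSE: a = 2, b = 3 appears twice
uniquely_identifies(d, c("a", "b", "c")) # TRUE: c differs on every row
```

Note that `a` and `b` alone do not uniquely identify the rows here, which is exactly the multiple-observations-per-period case that `pibble` is designed to tolerate.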
`pmdplyr` also has the function `panel_convert()`, which allows you to convert between different popular R panel data objects, including `pibble`. This can come in handy for creating `pibble`s, or for exporting your cleaned `pibble` to use with a package that does panel data analysis (which `pmdplyr` does not):
panel_convert( data, to, ... )
where `data` is a panel data object (a `pibble`, `tsibble`, `pdata.frame`, or `panel_data`) and `to` is the type of object you’d like returned. You can refer to the target type by object name, object class, or package name: get a `pibble` with `"pmdplyr"`, `"pibble"`, or `"tbl_pb"`; a `tsibble` with `"tsibble"` or `"tbl_ts"`; a `pdata.frame` with `"plm"` or `"pdata.frame"`; or a `panel_data` with `"panelr"` or `"panel_data"`. `...` sends additional arguments to the functions used to declare those objects.
When using `panel_convert()`, be aware that any grouping will be lost, and you must have the relevant package for your `to` option installed (`tsibble`, `plm`, or `panelr`). When your `data` object is a `pdata.frame`, it is recommended to also have `sjlabelled` installed.
All valid objects of the non-`pibble` types can be converted to `pibble`s, but the reverse is not true, since `pibble` does not enforce some strict requirements that other types do:
Feature/Requirement | `pibble` | `tsibble` | `pdata.frame` | `panel_data` |
---|---|---|---|---|
ID | `.i` | `key` | `index[1]` | `id` |
Time | `.t` | `index` | `index[2]` | `wave` |
Gap control | `.d` | `regular` | No | No |
ID must exist | No | No | Yes | Yes |
Time must exist | No | Yes | Yes | Yes[1] |
Only one ID variable[2] | No | No | Yes | Yes |
Unique identification | No | Yes | No[3] | No[3] |
[1] `pdata.frame` does not require that time be provided, but if it is not provided, `pdata.frame` will create it based on the original ordering of the data. The `pdata.frame` option to set `index` equal to an integer for a balanced panel and have it figure out the rest by itself is not supported.

[2] Use `id_variable()` (described below) to generate a single ID variable from multiple variables if one is required.

[3] `pdata.frame` and `panel_data` do not require that ID and time uniquely identify the observations on declaring the data, but functions in these packages may not work correctly without unique identification.
In addition to the above, be aware that the different packages have different requirements for which variable classes can be used as time variables. `time_variable()` (described below) can build an integer variable that will work in all of these packages.
Any panel data set is defined by ID variable(s) (`.i`) that indicate the individual person/firm/country/etc. you’re talking about, and a time variable (`.t`) that tells you when the observation in question was recorded.
While `pmdplyr` allows multiple ID variables, many panel data packages only allow one. `id_variable()` allows you to turn multiple ID variables into a single variable.
Many panel data packages, including `pmdplyr`, prefer a time variable that is an integer, so that the difference between each period is 1 (or `.d` in `pmdplyr`’s case). The function `time_variable()` can help you generate an integer time variable from a `Date`-class variable, or from one or more variables containing, for example, year and month.
`id_variable()` syntax follows:
id_variable(..., .method = "number", .minwidth = FALSE )
where `...` is the set of identity variables that you want to combine into a single one (or, potentially, a single variable you’d like to encode numerically).
`.method` describes the way in which you’d like the variable encoded:
- `.method = "number"` assigns consecutive numeric codes in the order in which the values first appear in the data.
- `.method = "random"` assigns numeric codes from `0` to `10*N` (where `N` is the number of codes to be assigned) in a random order, so it will be more difficult to uncover the original identity variables.
- `.method = "character"` preserves all original information and combines the variables together into a string, adding spacing to ensure uniqueness. Set `.minwidth = TRUE` to remove the spacing, although this may lead to non-uniqueness in some cases.

```r
library(dplyr) # provides %>% and mutate()

df <- data.frame(
  country = c(
    "US", "US", "US", "US",
    "ENG", "ENG", "ENG", "ENG"
  ),
  city = c(
    "NYC", "NYC", "Cambridge", "NYC",
    "Cambridge", "London", "Manchester", "Manchester"
  )
) %>%
  mutate(
    numeric_ID = id_variable(country, city),
    random_ID = id_variable(country, city, .method = "random"),
    char_ID = id_variable(country, city, .method = "character")
  )

df
#>   country       city numeric_ID random_ID           char_ID
#> 1      US        NYC          1         1 |US|.|NYC|.......
#> 2      US        NYC          1         1 |US|.|NYC|.......
#> 3      US  Cambridge          2        27 |US|.|Cambridge|.
#> 4      US        NYC          1         1 |US|.|NYC|.......
#> 5     ENG  Cambridge          3        42 |ENG||Cambridge|.
#> 6     ENG     London          4        32 |ENG||London|....
#> 7     ENG Manchester          5        30 |ENG||Manchester|
#> 8     ENG Manchester          5        30 |ENG||Manchester|
```
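For intuition about the "number" encoding, the following base R sketch numbers each distinct combination of ID variables by the order in which it first appears. The helper `number_ids()` is hypothetical and is not how `id_variable()` is implemented; it also assumes the separator character never occurs in the data.

```r
# Hypothetical helper (not pmdplyr's implementation): number each distinct
# combination of ID variables by the order in which it first appears.
number_ids <- function(...) {
  # Join the variables with a separator assumed not to occur in the data,
  # so that different combinations cannot collapse into the same key
  key <- paste(..., sep = "\u001f")
  match(key, unique(key))
}

country <- c("US", "US", "US", "US", "ENG", "ENG", "ENG", "ENG")
city <- c(
  "NYC", "NYC", "Cambridge", "NYC",
  "Cambridge", "London", "Manchester", "Manchester"
)

number_ids(country, city) # 1 1 2 1 3 4 5 5
```

For this data the result agrees with the `numeric_ID` column shown in the output above. Note that Cambridge in the US and Cambridge in England receive different codes, because the combination of variables is what gets numbered.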
`time_variable()` syntax follows:
time_variable(..., .method = "present", .datepos = NA, .start = 1, .skip = NA, .breaks = NA, .turnover = NA, .turnover_start = NA )
where `...` is the set of variables that you want to combine into a single, `integer`-class time variable. The rest of the options determine how the variable(s) will be read or transformed; the need for each varies depending on the structure of the original data and which `.method` is used.
`.method` can take the values:
- `.method = "present"` assumes that, even if each individual may have some missing periods, each period is present in your data somewhere, and so simply numbers, in order, all the time periods observed in the data.
- `.method = "year"` can be used with a single `Date`/`POSIX`/etc.-type variable (anything that allows `lubridate::date()`) and will extract the year from it. Or, use it with a character or numeric variable and indicate with `.datepos` the character/digit positions that hold the year in YY or YYYY format. If combined with `.breaks` or `.skip`, it will instead set the earliest year in the data to 1 rather than returning the actual year.
- `.method = "month"` can be used with a single `Date`/`POSIX`/etc.-type variable (anything that allows `lubridate::date()`). It will give the earliest-observed month in the data set a value of `1`, and will increment from there. Or, use it with a character or numeric variable and indicate with `.datepos` the character/digit positions that hold the year and month in YYMM or YYYYMM format (note that if your variable is in MMYYYY format, for example, you can just give a `.datepos` argument like `c(3:6,1:2)`). Months turn over on the `.start` day of the month, which is by default 1.
- `.method = "week"` can be used with a single `Date`/`POSIX`/etc.-type variable (anything that allows `lubridate::date()`). It will give the earliest-observed week in the data set a value of `1`, and will increment from there. Weeks turn over on the `.start` day, which is by default 1 (Monday). Note that this method always starts weeks on the same day of the week, which is different from the standard `lubridate` procedure of counting sets of 7 days starting from January 1.
- `.method = "day"` can be used with a single `Date`/`POSIX`/etc.-type variable (anything that allows `lubridate::date()`). It will give the earliest-observed day in the data set a value of `1`, and increment from there. Or, use it with a character or numeric variable and indicate with `.datepos` the character/digit positions that hold the year, month, and day in YYMMDD or YYYYMMDD format. To skip certain days of the week, such as weekends, use the `.skip` option.
- `.method = "turnover"` can be used when you have more than one variable in `...` and they are all numeric nonnegative integers. Set the `.turnover` option to indicate the highest value each variable takes before it starts over, and set `.turnover_start` to indicate what value it takes when it starts over. It cannot be combined with `.skip` or `.breaks`. It doesn’t work with any variable for which the turnover values change, i.e. it doesn’t play well with days-in-month. If you’d like to do something like year-month-day-hour, I recommend running `.method = "day"` once with just the year-month-day variable, and then taking the result and combining that with hour in `.method = "turnover"`.

```r
library(dplyr) # provides %>%, select(), count(), and ends_with()

data(SPrail)
# Since we have a date variable, we can easily create integers that increment for each
# year, or for each month, etc.
# Likely we'd only really need one of these four, depending on our purposes
SPrail <- SPrail %>%
  dplyr::mutate(
    year_time_id = time_variable(insert_date, .method = "year"),
    month_time_id = time_variable(insert_date, .method = "month"),
    week_time_id = time_variable(insert_date, .method = "week"),
    day_time_id = time_variable(insert_date, .method = "day")
  )

# Let's see what we've got
SPrail %>%
  select(insert_date, ends_with("time_id")) %>%
  head()
#> # A tibble: 6 x 5
#>   insert_date         year_time_id month_time_id week_time_id day_time_id
#>   <dttm>                     <int>         <int>        <int>       <int>
#> 1 2019-04-12 20:17:04         2019             1            1           2
#> 2 2019-04-16 09:33:08         2019             1            2           6
#> 3 2019-05-08 09:04:07         2019             2            5          28
#> 4 2019-04-16 06:21:42         2019             1            2           6
#> 5 2019-05-02 07:03:34         2019             2            4          22
#> 6 2019-04-13 06:03:43         2019             1            1           3

# Perhaps I'd like quarterly data
# (although in this case there are only two months, not much variation there)
SPrail <- SPrail %>%
  dplyr::mutate(quarter_time_id = time_variable(insert_date,
    .method = "month",
    .breaks = c(1, 4, 7, 10)
  ))

# Should line up properly with month
SPrail %>% count(month_time_id, quarter_time_id)
#> # A tibble: 2 x 3
#>   month_time_id quarter_time_id     n
#>           <int>           <int> <int>
#> 1             1               1  1633
#> 2             2               1   367

# Maybe I'd like Monday to come immediately after Friday!
SPrail <- SPrail %>%
  dplyr::mutate(weekday_time_id = time_variable(insert_date,
    .method = "day",
    .skip = c(6, 7)
  ))

# Perhaps I'm interested in ANY time period in the data and just want to enumerate them in order
SPrail <- SPrail %>%
  dplyr::mutate(any_present_time_id = time_variable(insert_date,
    .method = "present"
  ))

# Note the weekday_time_id NAs - these are weekends! We told it to skip those.
head(SPrail %>% select(insert_date, day_time_id, weekday_time_id, any_present_time_id))
#> # A tibble: 6 x 4
#>   insert_date         day_time_id weekday_time_id any_present_time_id
#>   <dttm>                    <int>           <int>               <int>
#> 1 2019-04-12 20:17:04           2              NA                  96
#> 2 2019-04-16 09:33:08           6               4                 461
#> 3 2019-05-08 09:04:07          28              20                1899
#> 4 2019-04-16 06:21:42           6               4                 446
#> 5 2019-05-02 07:03:34          22              16                1670
#> 6 2019-04-13 06:03:43           3              NA                 130

# Maybe instead of being given a nice time variable, I was given it in string form
SPrail <- SPrail %>%
  dplyr::mutate(time_string = as.character(insert_date))

# As long as the character positions are consistent we can still use it
SPrail <- SPrail %>%
  dplyr::mutate(day_from_string_id = time_variable(time_string,
    .method = "day",
    .datepos = c(3, 4, 6, 7, 9, 10)
  ))

# Results are identical from using the actual Date variable
cor(SPrail$day_time_id, SPrail$day_from_string_id)
#> [1] 1
```
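As a rough illustration of what a month index looks like, the base R sketch below numbers calendar months so that the earliest month observed gets the value 1. The helper `month_index()` is hypothetical; it assumes months turn over on the first day of the month (the `.start = 1` case described above) and is not the code `time_variable()` actually runs.

```r
# Hypothetical helper (not pmdplyr's implementation): number calendar months
# so that the earliest month observed in `dates` is 1.
month_index <- function(dates) {
  parts <- as.POSIXlt(dates)
  # $year counts years since 1900 and $mon runs from 0 (January) to 11
  months_elapsed <- parts$year * 12 + parts$mon
  as.integer(months_elapsed - min(months_elapsed) + 1)
}

dates <- as.Date(c("2019-04-12", "2019-04-16", "2019-05-08", "2020-01-03"))
month_index(dates) # 1 1 2 10
```

Unlike a numbering of only the periods that are present, this index keeps calendar gaps: the jump from 2 to 10 reflects the months between May 2019 and January 2020 that do not appear in the data.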