The pmdplyr package is an extension to dplyr designed for cleaning and managing panel and hierarchical data. It contains variations on the dplyr `_join` functions that address common panel data needs, as well as functions for managing and cleaning panel data. The goal is to get you a nice tidy `pibble` panel data object, which you can `panel_convert()` for use in one of the many packages that help you analyze panel data.
Unlike other panel data packages, functions in pmdplyr are all designed to work even if there is more than one observation per individual per period. This comes in handy if each individual is observed multiple times per period - for example, multiple classes per student per term; or if you have hierarchical data - for example, multiple companies per country.
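As a base-R sketch of the kind of structure this allows (made-up data, not from the package): student-term does not uniquely identify rows here, and a resolve function such as `mean` can collapse disagreements whenever a single value per student-term is needed.

``` r
# Hypothetical data: two classes per student per term, so
# (student, term) does NOT uniquely identify an observation
grades <- data.frame(
  student = c(1, 1, 1, 1, 2, 2),
  term = c(1, 1, 2, 2, 1, 1),
  grade = c(80, 90, 70, 75, 60, 65)
)

# Collapsing to one value per student-term with a resolve
# function (here, the mean), base-R style
avg <- aggregate(grade ~ student + term, data = grades, FUN = mean)
```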
Examples of pmdplyr use are below. These examples only cover some of the functionality of the package. See the Reference page on the pkgdown site for a full list of functions.
You can install pmdplyr from CRAN:
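``` r
install.packages("pmdplyr")
```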
The development version can be installed from GitHub:
``` r
# install.packages("devtools")
devtools::install_github("NickCH-K/pmdplyr")
```
Let’s start with the fairly straightforward `Scorecard` data, which describes how well students who attended a given college are doing years after attendance. `Scorecard` observations are uniquely identified by college ID `unitid` and year `year`.
``` r
# Note that pmdplyr automatically loads dplyr as well
library(pmdplyr)

data(Scorecard)

# Let's declare it as a pibble panel-data tibble
Scorecard <- as_pibble(Scorecard, .i = unitid, .t = year)

# We also have this data on the December unemployment rate
# for US college grads nationally, but only every other year
unemp_data <- data.frame(
  unemp_year = c(2006, 2008, 2010, 2012, 2014, 2016, 2018),
  unemp = c(.017, .036, .048, .040, .028, .025, .020)
)
```
I am interested in measuring the differences in ex-student earnings `earnings_med` between two-year and four-year colleges (`pred_degree_awarded_ipeds == 2` or `3`, respectively). But before we can do that, we need to clean the data.
``` r
Scorecard %>%
  # We want pred_degree_awarded_ipeds to be consistent within college. No changers!
  # So let's drop inconsistent colleges by using fixed_force with .resolve = "drop"
  fixed_force(
    .var = pred_degree_awarded_ipeds,
    .within = unitid,
    .resolve = "drop"
  ) %>%
  # Then, get rid of pred_degree_awarded_ipeds == 1
  # And simplify our terms
  filter(pred_degree_awarded_ipeds %in% c(2, 3)) %>%
  mutate(FourYear = pred_degree_awarded_ipeds == 3) %>%
  # earnings_med has some missing values - let's fill them in with
  # the most recent nonmissing observations we have
  # - panel_locf respects the panel structure declared above with as_pibble()
  mutate(earnings_med = panel_locf(earnings_med)) %>%
  # Now let's bring in that unemployment data!
  # Since it's every other year, it won't line up properly
  # in the join. So let's use inexact_join to get the MOST RECENT
  # year to join with
  inexact_left_join(unemp_data,
    var = year, jvar = unemp_year,
    method = "last"
  ) %>%
  # To adjust for state-level trends, let's also control for a tlag of
  # average earnings within state.
  # The lag is at the state level, and state-year doesn't uniquely identify,
  # but that's okay! We just pick a .resolve function to handle disagreements.
  # (We could also do this straight in the regression model itself)
  mutate(lag_state_earnings = tlag(earnings_med,
    .i = state_abbr,
    .t = year,
    .resolve = mean
  )) -> scorecard_clean

# Now we can run a basic regression.
lm(
  earnings_med ~ FourYear + unemp + lag_state_earnings,
  data = scorecard_clean
) %>%
  summary()
#> 
#> Call:
#> lm(formula = earnings_med ~ FourYear + unemp + lag_state_earnings, 
#>     data = scorecard_clean)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -25341  -4917   -528   4091  54587 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         5.933e+03  1.772e+03   3.348 0.000826 ***
#> FourYearTRUE        9.088e+03  3.511e+02  25.886  < 2e-16 ***
#> unemp              -4.564e+04  2.529e+04  -1.805 0.071215 .  
#> lag_state_earnings  7.348e-01  4.027e-02  18.244  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 8209 on 2474 degrees of freedom
#>   (25161 observations deleted due to missingness)
#> Multiple R-squared:  0.3431, Adjusted R-squared:  0.3423 
#> F-statistic: 430.6 on 3 and 2474 DF,  p-value: < 2.2e-16
```
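To see the logic behind `method = "last"` in that inexact join, here is a base-R sketch of the same idea: for each observation year, match the most recent `unemp_year` at or before it. `findInterval()` implements exactly this matching rule (the precise boundary conventions of `inexact_left_join()` may differ slightly).

``` r
unemp_data <- data.frame(
  unemp_year = c(2006, 2008, 2010, 2012, 2014, 2016, 2018),
  unemp = c(.017, .036, .048, .040, .028, .025, .020)
)

# Some observation years that don't all line up with unemp_year
obs_years <- c(2007, 2012, 2013)

# findInterval() returns, for each obs_year, the index of the
# largest unemp_year that is <= obs_year - the "most recent" match
idx <- findInterval(obs_years, unemp_data$unemp_year)
matched_unemp <- unemp_data$unemp[idx]
```

So 2007 picks up the 2006 rate, and both 2012 and 2013 pick up the 2012 rate.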
We could even improve that code. Why not check our merges as we go with `safe_join()`? When we do the `inexact_left_join()`, for example, we’re assuming that `unemp_data` is uniquely identified by `unemp_year`. Is it really? `safe_join()` would check for us and minimize error.
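The check that `safe_join()` automates is easy to express by hand. For instance, in base R, verifying that `unemp_year` really does uniquely identify the rows of `unemp_data`:

``` r
unemp_data <- data.frame(
  unemp_year = c(2006, 2008, 2010, 2012, 2014, 2016, 2018),
  unemp = c(.017, .036, .048, .040, .028, .025, .020)
)

# Unique identification means no duplicated key values:
# anyDuplicated() returns 0 when every key is distinct
is_unique_id <- anyDuplicated(unemp_data$unemp_year) == 0
```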
Let’s shift focus to the `SPrail` data set, which is not nearly as well behaved as `Scorecard`! This data contains the price of 2,000 train trips taken on the Spanish rail system. At any given time period, there are multiple tickets sold per route, which is identified by `origin` and `destination`. And the `insert_date` of sale is down to the minute - far more precise than we might actually want to use.
Let’s say we are interested in how much of a premium you’ll have to pay for different ticket types relative to the cheapo option - a tourist ticket with a transfer (`train_class == "Turista con enlace"`) - accounting for differences in prices between routes. We have some difficulties to cover: making the ID and time variables behave, accounting for the between-route differences, and figuring out how to compare each price to the cheapo price.
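Before running the pipeline, it may help to see what "isolating the within variation" means. A base-R sketch with hypothetical prices: subtract each route's mean price so that only within-route variation remains. This is the idea behind `within_i()`, which additionally respects the declared pibble structure.

``` r
prices <- data.frame(
  route = c("A", "A", "B", "B"),
  price = c(50, 70, 100, 120)
)

# ave() computes each route's mean price; subtracting it
# demeans price within route
prices$price_w <- prices$price - ave(prices$price, prices$route)
```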
``` r
data(SPrail)

SPrail %>%
  # We have two ID variables - origin and destination.
  # pmdplyr has no problem with this, but maybe we want to export
  # to something like plm later, which can't handle it.
  # So let's use id_variable to combine them into one
  mutate(route_ID = id_variable(origin, destination)) %>%
  # We have a time variable down to the minute. Too fine-grained!
  # Let's back things up to the daily level, and
  # create a nice integer time variable that's easy to use
  mutate(day = time_variable(insert_date, .method = "day")) %>%
  # Now we can declare a pibble
  as_pibble(.i = route_ID, .t = day) %>%
  # We want to account for between-route differences in price,
  # so let's isolate the within variation
  mutate(price_w = within_i(price)) %>%
  # We want to compare to the cheapo option, so let's use
  # mutate_subset to get the average price of the cheapo option
  # and propagate that to the other options for comparison
  mutate_subset(
    cheapo_price = mean(price, na.rm = TRUE),
    .filter = train_class == "Turista con enlace"
  ) %>%
  mutate(premium = price - cheapo_price) %>%
  filter(train_class %in% c("Preferente", "Turista", "Turista Plus")) %>%
  # Now let's compare premia
  group_by(train_class) %>%
  summarize(premium = mean(premium, na.rm = TRUE)) -> sprail_compare_premia

sprail_compare_premia
#> # A tibble: 3 x 2
#>   train_class  premium
#>   <fct>          <dbl>
#> 1 Preferente     29.9 
#> 2 Turista         8.36
#> 3 Turista Plus   14.4
```
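The first two steps of that pipeline can also be sketched in base R, with hypothetical values: combining two ID variables into a single route ID, and collapsing a to-the-minute datetime into an integer day index. pmdplyr's `id_variable()` and `time_variable()` offer more methods and options than this.

``` r
trips <- data.frame(
  origin = c("MADRID", "MADRID", "SEVILLA"),
  destination = c("SEVILLA", "VALENCIA", "MADRID"),
  insert_date = as.POSIXct(
    c("2019-04-12 09:30:00", "2019-04-12 17:45:00", "2019-04-13 08:00:00"),
    tz = "UTC"
  )
)

# One combined ID per origin-destination pair
trips$route_ID <- as.integer(factor(paste(trips$origin, trips$destination)))

# Integer day index: days elapsed since the earliest date, plus 1
d <- as.Date(trips$insert_date)
trips$day <- as.integer(d - min(d)) + 1L
```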
And so there we have it - `Preferente` will really set you back relative to the cheapo ticket on the same route.
Please note that the ‘pmdplyr’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.