Create a single integer time period index variable

This function takes either multiple time variables, or a single Date-class variable, and creates a single integer time variable easily usable with functions in pmdplyr and other packages like plm and panelr.

time_variable(
  ...,
  .method = "present",
  .datepos = NA,
  .start = 1,
  .skip = NA,
  .breaks = NA,
  .turnover = NA,
  .turnover_start = NA
)

Arguments

...	variables (vectors) to be used to generate the time variable, in order of increasing specificity. So if you have a variable each for year, month, and day (with the names year, month, and day), you would use `year,month,day` (if a data set containing those variables has been attached using `with` or `dplyr`) or `data$year,data$month,data$day` (if not).
.method	The approach that will be taken to create your variable. See below for the options. By default, this is `.method = "present"`.
.datepos	A numeric vector containing the character/digit positions, in order, of the YY or YYYY year (or year/month in YYMM or YYYYMM format, or year/month/day in YYMMDD or YYYYMMDD) for the `.method="year"`, `.method="month"`, or `.method="day"` options, respectively. Give it only the data it needs - if you give `.method="year"` YYMM information, it will assume you're giving it YYYY and mess up. For example, if dates are stored as a character variable in the format '2013-07-21' and you want the year and month, you might specify `.datepos=c(1:4,6:7)`. If two-digit year is given, `.datepos` uses the `lubridate` package to determine century.
.start	A numeric value indicating the day of the week/month that begins a new week/month, if `.method="week"` or `.method="month"` is used. By default, 1, where for `.method="week"` 1 is Monday and 7 is Sunday. If used with `.method="month"`, the time data should include day as well.
.skip	A numeric vector containing the values of year, month, or day-of-week (where Monday = 1, Sunday = 7, no matter what value `.start` takes) you'd like to skip over (for `.method="year","month","week","day"`, respectively). For example, with `.method="month"` and `.skip=12`, an observation in January would be determined to come one period after November. Commonly this might be `.skip=c(6,7)` with `.method="day"` to skip weekends so that Monday immediately follows Friday. If `.breaks` is also specified, select the values of `.breaks` you would like to skip, but do be aware that combining `.skip` and `.breaks` can be tricky.
.breaks	A numeric vector containing the starting breakpoints of year or month you'd like to clump together (for `.method="year","month"`, respectively). Commonly, this might be `.breaks=c(1,4,7,10)` with `.method="month"` to go by quarter-year. The first element of `.breaks` should usually be 1.
.turnover	A numeric vector the same length as the number of variables included indicating the maximum value that the corresponding variable in the list of variables takes, where NA indicates no maximum value, for use with `.method="turnover"` and required for that method. For example, if the variable list is `year,month` then you might have `.turnover=c(NA,12)`. Or if the variable list is `days-since-jan1-1970,hour,minute,second` you might have `.turnover=c(NA,23,59,59)`. Defaults to the maximum observed value of each variable if not specified, and NA for the first variable. Note that in almost all cases, the first element of `.turnover` should be `NA`, and all others should be non-NA.
.turnover_start	A numeric vector the same length as the number of variables included indicating the minimum value that the corresponding variable in the list of variables takes, where NA indicates no minimum value, for use with `.method="turnover"`. For example, if the variable list is `year,month` then you might have `.turnover_start=c(NA,1)`. Or if the variable list is `days-since-jan1-1970,hour,minute,second` you might have `.turnover_start=c(NA,0,0,0)`. By default this is a vector of 1s the same length as the number of variables, except for the first element, which is NA. Note that in almost all cases, the first element of `.turnover_start` should be `NA`, and all others should be non-NA.
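
To make the `.datepos` convention concrete, the base R sketch below pulls out the characters at given positions of a date string. The helper `pick_positions()` is hypothetical and written only for illustration; it is not part of pmdplyr and does not claim to reproduce how time_variable() works internally.

```r
# Hypothetical helper (not part of pmdplyr): keep only the characters of each
# string that sit at the positions in `pos`, in the order given.
pick_positions <- function(x, pos) {
  vapply(
    strsplit(as.character(x), ""),
    function(chars) paste(chars[pos], collapse = ""),
    character(1)
  )
}

dates <- c("2013-07-21", "2014-01-05")

# Year and month in YYYYMM order: characters 1-4 and 6-7
pick_positions(dates, c(1:4, 6:7))
#> [1] "201307" "201401"

# For MMYYYY input, reordering the positions yields YYYYMM
pick_positions("072013", c(3:6, 1:2))
#> [1] "201307"
```

The order of the positions matters: they should list the year digits first, then month, then day, whatever order those pieces take in the stored value.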

Details

The pmdplyr library accepts only two kinds of time variables:

1. Ordinal time variables: Variables of any ordered type (numeric, Date, character) where the size of the gap between one value and the next does not matter. So if someone has two observations - one in period 3 and one in period 1, the period immediately before 3 is period 1, and two periods before 3 is missing. Set .d=0 in your data to use this.

2. Cardinal time variables: Numeric variables with a fixed gap between one observation and the next, where the size of that gap is given by .d. So if .d=1 and someone has two observations - one in period 3 and one in period 1, the period immediately before 3 is missing, and two periods before 3 is period 1.

If you would like to have a cardinal time variable but your data is not currently in that format, time_variable() will help you create a new variable that works with a setting of .d=1, the default.
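
The difference between the two kinds of time variable can be sketched in base R. The function `previous_period()` below is a hypothetical illustration, not a pmdplyr function; it assumes that "one period earlier" means the next-lowest observed value when .d=0, and the value exactly .d lower when .d is positive.

```r
# Hypothetical helper (not part of pmdplyr): for each period in `t`, return the
# period that counts as "one period earlier", or NA if there is none.
# d = 0 treats time as ordinal; d > 0 treats it as cardinal with gap d.
previous_period <- function(t, d) {
  if (d == 0) {
    # Ordinal: the previous period is the next-lowest value actually observed
    observed <- sort(unique(t))
    idx <- match(t, observed) - 1L
    observed[ifelse(idx >= 1L, idx, NA_integer_)]
  } else {
    # Cardinal: the previous period is t - d, but only if it was observed
    ifelse((t - d) %in% t, t - d, NA)
  }
}

periods <- c(1, 3)

# Ordinal: the period before 3 is 1
previous_period(periods, d = 0)
#> [1] NA  1

# Cardinal with d = 1: the period before 3 would be 2, which is missing
previous_period(periods, d = 1)
#> [1] NA NA

# Cardinal with d = 1: two periods before 3 is 1, which is observed
(3 - 2 * 1) %in% periods
#> [1] TRUE
```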

If you have a date variable that is not in Date format (perhaps it's a string) and would like to use one of the Date-reliant methods below, I recommend converting it to Date using the convenient ymd(), mdy(), etc. functions from the lubridate package. If you only have partial date information (e.g. only year and month) and so converting to a Date doesn't work, see the .datepos argument.

Methods available include:

.method="present" will assume that, even if each individual may have some missing periods, each period is present in your data *somewhere*, and so simply numbers, in order, all the time periods observed in the data.
.method="year" can be used with a single Date/POSIX/etc.-type variable (anything that allows lubridate::date()) and will extract the year from it. Or, use it with a character or numeric variable and indicate with .datepos the character/digit positions that hold the year in YY or YYYY format. If combined with .breaks or .skip, will instead set the earliest year in the data to 1 rather than returning the actual year.
.method="month" can be used with a single Date/POSIX/etc.-type variable (anything that allows lubridate::date()). It will give the earliest-observed month in the data set a value of 1, and will increment from there. Or, use it with a character or numeric variable and indicate with .datepos the character/digit positions that hold the year and month in YYMM or YYYYMM format (note that if your variable is in MMYYYY format, for example, you can just give a .datepos argument like c(3:6,1:2)). Months turn over on the .start day of the month, which is by default 1.
.method="week" can be used with a single Date/POSIX/etc.-type variable (anything that allows lubridate::date()). It will give the earliest-observed week in the data set a value of 1, and will increment from there. Weeks turn over on the .start day, which is by default 1 (Monday). Note that this method always starts weeks on the same day of the week, which is different from standard lubridate procedure of counting sets of 7 days starting from January 1.
.method="day" can be used with a single Date/POSIX/etc.-type variable (anything that allows lubridate::date()). It will give the earliest-observed day in the data set a value of 1, and increment from there. Or, use it with a character or numeric variable and indicate with .datepos the character/digit positions that hold the year, month, and day in YYMMDD or YYYYMMDD format. To skip certain days of the week, such as weekends, use the .skip option.
.method="turnover" can be used when you pass more than one variable in ... and they are all nonnegative integers. Set the .turnover option to indicate the highest value each variable takes before it starts over, and set .turnover_start to indicate what value it takes when it starts over. Cannot be combined with .skip or .breaks. Doesn't work with any variable for which the turnover values change, i.e. it doesn't play well with days-in-month - if you'd like to do something like year-month-day-hour, I recommend running .method="day" once with just the year-month-day variable, and then taking the result and combining *that* with hour in .method="turnover".
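
The arithmetic behind several of these methods can be sketched in plain base R. The snippet below is only an illustration under stated assumptions and is not pmdplyr's internal code: the variable names are made up, numbering is assumed to start at 1 for the earliest period, and the weekend example assumes that dates falling on a skipped weekday receive a missing value.

```r
# Illustrative base R only; this is not pmdplyr's internal code.

# "present": number the distinct observed values in sorted order
t_raw <- c(2010, 2013, 2010, 2020)
present_id <- match(t_raw, sort(unique(t_raw)))
present_id
#> [1] 1 2 1 3

# "day" with weekends skipped: number only the Monday-Friday dates between the
# first and last observation, so Monday directly follows Friday
dates <- as.Date(c("2024-01-05", "2024-01-06", "2024-01-08")) # Fri, Sat, Mon
all_days <- seq(min(dates), max(dates), by = "day")
day_of_week <- as.integer(format(all_days, "%u")) # 1 = Monday, 7 = Sunday
kept_days <- all_days[!(day_of_week %in% c(6, 7))]
weekday_id <- match(dates, kept_days)
weekday_id
#> [1]  1 NA  2

# "turnover": combine year and month (months run 1 to 12) into one counter,
# then rebase so the earliest observed period is 1
year <- c(2019, 2019, 2020, 2020)
month <- c(11, 12, 1, 3)
raw_index <- (year - min(year)) * 12 + (month - 1)
ym_id <- raw_index - min(raw_index) + 1
ym_id
#> [1] 1 2 3 5

# Breakpoints at months 1, 4, 7, 10 clump months into quarters
quarter <- findInterval(month, c(1, 4, 7, 10))
quarter
#> [1] 4 4 1 1
```

Note that the year-month counter jumps from 3 to 5 because February 2020 is absent from the toy data. That gap is what makes the result a cardinal time variable, whereas the "present"-style numbering closes such gaps.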

Examples


data(SPrail)

# Since we have a date variable, we can easily create integers that increment for each
# year, or for each month, etc.
# Likely we'd only really need one of these four, depending on our purposes
SPrail <- SPrail %>%
  dplyr::mutate(
    year_time_id = time_variable(insert_date, .method = "year"),
    month_time_id = time_variable(insert_date, .method = "month"),
    week_time_id = time_variable(insert_date, .method = "week"),
    day_time_id = time_variable(insert_date, .method = "day")
  )

# Perhaps I'd like quarterly data
# (although in this case there are only two months, not much variation there)
SPrail <- SPrail %>%
  dplyr::mutate(quarter_time_id = time_variable(insert_date,
    .method = "month",
    .breaks = c(1, 4, 7, 10)
  ))
table(SPrail$month_time_id, SPrail$quarter_time_id)
#>    
#>        1
#>   1 1633
#>   2  367

# Maybe I'd like Monday to come immediately after Friday!
SPrail <- SPrail %>%
  dplyr::mutate(weekday_id = time_variable(insert_date,
    .method = "day",
    .skip = c(6, 7)
  ))
#> Warning: .skip includes some days of week that are present in the data.
#> Observations in these weekdays will be given a missing time value.

# Perhaps I'm interested in ANY time period in the data and just want to enumerate them in order
SPrail <- SPrail %>%
  dplyr::mutate(any_present_time_id = time_variable(insert_date,
    .method = "present"
  ))


# Maybe instead of being given a nice time variable, I was given it in string form
SPrail <- SPrail %>% dplyr::mutate(time_string = as.character(insert_date))
# As long as the character positions are consistent we can still use it
SPrail <- SPrail %>%
  dplyr::mutate(day_from_string_id = time_variable(time_string,
    .method = "day",
    .datepos = c(3, 4, 6, 7, 9, 10)
  ))
# Results are identical
cor(SPrail$day_time_id, SPrail$day_from_string_id)
#> [1] 1


# Or, maybe instead of being given a nice time variable, we have separate year and month variables
SPrail <- SPrail %>%
  dplyr::mutate(
    year = lubridate::year(insert_date),
    month = lubridate::month(insert_date)
  )
# We can use the turnover method to tell it that there are 12 months in a year,
# and get an integer year-month variable
SPrail <- SPrail %>%
  dplyr::mutate(month_from_two_vars_id = time_variable(year, month,
    .method = "turnover",
    .turnover = c(NA, 12)
  ))
# Results are identical
cor(SPrail$month_time_id, SPrail$month_from_two_vars_id)
#> [1] 1

# I could also use turnover to make the data hourly.
# Note that I'm using the day variable from earlier to avoid having
# to specify when day turns over (since that could be 28, 30, or 31)
SPrail <- SPrail %>%
  dplyr::mutate(hour_id = time_variable(day_time_id, lubridate::hour(insert_date),
    .method = "turnover",
    .turnover = c(NA, 23),
    .turnover_start = c(NA, 0)
  ))
# This could be easily extended to make the data by-minute, by-second, etc.