```r
# Make sure these are all installed
# Data manipulation and graphing
library(tidyverse)
# Handling dates
library(lubridate)
# Reading Excel files
library(readxl)
# Looking at the data
library(vtable)
```
We begin by loading the raw social-distancing data, aggregated to the county-date level, which I've stored in the file `Social_Distancing_From_Feb1.Rdata`.
```r
load('Social_Distancing_From_Feb1.Rdata')
# Here's what's in the data
vtable(compiled_data, lush = TRUE)
```
| Name | Class | Values | Missing | Summary |
|---|---|---|---|---|
| state | character | ‘1’ ‘10’ ‘11’ ‘12’ ‘13’ and more | 0 | nuniq: 56 |
| county | character | ‘001’ ‘003’ ‘005’ ‘006’ ‘007’ and more | 0 | nuniq: 328 |
| device_count | integer | Num: 5 to 412936 | 0 | mean: 5834.198 sd: 15770.937 nuniq: 27358 |
| completely_home_device_count | integer | Num: 1 to 152367 | 0 | mean: 1484.269 sd: 4310.209 nuniq: 13005 |
| Date | Date | | 0 | mean: 18321.5033671113 sd: 16.74 nuniq: 58 |
Next, we bring in population by county so that we can adjust for different sampling rates. I have population by county in 2019 from the American Community Survey (ACS) in the file `county_pop.xlsx`.
```r
# 2019 population by state/county AND also our link for state and county names
ctypop <- read_excel('county_pop.xlsx') %>%
  # county = 0 indicates "the state" as a whole, and we want counties
  filter(county != 0) %>%
  select(state, county, statename, countyname, popcty2019)
```
Now I just merge the two together so I can add the `countyname` and `popcty2019` variables. Note that in my file, `compiled_data` stores the `state` and `county` indicators as strings, so we need to convert them to numeric first.
```r
compiled_data <- compiled_data %>%
  mutate(state = as.numeric(state),
         county = as.numeric(county)) %>%
  left_join(ctypop) %>%
  # I want to focus on the main 50 states, not e.g. Guam
  filter(state <= 56) %>%
  # Make sure there weren't any blank rows that snuck in
  filter(!is.na(state), !is.na(county))
```
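As an optional sanity check, something along these lines will tell you whether any rows failed to find a matching county in the population file:

```r
# Count rows where the population merge didn't find a match (ideally zero)
sum(is.na(compiled_data$popcty2019))
```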
We want to adjust the device counts in the data for population size. We follow the steps here.
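In notation, the adjustment factor computed below for county \(c\) in state \(s\) on a given date is

\[
\text{adjust\_factor}_c = \frac{\text{popcty2019}_c}{\sum_{c' \in s} \text{popcty2019}_{c'}} \times \frac{\sum_{c' \in s} \text{device\_count}_{c'}}{\text{device\_count}_c},
\]

which scales down counties that are over-represented in the device sample relative to their population share, and scales up counties that are under-represented.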
```r
compiled_data <- compiled_data %>%
  group_by(state, Date) %>%
  # Get full population by state in each time period, and full in-sample count by state
  mutate(state_pop = sum(popcty2019, na.rm = TRUE),
         sample_pop = sum(device_count, na.rm = TRUE)) %>%
  ungroup() %>%
  # Now adjust the device count and the home device count
  mutate(adjust_factor = (popcty2019/state_pop)*(sample_pop/device_count)) %>%
  mutate(adj_device_count = device_count*adjust_factor,
         adj_home_count = completely_home_device_count*adjust_factor)
```
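If you want to verify the adjustment, note that because the population shares within a state sum to one, the adjusted device counts should (up to counties with missing population) add back up to the state's in-sample total on each date. A quick check along those lines:

```r
# Within each state-date, adjusted device counts should sum back to sample_pop
compiled_data %>%
  group_by(state, Date) %>%
  summarize(total_adj = sum(adj_device_count, na.rm = TRUE),
            sample_pop = first(sample_pop),
            .groups = 'drop') %>%
  head()
```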
Now we use a hierarchical Bayesian model to adjust the values. This will largely affect counties with small samples, since they will be noisily estimated. The Bayesian model pulls them towards the means of their states, assuming that the states have information useful for estimating prevalence in those small areas. We follow the steps here.
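Concretely, within each state-date cell the code below takes \(\hat\theta\) to be the mean county stay-home rate and \(\widehat{\mathrm{Var}}(\hat\theta)\) to be the sample variance of those rates divided by the number of counties, and then backs out the beta parameters with the method-of-moments relations

\[
\alpha + \beta = \frac{\hat\theta(1-\hat\theta)}{\widehat{\mathrm{Var}}(\hat\theta)} - 1, \qquad \alpha = (\alpha+\beta)\,\hat\theta, \qquad \beta = (\alpha+\beta)\,(1-\hat\theta).
\]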
```r
# First, estimate the hyperparameter theta (the mean stay-home rate in each STATE)
# As well as our estimate of the variance of theta
compiled_data <- compiled_data %>%
  group_by(state, Date) %>%
  mutate(expected_theta = mean(adj_home_count/adj_device_count),
         var_theta = var(adj_home_count/adj_device_count)/n()) %>%
  # Now we can estimate the parameters of the beta distribution
  # (these are the standard method-of-moments formulas for a beta distribution)
  mutate(alpha_plus_beta = (expected_theta*(1-expected_theta)/var_theta)-1) %>%
  mutate(alpha = alpha_plus_beta*expected_theta,
         beta = alpha_plus_beta*(1-expected_theta)) %>%
  ungroup()
```
What we have now are our state-level prior \(\alpha\) and \(\beta\) values for our beta distribution.
We will take these \(\alpha\) and \(\beta\) values as our starting place. Then, using the results we get for each county, we update these priors into posteriors. For counties with a lot of observations, the posterior will move a long way, swamping the prior. For counties with few observations, it won't move as much, and the state-level information will help inform those values.
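Treating each county's completely-home count as (roughly) binomial with a Beta\((\alpha,\beta)\) prior on its stay-home rate, the standard conjugate update used below is

\[
\alpha_{\text{post}} = \alpha + \text{adj\_home\_count}, \qquad \beta_{\text{post}} = \beta + (\text{adj\_device\_count} - \text{adj\_home\_count}),
\]

and the reported estimate is the posterior mean, \(100 \times \alpha_{\text{post}} / (\alpha_{\text{post}} + \beta_{\text{post}})\).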
```r
compiled_data <- compiled_data %>%
  mutate(posterior_alpha = alpha + adj_home_count,
         posterior_beta = beta + (adj_device_count - adj_home_count)) %>%
  # Finally, estimate the mean of the beta distribution
  mutate(mean_estimate = 100*posterior_alpha/(posterior_alpha+posterior_beta))
```
How different does this make things?
```r
ggplot(compiled_data, aes(x = 100*adj_home_count/adj_device_count, y = mean_estimate)) +
  geom_point() +
  labs(x = 'Original Estimate',
       y = 'Bayes-Adjusted Estimate') +
  coord_fixed() +
  theme_minimal()
```
You can see there are a lot of original estimates on the edges (far left or right) that end up closer to the middle (neither top nor bottom), being “pulled in” to the mean by the Bayes estimator. These are typically smaller counties for which we have fewer observations.
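One rough way to see this is to compare how far the Bayes adjustment moves the estimate for small versus large samples; the 100-device cutoff here is arbitrary and purely illustrative:

```r
# Average absolute shift between raw and Bayes-adjusted rates, by sample size
compiled_data %>%
  mutate(raw_rate = 100*adj_home_count/adj_device_count) %>%
  group_by(small_sample = device_count < 100) %>%
  summarize(mean_abs_shift = mean(abs(mean_estimate - raw_rate), na.rm = TRUE))
```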
Now we have our daily estimate. We can finalize it by smoothing things out (and handling weekend patterns), taking the mean over the past seven days.
```r
# Function to calculate a seven-day moving average
ma <- function(x, n = 7) {
  as.numeric(stats::filter(as.ts(x), rep(1/n, n), sides = 1))
}

compiled_data <- compiled_data %>%
  # Make sure it's grouped by county and in date order
  group_by(state, county) %>%
  # There is one county that only shows up for a few days; we can't use it
  filter(n() > 7) %>%
  arrange(Date) %>%
  mutate(mean_estimate_ma = ma(mean_estimate)) %>%
  ungroup()
```
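To see what the moving-average helper is doing, here is a toy call; the first six values are NA because there aren't yet seven days to average over:

```r
# The seventh element is the mean of elements 1 through 7, and so on
ma(1:10)
#> [1] NA NA NA NA NA NA  4  5  6  7
```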
We can turn our result into an index by taking it all relative to a certain day. We will be subtracting out whatever the value is on February 12.
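In other words, for county \(c\) on date \(t\),

\[
\text{sd\_index}_{c,t} = \text{mean\_estimate\_ma}_{c,t} - \text{mean\_estimate\_ma}_{c,\text{Feb 12}},
\]

so each county sits at zero on February 12 and the index tracks changes from that baseline.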
```r
compiled_data <- compiled_data %>%
  # Put Feb 12 at the top of the data so we can refer to it with first()
  group_by(state, county) %>%
  arrange(-(Date == ymd('2020-02-12'))) %>%
  mutate(sd_index = mean_estimate_ma - first(mean_estimate_ma)) %>%
  # And put back in order
  ungroup() %>%
  arrange(state, county, Date)
```
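To see the finished product, you can plot the index for a single county. The FIPS codes below (state 53, county 33, i.e. King County, WA) are just an example and assume that county appears in the sample:

```r
# Plot the social distancing index over time for one example county
compiled_data %>%
  filter(state == 53, county == 33) %>%
  ggplot(aes(x = Date, y = sd_index)) +
  geom_line() +
  labs(x = 'Date', y = 'Social Distancing Index',
       title = 'Social Distancing Index: King County, WA') +
  theme_minimal()
```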