One difficult thing about SafeGraph data is that while it’s very easy to use it to calculate relative changes in foot traffic to an area or POI, it’s not clear how we can use it to calculate absolute values of visits, since (1) we only have a subsample of the population, (2) we count devices and visits, not people, and people can own multiple devices and make multiple visits, and (3) we don’t know if the people who select into the sample are more/less likely to visit places (or certain places!). Among other things.
The first-stab pass at fixing this problem has been to simply scale a measure of the overall sample up to the size of the population, for example multiplying any given visit number by \(Population/SGSample\), where \(SGSample\) is measured at the national/state/county/CBG/what-have-you level. \(SGSample\) could be a count of the total number of visits (by aggregating across patterns or using `normalization-data`), the total number of devices in sample (in `normalization-data`), or the number of devices residing in an area (`home-panel-summary`), those being the most commonly recommended adjustment measures, among a few others.
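I won't be showing my actual analysis code in this doc, but the scaling idea itself is simple enough to sketch in a few lines. All the numbers here are invented placeholders, not actual SafeGraph panel figures:

```python
# Minimal sketch of Population/SGSample scaling. The population and panel
# numbers below are invented placeholders, not actual SafeGraph figures.
def scale_to_population(raw_sg_count, population, sg_sample):
    """Scale a raw SafeGraph count by the Population/SGSample ratio."""
    adjustment = population / sg_sample
    return raw_sg_count * adjustment

# e.g. 30 raw daily visits, 329 million people, 20 million devices seen
print(round(scale_to_population(30, 329_000_000, 20_000_000), 1))  # 493.5
```

The whole question of this document is what to plug in for `sg_sample`, and at what level of aggregation.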
There has already been some success applying this kind of scaling to translate SafeGraph visitor flows into population-level mobility flows in a way that at least correlates well with other models of mobility flows.
In this document I will take some real-world locations for which we have a decent idea of what the actual visitor count is from other data, and see how well different scaling methods work to make the SafeGraph counts match the other-source counts.
I won’t be displaying the code I used in this doc, but you can look at all the files in the GitHub repo.
The first set of real-world locations I’ll check is Starbucks, specifically all Starbucks locations from September-December 2019 (why this period? Because I was already getting NFL data, see the next section).
I have found evidence from a few places, for example here, that in 2019 the number of daily customers at a Starbucks was in the range of 460-500 per day. Averaging across all Starbucks locations, we can see what adjustment we need to do to get around there.
Pros of the Starbucks analysis:
Cons of the Starbucks analysis:
The second set of real-world locations I’m using is based on the 2019 NFL schedule. We have attendance counts for each game in the season from Pro-Football-Reference, as well as the date and location each game was played. We can compare the attendance figures to the SafeGraph visits to the relevant stadiums.
Pros of the NFL analysis:
Cons of the NFL analysis:
`normalization-data`. So we won't be able to check that.

Let's start by taking the average raw visits to Starbucks locations in the data. I'll drop any location/day with zero visitors, as that's likely a day the location is closed. The average here will give us a sense of what our multiplier should be, at least on average. Remember that our goal is something like 460-500, then adjusted upwards for staff and non-customer visitors.
Visits are relatively flat, with some within-week variation, which is a relief; if there were some big trend here we'd probably be worried about only using a quarter of the year rather than the whole thing.
We get big dips at what looks like Thanksgiving and Christmas, so a target of 460-500 doesn’t seem appropriate for those days. After dropping them we get a new grand mean of 28.6, meaning that we are looking to scale up by at least 16.1 to even hit 460.
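As a toy sketch of that cleaning step, assuming a made-up pandas data frame (the real data and column names differ):

```python
import pandas as pd

# Toy sketch of the cleaning: drop zero-visit location/days (likely closed)
# and the holiday dips before averaging. Values are illustrative only.
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-11-27', '2019-11-28', '2019-11-29', '2019-11-30']),
    'visits': [30, 5, 0, 29],
})
df = df[df['visits'] > 0]                      # closed days
holidays = pd.to_datetime(['2019-11-28', '2019-12-25'])
df = df[~df['date'].isin(holidays)]            # Thanksgiving / Christmas
grand_mean = df['visits'].mean()
min_multiplier = 460 / grand_mean              # scale needed to even hit 460
print(grand_mean, round(min_multiplier, 1))    # 29.5 15.6
```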
So how can we scale?
First, let's try just using the national normalization data and see what we can get. Our options in the file are `total_visits`, `total_devices_seen`, `total_home_visits`, and `total_home_visitors`, all of which vary daily. For each of these, we'll take the rounded 2019 US population estimate of 329 million (from here) and divide it by the SafeGraph number to get our adjustment factor, and then multiply that adjustment factor by our raw SafeGraph count. The horizontal line represents the minimal target of 460.
| Adjustment Factor | Grand Mean |
|---|---|
| `total_visits` | 150.8 |
| `total_devices_seen` | 541.0 |
| `total_home_visits` | 392.0 |
| `total_home_visitors` | 703.2 |
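The arithmetic behind a row of that table looks like this. The `total_devices_seen` value here is invented (the real one varies daily and comes from the normalization file), so the output just illustrates the calculation rather than reproducing the table:

```python
# Sketch of the national normalization adjustment. US_POP_2019 matches the
# 329 million used above; the total_devices_seen value is invented, so this
# illustrates the arithmetic rather than reproducing the table exactly.
US_POP_2019 = 329_000_000
total_devices_seen = 17_400_000            # hypothetical daily panel size
adjustment = US_POP_2019 / total_devices_seen
adjusted_mean = 28.6 * adjustment          # 28.6 = raw Starbucks grand mean
print(round(adjusted_mean, 1))             # 540.8
```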
This looks pretty good! The preferred adjustment factor of `total_devices_seen` gives us 541.0, which by my intuition is probably a little low, even at the bottom target of 460, once you factor in non-customers. It's probably not a big undershot, though. Somewhere between that and `total_home_visitors` is more what I would guess is the actual right number. But that's a guess on my part - this doesn't look too bad.
How about the individual Starbucks locations under this adjustment? Now, surely, each Starbucks actually gets a different amount of traffic - some are busier or bigger than others. But the distribution probably isn't wildly wide. How wide does it get? This time, instead of averaging all locations for each day, I average each of the 11,357 locations across all days and plot the distribution.
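That per-location averaging step looks something like this in pandas (toy data; the IDs and values are invented, with the `s3` rows standing in for a very busy location):

```python
import pandas as pd

# Sketch of per-location averaging: average each location across all its
# open days, then look at the distribution of those per-location means.
df = pd.DataFrame({
    'location_id': ['s1', 's1', 's2', 's2', 's3', 's3'],
    'adjusted_visits': [480.0, 520.0, 300.0, 340.0, 9000.0, 11000.0],
})
per_location = df.groupby('location_id')['adjusted_visits'].mean()
print(per_location.tolist())  # [500.0, 320.0, 10000.0]
```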
Them's some tails! Now, there are some Starbucks locations that are likely way busier than others: the roasteries, the spot I'm guessing they have in like Central Station or Disneyland or something. But is it realistic that there are Starbucks locations with ~15k visitors a day? Maybe! I've been to the one in Disneyland, it's jammed all day every day. But it seems likely that those very-busy spots, which I'm going to guess are indeed at the top of the pile here, are being overcounted.
Perhaps the summary file normalizations will do better, since they can operate on a more local level?
Now let's take the panel summary data and try to use `number_devices_residing`, along with information about CBG and county population, to fix things up! We'll also try at the state and national level.
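The shape of this adjustment, sketched with invented county names and numbers (the real `number_devices_residing` comes from the home panel summary file):

```python
import pandas as pd

# Sketch of the home-panel-summary adjustment: each area gets its own
# Population/number_devices_residing factor. Counties and numbers invented.
panel = pd.DataFrame({
    'county': ['A', 'B'],
    'number_devices_residing': [50_000, 80_000],
    'population': [900_000, 1_300_000],
})
panel['adjustment'] = panel['population'] / panel['number_devices_residing']

visits = pd.DataFrame({'county': ['A', 'B'], 'raw_visits': [25, 31]})
visits = visits.merge(panel[['county', 'adjustment']], on='county')
visits['adjusted'] = visits['raw_visits'] * visits['adjustment']
print(visits['adjusted'].round(1).tolist())  # [450.0, 503.8]
```

The same pattern applies at the CBG, state, or national level; only the grouping variable changes.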
These all work fairly well, with the CBG adjustment doing the worst, despite Starbucks being local-customer businesses in many cases. The other three levels of aggregation all work just about the same. Interestingly, this implies a similar (although certainly not identical) portion of the population being sampled in different areas. But in any case, this adjustment is not quite as successful as you get with the normalization file. It doesn’t adjust upwards as much, which is what we need.
Let's do one more CBG-level adjustment. For this one, we look at the weekly `visitor_home_cbgs` variable, which records the number of visitors to each POI from each CBG. Then, we scale those visitors up to the population level of their origin CBGs. Since this is a measure of the number of visitors from each CBG, we scale that again by the ratio of overall visits to visitors (`raw_visit_counts/raw_visitor_counts`) for that POI. Then for each POI we add up the adjusted origin-CBG numbers across all the origin CBGs to get our POI-week-specific adjustment factor. Since this is a weekly measure, we take the proportion of visits in a given week that come on a given day and multiply that by the adjustment factor. More complex, but it manages to smooth things out a bit better. So how does it do?
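Since this one has several steps, here's a sketch for a single POI-week with invented CBG figures:

```python
import pandas as pd

# Sketch of the visitor_home_cbgs adjustment for one POI-week. Steps:
# (1) scale each origin CBG's visitors up to that CBG's population,
# (2) convert visitors to visits with raw_visit_counts/raw_visitor_counts,
# (3) sum across origin CBGs for the weekly POI total,
# (4) spread the weekly total across days by each day's share of visits.
origins = pd.DataFrame({
    'cbg': ['c1', 'c2'],
    'visitors': [10, 4],              # SG visitors from each origin CBG
    'cbg_devices': [120, 60],         # devices residing in each CBG
    'cbg_population': [1500, 900],
})
raw_visit_counts, raw_visitor_counts = 70, 50   # POI-week totals

origins['scaled'] = (origins['visitors']
                     * origins['cbg_population'] / origins['cbg_devices'])
weekly_adjusted = origins['scaled'].sum() * (raw_visit_counts / raw_visitor_counts)

day_share = 12 / raw_visit_counts     # e.g. 12 of this week's 70 visits today
daily_adjusted = weekly_adjusted * day_share
print(round(weekly_adjusted, 1), round(daily_adjusted, 1))  # 259.0 44.4
```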
Works okay! At least, it is an improvement on the CBG adjustment alone, getting a bit closer to the 500+ we want. But it is less of an adjustment than for the county, state, or national. In this case, at least, we might prefer going with those.
As just a quick check before we try adjusting stuff, let’s make sure we’re properly aligning the data. Could we figure out which days were game days without having the schedule?
It looks like it works pretty well. While there are certainly active days at these stadiums that aren’t game days (and are likely other events), we see that game days in general are much higher than your average day, as you’d expect. How about those few little blue dots nestled down at the bottom? What are those?
| Stadium Name | Date | Official Count | SafeGraph Count |
|---|---|---|---|
| FirstEnergy Stadium | 2019-09-08 | 67431 | 1 |
| FirstEnergy Stadium | 2019-09-22 | 67431 | 7 |
| FirstEnergy Stadium | 2019-11-10 | 67431 | 1 |
| FirstEnergy Stadium | 2019-11-14 | 67431 | 7 |
| FirstEnergy Stadium | 2019-11-24 | 67431 | 2 |
| FirstEnergy Stadium | 2019-12-08 | 67431 | 1 |
| FirstEnergy Stadium | 2019-12-22 | 67431 | 1 |
| SoFi Stadium | 2019-09-08 | 25363 | 0 |
| SoFi Stadium | 2019-09-15 | 71460 | 0 |
| SoFi Stadium | 2019-09-22 | 25349 | 5 |
| SoFi Stadium | 2019-09-29 | 68117 | 2 |
| SoFi Stadium | 2019-10-06 | 25357 | 0 |
| SoFi Stadium | 2019-10-13 | 25425 | 0 |
| SoFi Stadium | 2019-10-13 | 75695 | 0 |
| SoFi Stadium | 2019-10-27 | 83720 | 3 |
| SoFi Stadium | 2019-11-03 | 25435 | 8 |
| SoFi Stadium | 2019-11-17 | 70758 | 1 |
| SoFi Stadium | 2019-11-18 | 76252 | 15 |
| SoFi Stadium | 2019-11-25 | 72409 | 16 |
| SoFi Stadium | 2019-12-08 | 71501 | 2 |
| SoFi Stadium | 2019-12-15 | 25446 | 3 |
| SoFi Stadium | 2019-12-22 | 25380 | 5 |
| SoFi Stadium | 2019-12-29 | 68665 | 0 |
| Soldier Field | 2019-10-20 | 62306 | 0 |
Well that ain't right. Checking the raw data, it's not that the dates of the games are wrong; we just have very few SafeGraph visits to those POIs. We also notice the rather odd phenomenon that FirstEnergy Stadium appears to have the exact same attendance for every game, which is how it appears in the data I downloaded - probably best to get rid of that one anyway. Let's drop these two stadiums going forward.
With that out of the way, we have 151 games to look at. Let’s start by seeing what adjustment factors we’d want to have, i.e. by just comparing the raw counts, under the simplifying assumption that there are no staff at the games.
We see that a straight line seems to fit the data okay, nearly matching the nonparametric LOESS fit, which is a relief, as it suggests that the adjustment needs to be fairly straightforward. We also see that the fit is weaker for smaller events, which is a potential concern given that most things we'd want to estimate absolute visit counts for are much smaller than NFL games. This is also exactly what we'd expect, though: smaller events, more noise. So no big surprise there. In general we should keep in mind that any absolute count for a smaller event/location is likely to be noisier.
Ideally, the intercept on that OLS fit is 0, allowing us to just multiply by a number. Is it? Let’s also see what we get with a polynomial fit, and a logarithmic fit since it sorta looks like we have a curve in the graph but it’s not clear whether it’s that or just the heteroskedasticity.
The dependent variable in all three models is Attendance Counts.

| | (1) | (2) | (3) |
|---|---|---|---|
| SafeGraph Visits | 3.07 *** | 0.31 | |
| | (0.28) | (0.97) | |
| SG Visits Squared | | 0.00 ** | |
| | | (0.00) | |
| SG Visits Logged | | | 10940.58 *** |
| | | | (1331.80) |
| Intercept | 53219.11 *** | 59368.80 *** | -23771.32 * |
| | (1415.84) | (2497.47) | (11095.23) |
| N | 151 | 151 | 151 |
| R2 | 0.44 | 0.47 | 0.31 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
All of these columns have a clearly nonzero intercept, which is a problem. We can make a decent guess at a multiplicative factor by looking at population ratios, but if we didn’t know what that intercept was we’d have no real way of guessing it.
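My actual fitting code lives in the repo; as an illustration of the intercept problem, here's a sketch of the same three specifications run on invented data that has a nonzero intercept built in by construction:

```python
import numpy as np

# Invented-data sketch of the three fits (linear, quadratic, logarithmic).
# The true intercept is 50,000 by construction, so the fitted intercept
# comes out far from zero, mirroring the table's problem.
rng = np.random.default_rng(0)
sg = rng.uniform(100, 10_000, size=151)               # "SafeGraph visits"
attendance = 50_000 + 3.0 * sg + rng.normal(0, 10_000, size=151)

lin = np.polyfit(sg, attendance, 1)                   # [slope, intercept]
quad = np.polyfit(sg, attendance, 2)                  # [sq, slope, intercept]
log = np.polyfit(np.log(sg), attendance, 1)
print(round(lin[1]))   # fitted intercept, nowhere near zero
```

Multiplying such data by a constant factor forces the fit through \((0, 0)\), which is exactly why simple scaling struggles here.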
Regardless, let’s forge ahead and see what we can do.
First, let's try just using the national normalization data and see what we can get. Our options in the file are `total_visits`, `total_devices_seen`, `total_home_visits`, and `total_home_visitors`, all of which vary daily. For each of these, we'll take the rounded 2019 US population estimate of 329 million (from here) and divide it by the SafeGraph number to get our adjustment factor, and then multiply that adjustment factor by our raw SafeGraph count. Straight lines indicate the point where the adjusted values exactly match the attendance records.
All of these except `total_visits` do an okay job for smaller values but wildly understate large ones. `total_visits` has a decent slope to it but undercounts in general, which makes sense: we're dividing by visits, of which each person makes many, and multiplying by people, of which there's only one per person. In general we just have way too much variation in the SafeGraph visits relative to the attendance numbers.
I tried doing some Empirical Bayes shrinkage to pull in some of those large values, but it didn't really do anything. Also, to do this you have to have a group to shrink towards - here there are multiple events across stadiums and dates to shrink to, but that won't work in every application.
What if I just try a bunch of random adjustments until I find one that looks nice? I get this:
This seems to work okayish… `total_home_visits` looks good, but there's no real reason we'd expect to want to use that for scaling into population. `total_devices_seen`, generally considered the preferred measure anyway, actually looks pretty nice if you factor in the potential for staff being counted. But what is it? Basically, I transformed each SafeGraph observation by shrinking it very crudely towards the mean, which surprisingly "works" better than Empirical Bayes here.
If \(\hat{\mu}\) is the overall sample mean after adjusting for population and \(X_i\) is an individual observation after adjusting for population, the transformation is:
\[ Transformed = \hat{\mu} + (X_i-\hat{\mu})/\sqrt{17} \]
Where \(17\) is the number of weeks in the NFL season. How we might apply this to, say, an individual event is less clear. And the equation is mostly arbitrary - you might see a similar transformation elsewhere, but there's no real justification for using it here other than that I thought it might work and it did. Even the 17 is pretty arbitrary - we're analyzing things at the level of the set of all games, so why 17 and not the total number of games? Because 17 works and 151 doesn't. If we're being honest, I'm just smushing a bunch of the variance out here, rather than following any particularly principled shrinkage method. But principled shrinkage shrinks too little!
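The transformation itself is tiny; here it is applied to some invented population-adjusted values:

```python
import numpy as np

# The crude shrink-toward-the-mean transform from the equation above.
# It preserves the mean and squeezes the spread by a factor of sqrt(k).
def shrink(x, k=17):
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    return mu + (x - mu) / np.sqrt(k)

x = np.array([50_000.0, 100_000.0, 250_000.0])   # invented adjusted values
shrunk = shrink(x)
print(np.round(shrunk).tolist())
```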
What we can get out of this is that SafeGraph counts for events have considerably more variation than actual counts, so you want some form of shrinkage if you have multiple events/target estimates. If you've only got one absolute value you're trying to estimate, you will probably want to pair it with other estimates of something similar that you can use to balance out the variation.
Thinking about it through another lens though, maybe “Shrinkage” isn’t the right way to think about it. This transformation actually makes the small events worse - they were already fine! The value of this is in bringing down the values of the huge events. There’s more of a “big-event” problem than a “too much variation” problem in the NFL games.
Now let's take the panel summary data and try to use `number_devices_residing`, along with information about county population, to fix things up! We'll also try at the state and national level.
This appears to give us almost exactly the same problem as with the `normalization-data` adjustments. Does the same weird fix work?
This works about the same. Arguably maybe better than `total_devices_seen`. Still an overstatement of amounts, and it relies on that fairly arbitrary 17 and the ability to shrink across a group of events.
Let's do one more CBG-level adjustment. For this one, we look at the weekly `visitor_home_cbgs` variable, which records the number of visitors to each POI from each CBG. Then, we scale those visitors up to the population level of their origin CBGs. Since this is a measure of the number of visitors from each CBG, we scale that again by the ratio of overall visits to visitors (`raw_visit_counts/raw_visitor_counts`) for that POI. Then for each POI we add up the adjusted origin-CBG numbers across all the origin CBGs to get our POI-week-specific adjustment factor. Since this is a weekly measure, we take the proportion of visits in a given week that come on a given day and multiply that by the adjustment factor. More complex, but it manages to smooth things out a bit better. So how does it do?
Doesn’t seem that different from the other options.
So what can we take away from this?
First, any time we do some sort of scaling to adjust a SafeGraph count upwards to try to get an actual count, we have to necessarily introduce some extra assumptions on top of what we’re already doing concerning the representativeness of the sample, the proportion of the population that is sampled, and sampling variation. If you can rephrase whatever question you’re trying to ask in terms of relative visits, rather than absolute visits (“visits to X grew by Y% from A to B” rather than “there were Z visitors to X on date B”), that’s going to be a safer and more reliable conclusion.
But if an absolute count is important, an absolute count is important. I get it.
Second, scaling to absolute numbers is possible and we can get some reasonable numbers from it without doing anything fancy. The Starbucks analysis shows that doing a very simple scaling based on \(Population/SGSample\), with \(SGSample\) measured using `total_devices_seen` or `number_devices_residing`, gives us a pretty reasonable number. Perfect? No, but pretty darn good.
Third, it really really matters what kind of thing you're trying to get absolute counts for. I was able to hit Starbucks fairly accurately - Starbucks is an everyday thing for which I was trying to match an average attendance count, and the number of visitors I was trying to hit was modest. So I was able to average over lots of Starbucks and lots of days at each Starbucks - this really helps smooth out noise. The fact that the \(Population/SGSample\) scaling worked pretty well (no matter how we did it, other than at the CBG level) tells us that the SafeGraph sample is at least roughly representative - we don't have major sample selection bias issues, or else the \(Population/SGSample\) ratio would not at all match the \(PopulationVisits/SGVisits\) ratio.
On the other hand, the NFL games are one-off events, so even absent any other issues we already know that these are going to be harder to hit. Estimating the attendance at a single event is way harder than estimating the average attendance at a bunch of events. That's statistics, baybee.
But beyond that…
Fourth, scale seems to matter a lot, and bigger events are more wrong than smaller events. In the Starbucks analysis, while the average was fine, the most popular Starbucks were way too popular in the SafeGraph data.
Similarly, the NFL games are all enormous, and the consistent problem here was overestimation. For the smallest NFL games, standard adjustment worked fine, just as was the case for the smallest Starbucks. For small NFL games, the absolute errors were still pretty big, but that’s just a consequence of sampling variation, can’t really get around that.
Large NFL games, like the largest Starbucks, were way overestimated by standard adjustment, with the biggest games (roughly 100k in measured attendance) getting adjusted SafeGraph estimates in the 200k-250k range. Unrealistic!
There are a few ways we could interpret this.
One is the problem implied by the linear regression table above. Perhaps the relationship is linear, but our inability to deal with the intercept when using standard scaling means that we’re relying on a scaling function that passes through \((0, 0)\), which perhaps it shouldn’t. This would imply a fix related to guessing some sort of intercept for big events - perhaps we get lucky and the intercept turns out to be similar for lots of big events and we can have a rule of thumb.
Another approach is to say that, contrary to what the unscaled NFL scatterplot implied, there’s a nonlinear relationship between attendance at an event and SafeGraph visit counts. As an event gets bigger, SafeGraph picks up a bigger portion of its attendees. I’m implying something like \(SGVisits=\alpha_0 ActualVisits + \alpha_1 ActualVisits^2\) where both \(\alpha_0\) and \(\alpha_1\) are positive.
A third approach is to say that the relationship between SafeGraph visits and actual attendance is linear, but that the population adjustment is the thing that’s wrong. i.e. because of how big the events/attractions are, we should be considering a different share of the SafeGraph sample to be the relevant part of it that we should use in the denominator.
Either of these last two could be accounted for by taking the adjustment factor \(Population/SGSample\) and running it through some function \(f()\) before you use it, where \(f()\) is some declining function of \(SGVisits\), but designed so that it doesn’t affect small events that much. This would have to be estimated over a wide range of data to get it right, though. Just to take one quick guess at it, let’s see if making the adjustment factor for game \(j\) be \(Adj_j = Population/(SGSample\times \log(2+SGVisits-\min(SGVisits)))\) works for NFL games with the panel summary file county sample estimation - I’m not too hopeful:
That actually doesn’t look too bad - the positioning of the cluster of points is way off but the shape is good (I suspect that big point out there on the right is actually the smallest game, which is an easy fix). Clearly an overcorrection, the function needs to have less sharp of a decline, but it’s a start. I’m not going to bother pinning down something perfect because it would just apply to NFL games anyway.
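For concreteness, the guessed dampened factor can be sketched as follows. The population, sample size, and visit counts here are all invented; the point is just that the adjustment declines as raw visits grow:

```python
import numpy as np

# Sketch of the guessed dampened adjustment factor from above:
#   Adj_j = Population / (SGSample * log(2 + SGVisits_j - min(SGVisits)))
# All numbers are invented placeholders.
population, sg_sample = 9_000_000, 500_000        # hypothetical county-level
sg_visits = np.array([120.0, 800.0, 2400.0])      # raw SG visits per game

damp = np.log(2 + sg_visits - sg_visits.min())
adj = population / (sg_sample * damp)
print(np.round(adj, 2).tolist())  # bigger games get smaller adjustments
```

Note the smallest game gets no dampening at all beyond the \(\log(2)\) floor, which is consistent with that one big outlying point in the plot.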
Scaling with `total_devices_seen` or `number_devices_residing` probably works just fine.

If using this information to justify the use of \(Population/SGSample\) scaling in an academic publication, please cite as
Huntington-Klein, Nick. 2020. “Calculating Absolute Visit Counts in SafeGraph Data”. Unpublished.