Some Thoughts: Overdebunked! Six Statistical Critiques That Don't Quite Work

Nick Huntington-Klein

Statistical results and data analyses are quite often wrong. Sometimes they’re wrong because of carelessness, sometimes they’re wrong even though we cared a lot because it’s just really hard to get them right, and other times they’re wrong on purpose. It shouldn’t shock anyone to hear this. This is the view that a lot of people have about statistics. It’s not a surprise to me that the best-selling statistics book of all time is all about how statistics can be wrong, the classic How to Lie with Statistics by Darrell Huff. I’m not sure whether Huff inspired all this skepticism or simply reflected it back at us (I suspect the latter), but regardless we know to be on the lookout for bad statistics.

Figure 1: How to Lie with Statistics

Skepticism about statistics is good. However, just as blind belief can lead you to believe things that aren’t true, an overabundance of skepticism can lead you to disbelieve things that are true. As author, podcaster, and modern be-careful-with-data industry member Tim Harford recounts, Darrell Huff took the data skepticism he was known for a little too far. He ended up getting paid by the tobacco industry to testify before congress about just how darn safe cigarettes were. After all, Huff pointed out that statistics can be misrepresented. Surely it follows that they always are. All those statistics showing that cigarettes are harmful for you can’t be trusted either, right?

So if we’re going to be skeptical of statistical results (and we should be), we have to be careful that we’ve got actual criticisms that make sense.

One interesting feature of spending time on social media is that you get to see all sorts of people make criticisms of statistical results. A lot of these criticisms don’t really make much sense. Or sometimes they’re legit criticisms in general that don’t really apply to the thing being criticized. Or they’ve stumbled on an actual problem but then assign way too much importance to it.

It seems from my end of things that most of these critiques are driven by a desire for the result they’re criticizing to be wrong, rather than actual concern for statistical technique. So maybe talking about the statistical side of things is pointless. But also some people are genuinely interested in whether the stats are right are not, and in any case talking about some of these misconceptions may be handy.

So, below, I’ve listed six statistical critiques I commonly see on social media, and why they’re not great critiques. This isn’t a comprehensive compendium of statistical mistakes, but rather things I’ve seen over and over reading the comments on data-containing posts about climate change, political polling, COVID, violence, foreign relations, etc. etc. etc.. These aren’t technical errors - they’re not about misinterpreting a \(p\)-value or whatever, but more about common-sense critiques of published statistical results that anyone could make. I can guarantee you that these mistakes aren’t limited to any one side of any given issue (in fact, some of the examples below are of people who almost certainly agree with me on what the actual truth is, even though I think they’ve made a mistake getting there). You, yes you, may even remember a reply you’ve made containing one of the things below.

1. Bias in Levels vs. Bias in Changes

Suppose you’ve got some statistic that is well-known to be biased. Perhaps it is consistently higher than “the truth” and perhaps it is consistently lower. The most obvious example I can give here is in the realm of political polling. Different pollsters use different methods - do they call landlines only? All phones? Include internet polling? Where do they advertise? How do they weight for differences in sampling across parts of the population? And so on. These decisions, intentionally or not, can lead to polls that consistently overstate support for one party and understate it for another, relative both to what other pollsters find (leading to polling “house effects”), or relative to what you see on election day.

So, if you see a poll in the United States from that one pollster who always finds higher support for the Republicans than other pollsters do, for example, you could reasonably say “hey, actual support for Republicans is probably a bit lower than that, since this pollster uses methods that tend to overstate their support.” That’s totally legit.

But what if you see a change in that value over time? Perhaps that one pollster that finds high numbers for Republicans used to say that 55% of voters were going to go Republican, but now it says that 65% are.

The critique “that increase can’t be right, because this pollster overstates the support for Republicans” no longer works! Their methods make the absolute number for support too high, sure, but unless they changed their methods recently, the ten percentage point increase still indicates an increase in support for Republicans.

Now, the change itself could also be wrong (everything could be wrong), but simply knowing that the level is too high doesn’t tell you that the change is too high. Without knowing more about why that pollster gets high Republican numbers, it’s not even a reason to think that their measurement of the change is more wrong than the changes other pollsters show. If you think the change is wrong, you gotta find a reason why you think the pollster’s numbers aren’t internally consistent relative to each other, not just that they’re wrong in their absolute level. Plenty of statistics have weird absolute values but at least go up and down at the times it seems like they should go up and down!

This incorrect critique pops up anywhere you see a number that goes up or down, and for which people have issues with how it’s measured, or worse, think the books are cooked. The unemployment rate, GDP, crime rates, and so on and so on. For all of these, as long as they’re consistently-produced estimates that have a few kinks in the process (as opposed to outright lies), then just because you think the value itself is over/underestimated doesn’t mean the change in that value is over/underestimated.

This critique sometimes also pops up, although less often in my experience, in cases where people don’t think the original value is wrong, just unusual. For example, perhaps student test scores are already known to be extremely low in a certain region in your country. Then, those test scores fall even further. A critique that their now extremely extremely low scores aren’t noteworthy because we already knew that region had bad scores is, at least, missing the point.

2. Demanding Infinite Detail

Collecting data is hard, and expensive, and in many cases accessing data is hard and expensive, but easier to do at the aggregate level. It’s far easier, for example, to find data on the US unemployment rate than it is to find data on the unemployment rate in my neighborhood (although I know Steve’s been without a job a few months…).

Because of this, you’ll often see data posted on a more aggregate level than would be ideal. Sometimes this means geographic aggregation, like seeing the US unemployment rate instead of the unemployment rate for my neighborhood. Often it’s aggregation across groups of people. Relatively few publicly-available statistics are available broken down by age, or gender, or race, and far fewer than that are broken down by all the different cross-tabulations of age and gender and race and education level and ancestral nationality and religion and and and…

A common critique I see is that an aggregate statistic is wrong or useless because it does not have a multitude of breakdowns into fractal levels of increasing detail.

Now, sometimes, a statistic is indeed useless because it’s at the wrong level of detail. If the hot topic of the day is whether immigration is affecting wages in Macon, Georgia, then a graph of wages going up in Georgia overall isn’t really going to say much. “The data would be too hard to get” may be true but it doesn’t make the original analysis right.

But the problem comes when the reverse argument is made. This is where someone has a graph showing wages going up in Georgia overall, and someone comments “Useless. Wages are going down in Macon. Why bother posting this without breaking out each city individually?”

Information for each individual city might be interesting to have. But often we don’t have it, and a demand for increasingly precise levels of detail can just be an excuse to ignore an overall trend. Further, one disappointing fact about statistics is that sample size really matters. If we did have data broken down into minute subsamples, the results for each of those subsamples would be worse. Noisier, with misleading results popping up in a few of the subsamples that distract from the more accurate overall picture. Often an aggregate statistic is simply a higher-quality statistic, even if it’s a less-detailed one. This is especially true with noisy and fast-moving data, or when the subsamples would be very small.

This entire brand of critique I see a lot in regards to COVID data. Do masks work? How effective are the vaccines? Whenever someone has data that can help with an answer, the comments are flooded with people saying that there’s no point looking at the data if it doesn’t break things out by age, or by eight different comorbidities, or political affiliation, or occupation, and so on and so on. Yes, maybe those things would be nice, and they’d let us answer a different question than the aggregate data would, but the aggregate data often does answer an interesting question - sure, maybe not the one you want, but it’s probably quite useful in its own right. Skipping the breakdown doesn’t make it wrong, and there’s a good chance it doesn’t make it useless either.¹

3. Asking for Double-Counting

A close cousin of the demand for infinite detail (#2 above) is the use of a subgroup to try to disprove a statistic. This, in effect, asks for the critique to be double-counted. I think of this commonly as coming in one of two flavors: the use of subgroups, and the use of caveats.

Sticking with COVID for subgroups, for example, suppose someone has a graph showing that COVID case rates are declining in Africa. Someone else might say “that can’t be right, cases are increasing in Egypt.” It might well be true that cases are increasing in Egypt. But that doesn’t mean cases aren’t declining in Africa, it just means that the increase in Egypt isn’t enough to offset the aggregate decline in the rest of the continent. Someone taking the critique further might say that the decrease in African cases should be adjusted to account for the Egyptian increase, and might even point out that doing so might lead the overall African trend to be upwards. This misses that the aggregate African trend already included Egypt, and so this adjustment would double-count Egypt.

we see another example of this with the recent debate over inflation. I’ve seen quite a lot of claims that inflation in the US must be higher than is being recorded because meat, or cars, are going up in price more quickly than the official inflation rate. I also see plenty of claims that inflation isn’t as bad as is being recorded because many different services are rising in price more slowly than the official inflation rate. Both of these miss the point that, while there are definitely valid criticisms of the way inflation is measured, inflation isn’t supposed to match the price increases for meat, or cars, or services individually, it’s supposed to record the price increase for all of them at once. If each of the individual price increases matched the aggregate statistic, that would be genuinely surprising, not a sign that it was working properly.

Taking this critique to the micro-level we have good ol’ anecdote vs. data. Just because you lost your job last week doesn’t mean that unemployment is rising, and just because your grandma smoked and lived to 104 doesn’t mean smoking’s not harmful. But you already knew this one, I’d imagine.

4. Misunderstanding Control Variables

“Controlling for” or “adjusting for” a variable is a deceptively tricky thing. The purpose of control variables is this: you can see the relationship between two things in the data, for example drinking more water and better overall health. You think these things might be related for several different reasons: (1) perhaps lots of water does make you healthier, or (2) perhaps athletes are more likely to drink lots of water, and athletes are just healthy anyway. The purpose of a control variable is to try to exclude one of those explanations. Looking at the relationship between water and health by itself could be either of those explanations, but looking at the relationship between water and health while controlling for being an athlete should leave you with just the “water makes you healthier” explanation of why they’re related (if those are actually the only two possible explanations, which seems iffy).

How can this be misapplied in critique? One common way is to criticize a result for not controlling for stuff even when it isn’t trying to pin down a specific explanation. Controls are about excluding certain reasons why we see something in the data. But if we don’t care why, we just care that we see it in the data, then we probably don’t want to control for anything. Climate change is one place where the distinction can be clear. One critique I often see from skeptics of climate change is that it doesn’t account for things like CO2 releases from volcanic eruptions. This critique makes conceptual sense when you’re talking about why temperatures are rising - if you want to pin down humans as the reason for warming, you want to control away the other potential reasons for a rise in temperature over time, like volcanic eruptions (surely the climate scientists have never thought to try this). But it makes no sense when I see it used as an argument as to why evidence of increasing temperatures is itself wrong. If temperatures are increasing, temperatures are increasing, regardless of what caused it.

Another misapplication of control variables in critiques is in demanding that the original analysis control away the actual thing we’re interested in. It’s like if a bitter fan of a losing sports team argues the best team wouldn’t be so great without their star players. Well, yeah… the star players are what make the good team into the good team. Doesn’t mean much to say the team would be worse without them. In online political debates this, generically, often takes the form of “policy X doesn’t really work, it only improved Y because it improved Z! Controlling for Z makes it obvious X was useless.” Sounds to me like policy X worked just fine - it improved Y by improving Z!

Another place this pops up is in the extremely online debate over the discriminatory gender wage gap. Should you look for the wage gap only within the same occupation (i.e. “controlling for occupation”)? One common explanation given for why there’s a discriminatory wage gap is that women are kept out of higher-paying jobs. If you control for occupation, you’re excluding that explanation by saying it doesn’t count as part of the discriminatory gap, although it should. So clearly we shouldn’t control for occupation in that case, then. Well… another explanation often given for the wage gap is that men and women prefer different occupations. If we don’t control for occupation, we’re saying that those choices are a part of the discriminatory gap, when perhaps they are not. Hmm, doesn’t seem to work either way. The question of how big the discriminatory wage gap is possible (and if you’re interested a pretty good place to start is with Claudia Goldin) but it requires some tools beyond just controlling or not controlling for a variable. Turns out the statistical approach to answer certain questions has to go beyond what can be contained within a Twitter yelling match.

5. Correlation Can’t be Causation

As we are all aware, “correlation is not causation.” Except, of course, that sometimes it is. If I jump off a building, there is a very strong negative correlation between the time since I jumped and my distance from the Earth, and I’m pretty sure it’s due to the mass of the Earth causing me to accelerate towards it. I can make this claim confidently even without randomizing myself into different jumping-off-a-building scenarios. More appropriately, the term should be “only some correlations are causal.”

Being skeptical of any statistical analysis that claims to have a causal interpretation is probably a good idea, especially if the basis of it is just some graph rather than anything more careful. Establishing a causal relationship is pretty hard, in fact I wrote a whole book about it, and too often the jump from correlational data to causal claim operates under the rigorous set of statistical procedures known as “wishful thinking.”

It may be the high hit rate of this critique that adds to it being applied where it shouldn’t. There are three main cases of this. The first I already covered in Misunderstanding Control Variables (#4): if the thing you’re critiquing isn’t trying to claim an explanation why we see what we see (causal claims are almost all forms of “why” statements), then pointing out that it’s a mere correlation is… well, they know that already.

The second case is when the causal relationship is super-duper obvious. My jumping off a building example is one case of this. In some highly convenient real-world scenarios, too, there’s really only one feasible explanation of why something happened. If you’re going to claim that a causal interpretation of the data is wrong, it really helps to actually have a realistic alternative way of explaining what’s going on. If you don’t have anything, then “just a correlation” might actually be “just a causal correlation.”

The third case is when the subject of discussion is an academic study, where the study itself is making a causal claim. Now, I’m not going to pretend that there aren’t plenty of academic studies making causal claims that are pretty weak. However, I also see plenty of critiques of causal claims from academic studies be dismissed out of hand because causal claims aren’t possible, or are only possible with randomized experiments (not true; again, I wrote a whole book about it). There are ways to deal with these problems, and it’s a good idea to at least check whether the study did deal with those problems. Often, the critiquer has thought of a pretty obvious reason that the correlation would be noncausal, except that it turns out the researchers also thought of the same thing and accounted for it.

6. Giving No Slack Whatsoever

We tend to me less critical of evidence that confirms what we already know, and want to hear, than evidence that would require us to shift our opinions. I am far from the first person to notice this. This phenomenon is by no means limited only to people talking about statistics, but statistics makes it rather unfortunate.

Why? Because statistical evidence is imperfect, and statistical evidence varies.

Statistical analysis is always imperfect and incomplete and, to some degree, wrong. Doing analysis requires making assumptions about where the data came from and how it fits together with the real world. Plus, those assumptions aren’t really ever going to be true. We’re just aiming to make assumptions that are close enough to accurate that they don’t mess us up. And people will disagree about what those assumptions should be.

And statistical evidence varies. We’re going to get different results from sample to sample, and from place to place. Even if we did an analysis that was perfect, if we do it again enough times in enough different places, we’ll eventually find a result that contradicts what we started with.

Statistical analysis is about getting close enough to the truth to be useful, not actually true, and getting results that are consistent enough to point in the right direction, not always the same every time.

Because of this, it’s trivial to find some flaw in a statistical analysis. And it’s often not difficult, at least if a topic is well-studied enough, to find some piece of contradictory evidence (even easier if you allow that contradictory evidence to be itself very weak or poorly done).

Moreso than other forms of evidence, it’s a good idea to ask of statistical evidence not just whether there is a flaw, but whether that flaw is important enough to make the evidence not worthwhile. Perfect evidence is never coming along. If you’re in the habit of taking any flaw in, or counterexample of, a piece of evidence you dislike as a reason to completely throw it out, you’re going to be able to do so. But do this enough times and you might find yourself throwing out a pretty strong mass of evidence, one piece at a time. This is doubly true if you’re making excuses for similarly-trivial flaws in analysis you like.

This doesn’t mean you have to accept results uncritically. But it does mean that, more than other forms of evidence, you want to take statistical evidence as a whole, not just one piece at a time. Your one counterexample might weigh up against a single example, but if it turns out that it’s more like one mildly flawed counterexample to ten mildly flawed examples, that should mean something too.

Plus, doing these breakdowns is often not possible with the available data, and would be a lot of work if it was, so it’s not super fair to demand someone do it.↩︎

Overdebunked! Six Statistical Critiques That Don’t Quite Work