When you are doing data communications, what you are generally doing
is taking data, using that data to draw some sort of
conclusion, and then demonstrating that conclusion with the
data in a convincing and accessible way
Statistics, roughly, is the process of drawing conclusions from
data
If you can’t do that right, then at best you’ll be great at
communicating bad or useless information
Data Communications
I’m not going to attempt to squash an entire statistics education into today
I’m going to assume that your technical stats education is to some
degree limited, acknowledge that you will keep building on it and
encourage you to do as much as you can (imagine a data analyst
uninterested in getting as good at statistics as possible! Would you
trust their work?)
I’ll focus on how to get as much right as possible from a conceptual working-with-data standpoint
Many, many, many bad errors are ones that require no
advanced mathematical skill to figure out or correct
Drawing Conclusions from Data
Things to Get Right:
Properly using and understanding the data
Properly understanding what the data tells us (and doesn’t tell us)
about the real world
Creating plausible results
Doing the right calculations
Properly accounting for uncertainty
(many statistics classes only bother with the last two!)
Properly Using and Understanding the Data
Read the data documentation. Don’t just rely on column
names! These give you an indication of what the variable is, but not all
the details of what it means, what question was asked to get it, and so
on.
This includes both documentation about the variables themselves
and information about how the data was collected - who’s in the
sample, how they were interviewed
Be critical! Was there a likely agenda behind data collection or how
things were measured? How big is the sample?
Lots of data is just bad - in any decent analytics house they’ll
start on any new data set by testing it against some base truth. They
won’t take it for granted, and you shouldn’t either
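As a rough sketch of what that first look can involve (the data and the “known total” here are made up for illustration):

```python
import pandas as pd

# Made-up toy version of a "new data set we just received": store-level monthly sales
sales = pd.DataFrame({
    "store_id": [1, 1, 2, 2, 3, 3],
    "month":    ["Jan", "Feb", "Jan", "Feb", "Jan", "Feb"],
    "sales":    [12000, 13500, None, 9800, 11000, -50],
})

# Basic looks before trusting anything
print(sales.isna().sum())          # how much is missing, and where?
print(sales["sales"].describe())   # negative sales? implausible min/max?

# Compare against a base truth known from another source,
# e.g. the revenue total in the company financials (made-up figure here)
KNOWN_TOTAL = 58_000
print(sales["sales"].sum(), "vs known total of", KNOWN_TOTAL)
# If these are wildly different, figure out why before doing any analysis
```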
Properly Using and Understanding the Data
Measurement is important! If we don’t think the measurement is good
and means what we think it means, there’s little point in doing data
analysis anyway
Measurement can be bad because it’s noisy (the data isn’t very precise, has errors in it, or is often missing), because it’s misleading (the question is fine but you’re trying to use the data in the wrong way), because it’s weak (small sample, overly broad categories), or because it’s insufficient (you’re trying to answer a much grander question than the data can really handle)
Garbage in, garbage out
Be Careful!
Nothing worse than finding out after the fact that you were
reading your data wrong
Author Naomi Wolf wrote a book about the treatment of homosexuality in 19th-century England. She reported that it was punishable by death, since “sodomy” charges were followed by “death recorded” in the judicial logs
However, “sodomy” referred to a host of other sexual offenses, and “death recorded” specifically meant that the death sentence was recorded but not carried out - the convicted men weren’t actually executed
The main point of the book fell apart! Because she didn’t understand
the measurements
Be Careful!
Paul
Dolan wrote a book about how marriage makes you miserable
A key part of this was the finding that, in the American Time Use
Survey, married people surveyed about their happiness when the spouse
was present reported being happy, but when the spouse was absent they
reported being unhappy. Covering up unhappiness in front of their
spouse, truly miserable!
But while Dolan read spouse present/absent as “spouse is in the room
during questioning”, it actually meant “spouse together/separated.” He
was comparing couples that were together vs. couples that were separated!
Didn’t read the data documentation. Oops.
What Does the Data Tell Us?
Ok, so we have the data and understand what it is and whether it is
of high quality, and we perform our analysis. Good, right? Maybe!
We have to carefully consider what the data does tell us and what it
doesn’t tell us.
Identification is the econometrics concept that asks “does
the statistical calculation we just ran answer the
theoretical question we have?”
Answering the Question
One angle from which to think about identification is: did you summarize the data in the right way to answer the question?
If you want to show something increased over time, make sure to show
that it increased, not that it is high
If you want to show that something is high/low, be sure we
can see high/low relative to what
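A small sketch of the difference, with made-up numbers:

```python
import pandas as pd

# Made-up toy data: one value per year
df = pd.DataFrame({"year": [2019, 2020, 2021, 2022],
                   "complaints": [140, 155, 180, 210]})

# "It increased": show the change over time, not just the latest level
df["change"] = df["complaints"].diff()
print(df)

# "It is high": high relative to what? Pick an explicit benchmark
benchmark = df["complaints"].iloc[:-1].mean()   # e.g. the average of prior years
latest = df["complaints"].iloc[-1]
print(f"2022 is {latest - benchmark:.0f} above the prior-year average of {benchmark:.0f}")
```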
Answering the Question
If you have a question about individuals, but report totals, your answer combines an answer about individuals with an answer about how many individuals there are
If you’re deciding which type of franchise to open, you’d probably
be more interested in average sales per store than
total sales, right? So show that!
If you want to know which college is more expensive, you’d want
average tuition per student instead of total tuition
for the college
(reporting sums instead of averages when an average is called for is a common tendency I find among business students)
Many other versions of this average/sum problem with different
statistics!
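A quick sketch of the franchise example with made-up numbers - the total and the average point in different directions:

```python
import pandas as pd

# Made-up toy data: sales by franchise type and store
stores = pd.DataFrame({
    "franchise": ["A", "A", "A", "A", "B", "B"],
    "sales":     [100, 110, 90, 105, 150, 160],
})

# Total sales mixes "how well stores do" with "how many stores there are"
print(stores.groupby("franchise")["sales"].sum())   # A looks bigger (405 vs 310)

# Average sales per store answers the per-store question directly
print(stores.groupby("franchise")["sales"].mean())  # B is stronger per store (155 vs ~101)
```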
Theory and Hypothesis
More broadly, if you want to learn anything from your data analysis,
you have to come in with a theory
“Theory” just means there’s a “why” or “because” in there - “sales
went up because there was a promotion in place”
Very rarely can we actually see a “why” or “because” in data.
Instead we get a calculation like “average sales were higher in periods
where a promotion was in place than in periods where there wasn’t”
The calculation is rarely interesting (we really want to
know if those promotions are effective, not whether sales happened
to be higher!). Identification asks “can we learn anything about the
theory from the calculation?”
Theory and Hypothesis
The formal process of statistical identification can be very tough
(although check out my book, Chapter 5!), but there are some easy first stabs you can take at it.
Ask:
If your theory is true, what pattern should you see in the
data and what pattern should you NOT see in the data?
These are hypotheses. Try to test more than one if possible.
Why might we see this pattern in the data? Is my theory the
only reason we would see this pattern? Is it the most likely
reason we’d see this pattern? If not, we aren’t identified.
If you saw a different pattern in the data, would it make
you disbelieve your theory? If not, you probably shouldn’t put
much stock in the fact that you do see that pattern
Example
Let’s take the example of a toy company that wants to know if their
ads work. They have data on how much money they spent on ads each month,
and also how many sales they had each month.
What patterns should we see and not see in the data?
Keep in mind that, given the available data (or the allowable
sophistication of methods) you may not be able to get a pure and clean
identification. But get as close as you can, and be aware of remaining
gaps
Example
An obvious answer:
Sales should be higher in months with more ads
There are other answers (which may lead to better identification) but
let’s go with this and see where it takes us
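A rough sketch (with made-up monthly numbers) of checking whether that pattern even shows up:

```python
import pandas as pd

# Made-up toy data: monthly ad spend and sales
df = pd.DataFrame({
    "month":    range(1, 13),
    "ad_spend": [5, 5, 6, 4, 5, 7, 6, 5, 8, 9, 12, 15],
    "sales":    [50, 52, 55, 48, 51, 60, 58, 53, 70, 75, 90, 110],
})

# Does the predicted pattern show up at all?
high_ads = df["ad_spend"] > df["ad_spend"].median()
print(df.groupby(high_ads)["sales"].mean())   # mean sales in low-ad vs. high-ad months
print(df["ad_spend"].corr(df["sales"]))       # simple correlation

# Seeing the pattern is consistent with the theory, but it doesn't yet
# tell us the ads caused the sales - that's the next question
```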
Example
Why might we see this pattern in the data (that sales are higher in
months with more ads)?
Ads might cause sales to rise
Perhaps the company chooses to advertise more in months it expects
to get more sales anyway
Uh-oh. Maybe we see that “ads are effective” just because December is
a boom month for both ads and sales?
Example
If we saw a different pattern, would that put us off our
theory?
If we saw no relationship (or a negative one) between sales and ads,
we might try to wriggle out of it by saying that maybe they advertise
more in anticipated bad months to shore up sales
Hmm… but if that’s our explanation, then if we did find a
positive relationship, maybe the real explanation is just that they
advertise more in expected banner months. Which is it?
(note we might be able to test this alternate explanation in the data
by, say, predicting sales with previous years’ sales that month)
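A sketch of that alternate-explanation check, using made-up numbers and last year’s same-month sales as a stand-in for anticipated demand:

```python
import pandas as pd
import statsmodels.api as sm

# Made-up toy data: this year's monthly sales and ad spend, plus last year's
# sales in the same month as a proxy for how big the month was expected to be
df = pd.DataFrame({
    "sales":           [50, 52, 55, 48, 51, 60, 58, 53, 70, 75, 90, 110],
    "ad_spend":        [5, 5, 6, 4, 5, 7, 6, 5, 8, 9, 12, 15],
    "sales_last_year": [48, 50, 53, 47, 50, 57, 56, 52, 66, 72, 85, 100],
})

# Raw comparison: ads and sales move together
raw = sm.OLS(df["sales"], sm.add_constant(df[["ad_spend"]])).fit()
print(raw.params)

# Adjusting for last year's same-month sales absorbs some of the
# "they advertise more in months they expect to be big anyway" story
adj = sm.OLS(df["sales"], sm.add_constant(df[["ad_spend", "sales_last_year"]])).fit()
print(adj.params)
# If the ad coefficient shrinks a lot, anticipation was doing much of the work
```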
Example
So what now?
We’d want our analysis to be sure we can account for this possible
“anticipated sales” problem. Otherwise we’re leaving something out and
just showing that sales are higher in high-ad months doesn’t tell us if
ads are effective
(this isn’t a toy problem, by the way - marketing departments have a
heck of a time proving whether ads have any effect at all for this very
reason)
Maybe doing some sort of experiment where ads aren’t set by
anticipation for a while, or having some way of adjusting for
anticipation. These are the alternate calculations that we think might
identify our theory
Another Example
Something I see a lot in student submissions:
You examine some data on, say, sales
Some products/months/areas A have way above-average sales, and
products/months/areas B have way below-average sales
Some students: “to improve sales, the company should focus on
A”
An equal number of other students: “to improve sales, the company
should focus on B”
Another Example
Which is it? Focus on the already-good performers or focus on the
currently-bad performers? Often I see the takeaway given as though
it’s obvious, but clearly half the room comes to a different conclusion,
so it’s not!
Can you do an analysis, or even make an argument, as to why that
same result should imply your takeaway in this context?
Identification!
(and why does nobody ever say to focus on the middling
products??)
Or Just a Gut Check
Another way to avoid identification error is to think about what
you’re really implying with your results and see whether it makes sense
and is supported by the data
A common error I see from business students on this is the
“predictors of success” problem
If you are trying to say something about what makes a successful
business/strategy, and you find that \(X\) is correlated with success, does that
really mean that \(X\) causes,
or even is a useful predictor of, success?
Will wearing black turtlenecks make you more likely to be as
successful as Steve Jobs?
Creating Plausible Results
More broadly, whether we’re talking about identification or not
A great indication that something is wrong about your data or
analysis, or that you’re not identified, is simply in whether your
results make any darn sense
Data is wrong sometimes, and errors can become clear only in
retrospect! So a good gut-checking instinct (or a good noticing
instinct) is really key, especially in areas like business where you
know you’ll get things a little wrong and have imperfect data,
but need to know if it’s too wrong to be useful
Creating Plausible Results
Always be asking yourself:
“If I did this right, what should I see?”
“Does this look like it’s supposed to look?”
“Do I believe this result? If not, is it implausible or
impossible?”
Creating Plausible Results
Sometimes there are genuine surprises in data, and you don’t want to
toss those out, so you’ll be walking the fine line between “this is
strange” and “this must be wrong, I made an error, go back”
This all applies to both data and analysis. Any time you do
something with data, look at the result and make sure it
doesn’t seem wrong. What should be there but isn’t? What shouldn’t be
there but is?
Creating Plausible Results
Explore the data! See what looks way wrong - implausible averages,
outliers that are so far gone they’re probably errors, “missing data”
coded as values like -99, etc.
After each data cleaning step, look at the data to make sure it
actually did what you thought. I can’t stress this enough. Please
do this!
In each analysis, ask yourself what the result is saying and if
it’s at all plausible
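A small sketch of that explore-then-verify habit, using a made-up variable where missing values were coded as -99:

```python
import numpy as np
import pandas as pd

# Made-up raw data where missing values were coded as -99
raw = pd.DataFrame({"age": [34, 29, -99, 41, 25, -99, 38]})

# Explore first: the -99s drag the average way down and are obviously not ages
print(raw["age"].describe())

# Cleaning step: recode -99 to proper missing values
clean = raw.replace(-99, np.nan)

# Then LOOK at the result to confirm the step did what you intended
print(clean["age"].describe())     # min and mean should now be plausible
print(clean["age"].isna().sum())   # same count as the -99s you saw before
```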
Creating Plausible Results
For example, if you find that your sales are only nonzero in two
weeks of the year, that’s probably wrong. Or if sales with an ad in place
are 5x sales without one, that’s probably wrong too!
If your result on when movie releases peaked found that 2021 was the
year with the most movies ever, at 65 total movies… that’s wrong!
Or if you’ve chosen to make a recommendation on the basis of your
method, but your method recommends something impossible (I recommend
that you hire for your new CEO… Thomas Edison!) or clearly wrong (I
recommend that you hire for your new CEO… a cat!) that doesn’t mean
you’ve come to a great surprising conclusion, it means your method is
bad and you should start over with a different method
Doing the Right Calculations
Now we come to actually performing statistical calculations.
Again, I won’t try to squeeze all of statistics into this one
section of one lecture
But some general guiding principles to think about:
Doing the Right Calculations
Statistical calculations aren’t universal - they are intended for
particular contexts and make certain assumptions
These assumptions can be things like “what is the distribution of
the data?” or “what kind of comparison are you trying to make?”
This means that just tossing some data into a calculation you’ve
heard of is very likely to lead to results that don’t mean what you
think they mean
For example, in one of my classes, a huge number of students tried
to use a two-sample t-test to check for a correlation between two
continuous variables. This does not work.
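In that situation, a correlation test (or a regression) answers the actual question - a quick sketch with simulated data:

```python
import numpy as np
from scipy import stats

# Simulated toy data: two continuous variables measured on the same observations
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

# A two-sample t-test compares the means of two groups - it isn't a test
# of association between two continuous variables measured together

# A correlation test matches the question being asked
r, p = stats.pearsonr(x, y)
print(f"correlation = {r:.2f}, p-value = {p:.3g}")
```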
Tips for Wading into Statistical Waters
Don’t use tests or models you don’t understand. Recognize
that the cost of using a fancier model, or just something you’re
unfamiliar with (on top of having to figure out how to explain your
results), is going to be the work of reading all about both the
method and the software that goes with it
If you can’t or won’t do that, don’t use the method. A simpler,
imperfect method you do understand is better than a more complex one you
don’t (as long as you understand the ways it’s imperfect)
Tips for Wading into Statistical Waters
Even if you think you understand the method, read up on the code
used to perform it. It may not work in the way you expect it to
Remember, if you don’t understand it, then you don’t know how it’s
supposed to work, and you have no way of gut-checking it to see if it’s
going wrong
Statistical Significance
One important aside here, as long as we’re talking about using
methods we don’t understand, is statistical significance.
Significance is wildly misused
What is statistical significance?
Significance takes a theoretical statement about how the data is
distributed
It then looks at the actual data and asks “what’s the chance we’d
see data like this if that theoretical statement is true?”
If the probability (p-value) is very low, we say that the
theoretical proposition is unlikely and so reject it
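If it helps, here’s that logic as a small simulation (with made-up numbers): how often would we see data like ours if the theoretical statement were true?

```python
import numpy as np

rng = np.random.default_rng(1)

# Theoretical statement (the null): the true effect is 0, and observations
# come from a normal distribution with standard deviation 1
# Made-up observed result: a sample of 50 with mean 0.35
observed_mean = 0.35
n = 50

# Simulate many samples under the null and ask how often we'd see
# a sample mean at least this far from zero just by chance
sims = rng.normal(loc=0, scale=1, size=(10_000, n)).mean(axis=1)
p_value = np.mean(np.abs(sims) >= observed_mean)
print(f"simulated p-value: {p_value:.3f}")
# A low p-value says "data like this would be rare if the null were true" -
# it says nothing about how big or important the effect is
```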
Statistical Significance
Notice what’s not in there:
How important the effect is
Whether the estimate we do get is right / what the actual
effect is
Whether the estimate we do get is more right than some
other estimate that also doesn’t match the theoretical
distribution
Whether there’s an X% chance of the effect being real or we are “X%
confident” of an effect
Being insignificant doesn’t mean there’s no effect, it just
means there’s not enough evidence to reject a null effect (and
don’t go fishing just to get significance)
Communicating Uncertainty
That said, statistics and data are still all about uncertainty. You
want to get that across to have an honest form of data
communication
Estimates are uncertain! Sampling variation will getcha. So will
uncertainty in what’s left out of the analysis
Non-stats people often don’t want to hear about it, but it’s
important. How can we communicate this?
Communicating Uncertainty
Ways of communicating uncertainty that people tend to be able to
understand are distribution percentiles and confidence
bands
Distribution percentiles are things of the form “X% of the time,
this variable will be below Y”
So for example “average sales are 1000 dollars per month, but 10% of
the time they’re below 500 dollars”
In actual data, that calculation can come from just the proportion
of observations below/above a certain value
In an estimated value, you can use the sampling distribution (like,
calculating a mean and standard deviation and using the normal
distribution) to calculate percentiles
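A quick sketch of both kinds of percentile, using simulated monthly sales:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Simulated toy data: 120 months of sales
sales = rng.normal(loc=1000, scale=300, size=120)

# Percentile straight from the data:
# "10% of the time, monthly sales are below this value"
print(np.percentile(sales, 10))

# Percentiles for an ESTIMATE (the mean), using its sampling distribution:
# the mean, its standard error, and the normal distribution
mean = sales.mean()
se = sales.std(ddof=1) / np.sqrt(len(sales))
print(norm.ppf([0.10, 0.90], loc=mean, scale=se))
```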
Communicating Uncertainty
Confidence bands are related to statistical significance but come at
it in a different way that maybe communicates the uncertainty more
accurately (and in more detail than just an up/down
significant/insignificant)
Basically, you want to describe the distribution of the estimate
from a low percentile to a high one so as to get across how much it
might vary from sample to sample
Plus, these are easy to graph
And they work with forecasts too (although technically those are
“forecast intervals”)
Communicating Uncertainty
So perhaps you find that, on average, sales with an ad in place are
$120 higher than sales without an ad in place
Using the distribution might say “on average, sales are $120 higher
with an ad in place, but it varies, with about 10% of sales only being
$15 higher than normal”
The confidence band might be “we’d expect that 95% of sales with ads
in place will be between $5 and $225 higher than normal sales”
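A rough sketch of where numbers like these could come from, using simulated data (the figures below are made up, not the $120/$15/$5/$225 from the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Simulated toy data: sales in periods with and without an ad in place
with_ad = rng.normal(loc=1120, scale=200, size=60)
without_ad = rng.normal(loc=1000, scale=200, size=60)

# Distribution-style statement: how much do individual ad-period sales vary?
diffs = with_ad - without_ad.mean()
print(np.percentile(diffs, 10))   # "10% of ad-period sales are only this much higher"

# Confidence band for the AVERAGE difference
diff = with_ad.mean() - without_ad.mean()
se = np.sqrt(with_ad.var(ddof=1)/len(with_ad) + without_ad.var(ddof=1)/len(without_ad))
lo, hi = stats.norm.interval(0.95, loc=diff, scale=se)
print(f"average difference {diff:.0f}, 95% band [{lo:.0f}, {hi:.0f}]")
```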