When you are doing data communications, what you are generally doing
is taking data, using that data to draw some sort of
conclusion, and then demonstrating that conclusion with the
data in a convincing and accessible way
Statistics, roughly, is the process of drawing conclusions from
data
If you can’t do that right, then at best you’ll be great at
communicating bad or useless information
Data Communications
I’m not going to attempt to squash an entire statistics education into today
I’m going to assume that your technical stats education is to some
degree limited, acknowledge that you will keep building on it and
encourage you to do as much as you can (imagine a data analyst
uninterested in getting as good at statistics as possible! Would you
trust their work?)
I’ll focus on how to get as much right as possible from a conceptual working-with-data standpoint
Many, many, many bad errors are ones that require no
advanced mathematical skill to figure out or correct
Drawing Conclusions from Data
Things to Get Right:
Properly using and understanding the data
Properly understanding what the data tells us (and doesn’t tell us)
about the real world
Creating plausible results
Doing the right calculations
Properly accounting for uncertainty
(many statistics classes only bother with the last two!)
Properly Using and Understanding the Data
Read the data documentation. Don’t just rely on column
names! These give you an indication of what the variable is, but not all
the details of what it means, what question was asked to get it, and so
on.
This includes both documentation about the variables themselves
and information about how the data was collected - who’s in the
sample, how they were interviewed
Be critical! Was there a likely agenda behind data collection or how
things were measured? How big is the sample?
Lots of data is just bad - in any decent analytics house they’ll
start on any new data set by testing it against some base truth. They
won’t take it for granted, and you shouldn’t either
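As a rough sketch of what that first look can involve (the data and the “known total” here are made up for illustration):

```python
import pandas as pd

# Made-up toy version of a "new data set we just received": store-level monthly sales
sales = pd.DataFrame({
    "store_id": [1, 1, 2, 2, 3, 3],
    "month":    ["Jan", "Feb", "Jan", "Feb", "Jan", "Feb"],
    "sales":    [12000, 13500, None, 9800, 11000, -50],
})

# Basic looks before trusting anything
print(sales.isna().sum())          # how much is missing, and where?
print(sales["sales"].describe())   # negative sales? implausible min/max?

# Compare against a base truth known from another source,
# e.g. the revenue total in the company financials (made-up figure here)
KNOWN_TOTAL = 58_000
print(sales["sales"].sum(), "vs known total of", KNOWN_TOTAL)
# If these are wildly different, figure out why before doing any analysis
```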
Properly Using and Understanding the Data
Measurement is important! If we don’t think the measurement is good
and means what we think it means, there’s little point in doing data
analysis anyway
Measurement can be bad because it’s noisy (the data isn’t very precise, has errors in it, or is often missing), because it’s misleading (the question is fine but you’re trying to use the data in the wrong way), because it’s weak (small sample, overly broad categories), or because it’s insufficient (you’re trying to answer a much grander question than the data can really handle)
Garbage in, garbage out
Be Careful!
Nothing worse than finding out after the fact that you were
reading your data wrong
Author Naomi Wolf wrote a book about the treatment of homosexuality in 19th-century England. She reported that it was punishable by death, since “sodomy” charges were followed by “death recorded” in the judicial logs
However, “sodomy” referred to a host of other sexual offenses, and “death recorded” specifically meant that the death sentence was recorded but not carried out - the convicted men weren’t actually executed
The main point of the book fell apart! Because she didn’t understand
the measurements
Be Careful!
Paul
Dolan wrote a book about how marriage makes you miserable
A key part of this was the finding that, in the American Time Use
Survey, married people surveyed about their happiness when the spouse
was present reported being happy, but when the spouse was absent they
reported being unhappy. Covering up unhappiness in front of their
spouse, truly miserable!
But while Dolan read spouse present/absent as “spouse is in the room
during questioning”, it actually meant “spouse together/separated.” He
was comparing couples that were together vs. couples that were separated!
Didn’t read the data documentation. Oops.
What Does the Data Tell Us?
Ok, so we have the data and understand what it is and whether it is
of high quality, and we perform our analysis. Good, right? Maybe!
We have to carefully consider what the data does tell us and what it
doesn’t tell us.
Identification is the econometrics concept that asks “does
the statistical calculation we just ran answer the
theoretical question we have?”
Answering the Question
One angle from which to think about identification is: did you summarize the data in the right way to answer the question?
If you want to show something increased over time, make sure to show
that it increased, not that it is high
If you want to show that something is high/low, be sure we
can see high/low relative to what
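A small sketch of the difference, with made-up numbers:

```python
import pandas as pd

# Made-up toy data: one value per year
df = pd.DataFrame({"year": [2019, 2020, 2021, 2022],
                   "complaints": [140, 155, 180, 210]})

# "It increased": show the change over time, not just the latest level
df["change"] = df["complaints"].diff()
print(df)

# "It is high": high relative to what? Pick an explicit benchmark
benchmark = df["complaints"].iloc[:-1].mean()   # e.g. the average of prior years
latest = df["complaints"].iloc[-1]
print(f"2022 is {latest - benchmark:.0f} above the prior-year average of {benchmark:.0f}")
```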
Answering the Question
If you have a question about individuals, but report totals, your answer combines an answer about individuals with an answer about how many individuals there are
If you’re deciding which type of franchise to open, you’d probably
be more interested in average sales per store than
total sales, right? So show that!
If you want to know which college is more expensive, you’d want
average tuition per student instead of total tuition
for the college
(reporting sums instead of averages when an average is called for is a common tendency I find among business students)
Many other versions of this average/sum problem with different
statistics!
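A quick sketch of the franchise example with made-up numbers - the total and the average point in different directions:

```python
import pandas as pd

# Made-up toy data: sales by franchise type and store
stores = pd.DataFrame({
    "franchise": ["A", "A", "A", "A", "B", "B"],
    "sales":     [100, 110, 90, 105, 150, 160],
})

# Total sales mixes "how well stores do" with "how many stores there are"
print(stores.groupby("franchise")["sales"].sum())   # A looks bigger (405 vs 310)

# Average sales per store answers the per-store question directly
print(stores.groupby("franchise")["sales"].mean())  # B is stronger per store (155 vs ~101)
```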
Theory and Hypothesis
More broadly, if you want to learn anything from your data analysis,
you have to come in with a theory
“Theory” just means there’s a “why” or “because” in there - “sales
went up because there was a promotion in place”
Very rarely can we actually see a “why” or “because” in data.
Instead we get a calculation like “average sales were higher in periods
where a promotion was in place than in periods where there wasn’t”
The calculation is rarely interesting (we really want to
know if those promotions are effective, not whether sales happened
to be higher!). Identification asks “can we learn anything about the
theory from the calculation?”
Theory and Hypothesis
The formal process of statistical identification can be very tough
(although check out my book, Chapter 5!), but there are some easy first stabs you can take at it.
Ask:
If your theory is true, what pattern should you see in the
data and what pattern should you NOT see in the data?
These are hypotheses. Try to test more than one if possible.
Why might we see this pattern in the data? Is my theory the
only reason we would see this pattern? Is it the most likely
reason we’d see this pattern? If not, we aren’t identified.
If you saw a different pattern in the data, would it make
you disbelieve your theory? If not, you probably shouldn’t put
much stock in the fact that you do see that pattern
Example
Let’s take the example of a toy company that wants to know if their
ads work. They have data on how much money they spent on ads each month,
and also how many sales they had each month.
What patterns should we see and not see in the data?
Keep in mind that, given the available data (or the allowable
sophistication of methods) you may not be able to get a pure and clean
identification. But get as close as you can, and be aware of remaining
gaps
Example
An obvious answer:
Sales should be higher in months with more ads
There are other answers (which may lead to better identification) but
let’s go with this and see where it takes us
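A rough sketch (with made-up monthly numbers) of checking whether that pattern even shows up:

```python
import pandas as pd

# Made-up toy data: monthly ad spend and sales
df = pd.DataFrame({
    "month":    range(1, 13),
    "ad_spend": [5, 5, 6, 4, 5, 7, 6, 5, 8, 9, 12, 15],
    "sales":    [50, 52, 55, 48, 51, 60, 58, 53, 70, 75, 90, 110],
})

# Does the predicted pattern show up at all?
high_ads = df["ad_spend"] > df["ad_spend"].median()
print(df.groupby(high_ads)["sales"].mean())   # mean sales in low-ad vs. high-ad months
print(df["ad_spend"].corr(df["sales"]))       # simple correlation

# Seeing the pattern is consistent with the theory, but it doesn't yet
# tell us the ads caused the sales - that's the next question
```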
Example
Why might we see this pattern in the data (that sales are higher in
months with more ads)?
Ads might cause sales to rise
Perhaps the company chooses to advertise more in months it expects
to get more sales anyway
Uh-oh. Maybe we see that “ads are effective” just because December is
a boom month for both ads and sales?
Example
If we saw a different pattern, would that put us off our
theory?
If we saw no relationship (or a negative one) between sales and ads,
we might try to wriggle out of it by saying that maybe they advertise
more in anticipated bad months to shore up sales
Hmm… but if that’s our explanation, then if we did find a
positive relationship, maybe the real explanation is just that they
advertise more in expected banner months. Which is it?
(note we might be able to test this alternate explanation in the data
by, say, predicting sales with previous years’ sales that month)
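A sketch of that alternate-explanation check, using made-up numbers and last year’s same-month sales as a stand-in for anticipated demand:

```python
import pandas as pd
import statsmodels.api as sm

# Made-up toy data: this year's monthly sales and ad spend, plus last year's
# sales in the same month as a proxy for how big the month was expected to be
df = pd.DataFrame({
    "sales":           [50, 52, 55, 48, 51, 60, 58, 53, 70, 75, 90, 110],
    "ad_spend":        [5, 5, 6, 4, 5, 7, 6, 5, 8, 9, 12, 15],
    "sales_last_year": [48, 50, 53, 47, 50, 57, 56, 52, 66, 72, 85, 100],
})

# Raw comparison: ads and sales move together
raw = sm.OLS(df["sales"], sm.add_constant(df[["ad_spend"]])).fit()
print(raw.params)

# Adjusting for last year's same-month sales absorbs some of the
# "they advertise more in months they expect to be big anyway" story
adj = sm.OLS(df["sales"], sm.add_constant(df[["ad_spend", "sales_last_year"]])).fit()
print(adj.params)
# If the ad coefficient shrinks a lot, anticipation was doing much of the work
```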
Example
So what now?
We’d want our analysis to be sure we can account for this possible
“anticipated sales” problem. Otherwise we’re leaving something out and
just showing that sales are higher in high-ad months doesn’t tell us if
ads are effective
(this isn’t a toy problem, by the way - marketing departments have a
heck of a time proving whether ads have any effect at all for this very
reason)
Maybe doing some sort of experiment where ads aren’t set by
anticipation for a while, or having some way of adjusting for
anticipation. These are the alternate calculations that we think might
identify our theory
Another Example
Something I see a lot in student submissions:
You examine some data on, say, sales
Some products/months/areas A have way above-average sales, and
products/months/areas B have way below-average sales
Some students: “to improve sales, the company should focus on
A”
An equal number of other students: “to improve sales, the company
should focus on B”
Another Example
Which is it? Focus on the already-good performers or focus on the
currently-bad performers? Often I see the takeaway given as though
it’s obvious, but clearly half the room comes to a different conclusion,
so it’s not!
Can you do an analysis, or even make an argument, as to why that
same result should imply your takeaway in this context?
Identification!
(and why does nobody ever say to focus on the middling
products??)
Or Just a Gut Check
Another way to avoid identification error is to think about what
you’re really implying with your results and see whether it makes sense
and is supported by the data
A common error I see from business students on this is the
“predictors of success” problem
If you are trying to say something about what makes a successful
business/strategy, and you find that \(X\) is correlated with success, does that
really mean that \(X\) causes,
or even is a useful predictor of, success?
Will wearing black turtlenecks make you more likely to be as
successful as Steve Jobs?
Creating Plausible Results
More broadly, whether we’re talking about identification or not
A great indication that something is wrong about your data or
analysis, or that you’re not identified, is simply in whether your
results make any darn sense
Data is wrong sometimes, and errors can become clear only in
retrospect! So a good gut-checking instinct (or a good noticing
instinct) is really key, especially in areas like business where you
know you’ll get things a little wrong and have imperfect data,
but need to know if it’s too wrong to be useful
Creating Plausible Results
Always be asking yourself:
“If I did this right, what should I see?”
“Does this look like it’s supposed to look?”
“Do I believe this result? If not, is it implausible or
impossible?”
Creating Plausible Results
Sometimes there are genuine surprises in data, and you don’t want to
toss those out, so you’ll be walking the fine line between “this is
strange” and “this must be wrong, I made an error, go back”
This all applies to both data and analysis. Any time you do
something with data, look at the result and make sure it
doesn’t seem wrong. What should be there but isn’t? What shouldn’t be
there but is?
Creating Plausible Results
Explore the data! See what looks way wrong - implausible averages,
outliers that are so far gone they’re probably errors, “missing data”
coded as values like -99, etc.
After each data cleaning step, look at the data to make sure it
actually did what you thought. I can’t stress this enough. Please
do this!
In each analysis, ask yourself what the result is saying and if
it’s at all plausible
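A small sketch of that explore-then-verify habit, using a made-up variable where missing values were coded as -99:

```python
import numpy as np
import pandas as pd

# Made-up raw data where missing values were coded as -99
raw = pd.DataFrame({"age": [34, 29, -99, 41, 25, -99, 38]})

# Explore first: the -99s drag the average way down and are obviously not ages
print(raw["age"].describe())

# Cleaning step: recode -99 to proper missing values
clean = raw.replace(-99, np.nan)

# Then LOOK at the result to confirm the step did what you intended
print(clean["age"].describe())     # min and mean should now be plausible
print(clean["age"].isna().sum())   # same count as the -99s you saw before
```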
Creating Plausible Results
For example, if you find that your sales are only nonzero in two
weeks of the year, that’s probably wrong. Or if sales with an ad in place
are 5x sales without one, that’s probably wrong too!
If your result on when movie releases peaked found that 2021 was the
year with the most movies ever, at 65 total movies… that’s wrong!
Or if you’ve chosen to make a recommendation on the basis of your
method, but your method recommends something impossible (I recommend
that you hire for your new CEO… Thomas Edison!) or clearly wrong (I
recommend that you hire for your new CEO… a cat!) that doesn’t mean
you’ve come to a great surprising conclusion, it means your method is
bad and you should start over with a different method
Doing the Right Calculations
Now we come to actually performing statistical calculations.
Again, I won’t try to squeeze all of statistics into this one
section of one lecture
But some general guiding principles to think about:
Doing the Right Calculations
Statistical calculations aren’t universal - they are intended for
particular contexts and make certain assumptions
These assumptions can be things like “what is the distribution of
the data?” or “what kind of comparison are you trying to make?”
This means that just tossing some data into a calculation you’ve
heard of is very likely to lead to results that don’t mean what you
think they mean
For example, in one of my classes, a huge number of students tried
to use a two-sample t-test to check for a correlation between two
continuous variables. This does not work.
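In that situation, a correlation test (or a regression) answers the actual question - a quick sketch with simulated data:

```python
import numpy as np
from scipy import stats

# Simulated toy data: two continuous variables measured on the same observations
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

# A two-sample t-test compares the means of two groups - it isn't a test
# of association between two continuous variables measured together

# A correlation test matches the question being asked
r, p = stats.pearsonr(x, y)
print(f"correlation = {r:.2f}, p-value = {p:.3g}")
```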
Tips for Wading into Statistical Waters
Don’t use tests or models you don’t understand. Recognize
that the cost of using a fancier model, or just something you’re
unfamiliar with (on top of having to figure out how to explain your
results), is going to be the work of reading all about both the
method and the software that goes with it
If you can’t or won’t do that, don’t use the method. A simpler,
imperfect method you do understand is better than a more complex one you
don’t (as long as you understand the ways it’s imperfect)
Tips for Wading into Statistical Waters
Even if you think you understand the method, read up on the code
used to perform it. It may not work in the way you expect it to
Remember, if you don’t understand it, then you don’t know how it’s
supposed to work, and you have no way of gut-checking it to see if it’s
going wrong
Statistical Significance
One important aside here, as long as we’re talking about using
methods we don’t understand, is statistical significance.
Significance is wildly misused
What is statistical significance?
Significance takes a theoretical statement about how the data is
distributed
It then looks at the actual data and asks “what’s the chance we’d
see data like this if that theoretical statement is true?”
If the probability (p-value) is very low, we say that the
theoretical proposition is unlikely and so reject it
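If it helps, here’s that logic as a small simulation (with made-up numbers): how often would we see data like ours if the theoretical statement were true?

```python
import numpy as np

rng = np.random.default_rng(1)

# Theoretical statement (the null): the true effect is 0, and observations
# come from a normal distribution with standard deviation 1
# Made-up observed result: a sample of 50 with mean 0.35
observed_mean = 0.35
n = 50

# Simulate many samples under the null and ask how often we'd see
# a sample mean at least this far from zero just by chance
sims = rng.normal(loc=0, scale=1, size=(10_000, n)).mean(axis=1)
p_value = np.mean(np.abs(sims) >= observed_mean)
print(f"simulated p-value: {p_value:.3f}")
# A low p-value says "data like this would be rare if the null were true" -
# it says nothing about how big or important the effect is
```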
Statistical Significance
Notice what’s not in there:
How important the effect is
Whether the estimate we do get is right / what the actual
effect is
Whether the estimate we do get is more right than some
other estimate that also doesn’t match the theoretical
distribution
Whether there’s an X% chance of the effect being real or we are “X%
confident” of an effect
Being insignificant doesn’t mean there’s no effect, it just
means there’s not enough evidence to reject a null effect (and
don’t go fishing just to get significance)
Communicating Uncertainty
That said, statistics and data are still all about uncertainty. You
want to get that across to have an honest form of data
communication
Estimates are uncertain! Sampling variation will getcha. So will
uncertainty in what’s left out of the analysis
Non-stats people often don’t want to hear about it, but it’s
important. How can we communicate this?
Communicating Uncertainty
Ways of communicating uncertainty that people tend to be able to
understand are distribution percentiles and confidence
bands
Distribution percentiles are things of the form “X% of the time,
this variable will be below Y”
So for example “average sales are 1000 dollars per month, but 10% of
the time they’re below 500 dollars”
In actual data, that calculation can come from just the proportion
of observations below/above a certain value
In an estimated value, you can use the sampling distribution (like,
calculating a mean and standard deviation and using the normal
distribution) to calculate percentiles
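A quick sketch of both kinds of percentile, using simulated monthly sales:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Simulated toy data: 120 months of sales
sales = rng.normal(loc=1000, scale=300, size=120)

# Percentile straight from the data:
# "10% of the time, monthly sales are below this value"
print(np.percentile(sales, 10))

# Percentiles for an ESTIMATE (the mean), using its sampling distribution:
# the mean, its standard error, and the normal distribution
mean = sales.mean()
se = sales.std(ddof=1) / np.sqrt(len(sales))
print(norm.ppf([0.10, 0.90], loc=mean, scale=se))
```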
Communicating Uncertainty
Confidence bands are related to statistical significance but come at
it in a different way that maybe communicates the uncertainty more
accurately (and in more detail than just an up/down
significant/insignificant)
Basically, you want to describe the distribution of the estimate
from a low percentile to a high one so as to get across how much it
might vary from sample to sample
Plus, these are easy to graph
And they work with forecasts too (although technically those are
“forecast intervals”)
Communicating Uncertainty
So perhaps you find that, on average, sales with an ad in place are
$120 higher than sales without an ad in place
Using the distribution might say “on average, sales are $120 higher
with an ad in place, but it varies, with about 10% of sales only being
$15 higher than normal”
The confidence band might be “we’d expect that 95% of sales with ads
in place will be between $5 and $225 higher than normal sales”
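A rough sketch of where numbers like these could come from, using simulated data (the figures below are made up, not the $120/$15/$5/$225 from the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Simulated toy data: sales in periods with and without an ad in place
with_ad = rng.normal(loc=1120, scale=200, size=60)
without_ad = rng.normal(loc=1000, scale=200, size=60)

# Distribution-style statement: how much do individual ad-period sales vary?
diffs = with_ad - without_ad.mean()
print(np.percentile(diffs, 10))   # "10% of ad-period sales are only this much higher"

# Confidence band for the AVERAGE difference
diff = with_ad.mean() - without_ad.mean()
se = np.sqrt(with_ad.var(ddof=1)/len(with_ad) + without_ad.var(ddof=1)/len(without_ad))
lo, hi = stats.norm.interval(0.95, loc=diff, scale=se)
print(f"average difference {diff:.0f}, 95% band [{lo:.0f}, {hi:.0f}]")
```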