Lecture 1: Describing Data

Nick Huntington-Klein

2020-12-09

Causality

  • This class is about causality
  • Welcome!
  • This class exists to answer one question: how can we use statistics to figure out how \(X\) causes \(Y\)?
  • It’s a short question but an extremely difficult one

This Class

  • We’ll be covering the purpose of statistical research and how it works
  • Then the concepts underlying causality and research design
  • And then some standard research designs for uncovering causality in observational data

Housekeeping

  • Let’s go over the syllabus and projects!
  • The textbook: The Effect

This Week

  • We’ll be discussing how we describe variables and how we describe relationships
  • With a bit of an R reminder
  • We’ll cover a bit of regression review, but…!
  • This class is much more concerned with design than with any particular estimator. How is the data used?
  • Regression is one way of doing this, but it’s only one implementation, so we won’t focus solely on it

This Week

  • We’ll start with ways of discussing how we can describe variables
  • And then move on to ways of discussing how we can describe relationships
  • Secretly, pretty much all statistical analysis is just about doing one of those two things
  • Causal analysis is purely about knowing exactly which variables and relationships to describe

Variables

  • A statistical variable is a recorded observation, repeated many times
  • “Number of calories I ate for breakfast this morning” is one observation
  • “The number of calories I ate each breakfast in the past week” is a variable with seven observations
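  • In R terms, a variable is just a vector of observations - a minimal sketch with made-up calorie counts:
breakfast_calories <- c(310, 450, 0, 380, 520, 290, 410) # seven made-up breakfasts
length(breakfast_calories) # one variable, seven observations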

The Distribution of a Variable

  • Variables have distributions
  • The distribution of a variable is simply the description of how often each value of the variable comes up
  • So for example, the statement “10% of people are left-handed” is just a partial description of the distribution of the handedness variable.
  • If you observe a bunch of people and record what their dominant hand is, 10% of the time you’ll write down “left-handed,” 1% of the time you’ll write down “ambidextrous,” and 89% of the time you’ll write down “right-handed.” That’s the full description of the distribution

Looking Straight at a Distribution

  • The distribution of a variable contains everything we know about that variable from empirical observation
  • Any description we make will be a summary of that distribution
  • So we may as well look at it directly!

Distributions of Kinds of Variables

  • There are two main kinds of variables for which the distributions look different: discrete and continuous
  • Discrete variables take a finite set of values: left-handed, right-handed, ambidextrous. Or “lives in Seattle” vs. “Doesn’t” or “Number of kids”
  • Continuous variables take any value: income, height, kWh of electricity used each day
  • (Sometimes, “ordinal” discrete variables with many values are treated as continuous for simplicity)

Discrete Distributions

  • To fully describe the distribution of a discrete variable, just give the proportion of time it takes each value. That’s it!
  • Give a table with the proportions (or counts), or show a graph with the proportions
library(tidyverse)
# handedness_data is assumed here: a data frame with a `hand` column, created in setup
handedness_data %>%
  pull(hand) %>%   # extract the hand column as a vector
  table() %>%      # count each value
  prop.table()     # turn the counts into proportions
## .
## Ambidextrous         Left        Right 
##        0.005        0.090        0.905

Discrete Distributions

ggplot(handedness_data, aes(x = hand)) + 
  geom_bar(fill = 'white', color = 'black') + # These two lines are important
  stat_count(geom = "text", size = 5,
             aes(label = scales::percent(..count../nrow(handedness_data)), 
                 y = ..count.. + 1300)) +
  ggpubr::theme_pubr() + 
  labs(x = 'Handedness', y = 'Count') # The rest is just decoration

Discrete Distributions

[Figure: bar chart of handedness counts with percentage labels, as produced by the code above]

Using Discrete Distributions

  • What can we use a discrete distribution to say?
  • X% of observations are in category A
  • (X+Y)% of observations are in category (A or B)
  • If it’s “ordinal” (the values have an order), we can describe the median, max, min, etc.
  • There are also dispersion measures describing how evenly distributed the categories are, but we won’t be going into that

Continuous Distributions

  • Variables that are numeric in nature and take many values have a continuous distribution
  • Their distributions can be presented in one of two main ways - using a histogram or using a density distribution
  • A histogram splits the range of the data up into bins and then just treats it like an ordinal discrete distribution
  • A density distribution uses a rolling average of the proportion of observations within each window

Continuous Distributions

# Scorecard: data on US colleges, assumed loaded in setup (it ships with the vtable package)
ggplot(Scorecard, aes(x = repay_rate)) + 
  geom_histogram(bins = 5, fill = 'white', color = 'black') + 
  ggpubr::theme_pubr() + 
  labs(x = 'Proportion of Grads on-track to Repay Loans', y = 'Count', title = 'Loan Repayment by College')
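  • The density version looks much the same - a sketch, assuming the same Scorecard data:
ggplot(Scorecard, aes(x = repay_rate)) + 
  geom_density(fill = 'white', color = 'black') + 
  ggpubr::theme_pubr() + 
  labs(x = 'Proportion of Grads on-track to Repay Loans', y = 'Density', title = 'Loan Repayment by College')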

Continuous Distributions

[Figures: histograms and density plots of the repayment-rate distribution]

Continuous Distributions

  • We can describe these distributions fully using percentiles
  • The Xth percentile is the value for which X% of the sample is less than that value
  • Taken together, you can describe the entire sample by just going through percentiles
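  • For example, a sketch pulling a few percentiles of the repayment rate (same Scorecard data as above):
quantile(Scorecard$repay_rate, probs = c(.1, .25, .5, .75, .9), na.rm = TRUE)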

Continuous Distributions

[Figure: percentiles of the repayment-rate distribution]

Summarizing Continuous Data

  • Commonly we want to describe these distributions much more compactly, while still telling us something about them
library(vtable)
Scorecard %>%
  select(repay_rate) %>%
  sumtable()
Summary Statistics

Variable     N      Mean   Std. Dev.  Min    Pctl. 25  Pctl. 75  Max
repay_rate   20890  0.576  0.182      0.107  0.437     0.718     0.975

Summarizing Continuous Data

  • Every “summary statistic” of a given variable is just a way of describing some aspect of these distributions
  • Commonly we are focused on just a few important features of the distribution:
  • The central tendency
  • Dispersion

The Central Tendency

  • Central tendencies are ways of picking a single number that represents the variable best
  • Often: the mean
  • The median (50th percentile)
  • For categorical data, sometimes the mode

The Central Tendency

  • The median is good at being representative of a typical observation, and is not sensitive to outliers
  • The mean can be better thought of as a betting average. If you “bet the mean” and drew an infinite number of observations, you’d break even
  • If Jeff Bezos walks in the room, mean income shoots through the roof (because if you’re randomly drawing people in the room, sometimes you’re Jeff Bezos!), but the median largely remains unchanged (because Jeff Bezos isn’t anywhere near being a typical person)
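  • A sketch of the Jeff Bezos effect with made-up incomes:
incomes <- c(40000, 52000, 48000, 61000, 55000)  # made-up incomes
mean(incomes); median(incomes)                   # 51200 and 52000 - close together
mean(c(incomes, 2e10)); median(c(incomes, 2e10)) # mean explodes; median only moves to 53500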

The Central Tendency

  • So why use the mean at all? It makes sense to think about those betting odds if you are, say, trying to predict something
  • It also has a bunch of nice statistical properties
  • Meaning, we understand the mean fairly well, and we know how the mean changes as we go from sample to sample
  • In other words, it’s handy when we’re trying to learn about the theoretical distribution our data comes from (more on that in a bit!)

Dispersion

  • Measures of dispersion tell us how spread out the data is
  • Some of these are percentile-based measures, like the inter-quartile range (75th percentile minus 25th) or the range (Max - Min, or 100th percentile minus 0th)
  • Most commonly we will use standard deviation and variance
Summary Statistics

Variable     N      Mean   Std. Dev.  Min    Pctl. 25  Pctl. 75  Max
repay_rate   20890  0.576  0.182      0.107  0.437     0.718     0.975

Dispersion

  • Variance is the average squared deviation from the mean
  • Take each observation, subtract the mean, square the result, and take the mean of that (times \(n/(n-1)\) )
  • Standard deviation is the square root of the variance
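  • A sketch of that computation by hand, checked against R’s built-in var() and sd():
x <- c(2, 4, 4, 4, 5, 5, 7, 9)                   # made-up data
n <- length(x)
var_by_hand <- mean((x - mean(x))^2) * n/(n - 1) # avg. squared deviation, times n/(n-1)
all.equal(var_by_hand, var(x))                   # TRUE
all.equal(sqrt(var_by_hand), sd(x))              # TRUE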

Skew

  • One other aspect of a distribution we sometimes consider is skew
  • Skew is, roughly, how much the distribution leans to one side or the other
  • This can be a problem if the skew is extreme
  • Extreme right skew can make means highly unrepresentative as a few big observations pull the mean upwards
  • This can sometimes be helped by taking a log of the data
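  • A sketch of how taking a log tames right skew, using simulated “income” data:
set.seed(123)
income <- exp(rnorm(1000, mean = 10, sd = 1)) # simulated, heavily right-skewed
mean(income); median(income)                  # mean sits well above the median
mean(log(income)); median(log(income))        # nearly equal once logged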

Theoretical Distributions

  • Now for the good stuff!
  • We rarely actually care what our data is, or what its distribution is!
  • What we actually care about is what broader inferences we can draw from the data we see!
  • The mean of your variable is just the mean of the observations you happened to sample
  • But what can we learn about how that variable works overall from that?

Theoretical Distributions

  • There is a “population distribution” that we can’t see - it’s theoretical - we just get a sample
  • If we had infinite data, we’d see the theoretical distribution
  • To learn about that theoretical distribution, we take what we know about sampling variation and use it to rule out certain theoretical distributions as unlikely

Theoretical Distributions

  • For example, if we flip a coin 1000 times and get heads 999 times, can we rule out that it’s a fair coin?
  • (the “theoretical distribution” here is a discrete one: the coin is heads 50% of the time and tails 50%)
  • We assume that the coin is fair (the null hypothesis) and see how unlikely the data is. If the coin is fair, we take what we know about sampling variation for a binary variable and calculate that 999/1000 heads has a dbinom(999, 1000, .5) chance of happening (computed below)
  • So that’s probably not the real theoretical distribution!
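  • Running that calculation (the exact value is astronomically small):
dbinom(999, 1000, .5) # about 9.3e-299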

Theoretical Distributions

Reminders:

  • All we’ve shown is that a particular theoretical distribution is unlikely, not anything else
  • We don’t know what the proper theoretical distribution is
  • We haven’t shown that our result is important
  • We have effectively calculated a p-value here. If it’s low enough, we say “statistically significant” - but please don’t be fooled into thinking that means anything other than what we’ve said here: a particular theoretical distribution is statistically unlikely to have generated this data

Sampling Variation

  • Often when trying to generalize from a sample to a theoretical distribution we will focus on the mean
  • This is because the sampling variation of the mean is very well understood: it follows a normal distribution with a mean at the population mean, and a standard deviation equal to the population standard deviation scaled by \(1/\sqrt{n}\) (see the simulation sketch below)
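  • A quick simulation sketch: even when the underlying data is uniform (nothing like normal), the sample means come out normal with standard deviation \(\sigma/\sqrt{n}\):
set.seed(1)
sample_means <- replicate(5000, mean(runif(100))) # 5000 samples of n = 100
sd(sample_means)        # close to...
(1/sqrt(12))/sqrt(100)  # ...sigma/sqrt(n) for a uniform(0,1): about 0.029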

Sampling Variation

\[ \bar{X} = \hat{\mu} \sim N(\mu, \sigma/\sqrt{n}) \]

  • Latin letters ( \(X, n\) ) are data
  • Modified Latin letters ( \(\bar{X}\) ) are calculations made with data
  • Greek letters ( \(\mu\) ) are population values/“the truth”
  • Modified Greek letters ( \(\hat{\mu}\) ) are our estimate of the truth

Sampling Variation

  • With 20890 observations, the average of that repayment rate is 0.576, standard deviation is 0.182
  • Does the average college have a repayment rate of 50%?
  • If it does, then the mean of a sample of 20890 observations should follow a distribution of \(N(.5, \sigma/\sqrt{20890})\)
  • Estimate \(\hat{\sigma}\) using sample standard deviation \(s\) (with a correction)

Sampling Variation

  • So if the true distribution has a mean of 50% (whatever kind of distribution it is, as long as it has a mean - we don’t need to assume the distribution is normal, the sampling variation of the mean will be normal anyway), then \(\bar{X} \sim N(.5, .0013)\)
  • The probability of getting \(\bar{X} =\) 0.576 or something even higher is 1-pnorm(.576, .5, .0013) = 0. It rounds to 0 here but it’s just an extremely small number. This is a one-tailed p-value
  • The probability of getting \(\bar{X} =\) 0.576 or something equally far away or farther from .5 in either direction is 2*(1-pnorm(.576, .5, .0013)), a two-tailed p-value (see the sketch below)
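  • The same calculation in R, as a sketch using the rounded values from the summary table:
xbar <- 0.576; s <- 0.182; n <- 20890
se <- s/sqrt(n)                         # about 0.0013
1 - pnorm(xbar, mean = .5, sd = se)     # one-tailed p-value (effectively 0)
2*(1 - pnorm(xbar, mean = .5, sd = se)) # two-tailed p-value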

Sampling Variation

  • Another, possibly better way to think about it is what range of theoretical distributions we wouldn’t find unlikely
  • We have to first define “unlikely” for this, often with a p-value cutoff
  • Then, a confidence interval around the actual sample mean tells us which theoretical means would not find the existing data “too unlikely”
  • \(C.I. = [\bar{X} \pm z_\alpha\hat{\sigma}/\sqrt{n}]\), where \(z_\alpha\) is based on our “too unlikely” definition. For a 95% confidence interval (“too unlikely” is a two-tailed p-value below .05), it’s \(z_{.05} \approx 1.96\)
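  • In R, a sketch of that 95% confidence interval with the values from above:
xbar <- 0.576; s <- 0.182; n <- 20890
xbar + c(-1, 1)*qnorm(.975)*s/sqrt(n) # roughly 0.574 to 0.578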

Statistical Inference

  • By using what we know about sampling variation, we can make inferences about a variable’s theoretical distribution (i.e. what mean that distribution is likely to have)
  • In this way we can use what we have - data - to learn about what we actually care about - population distributions
  • We have to leverage what we know, and what we have, to make that theoretical inference that we really care about
  • This will echo very strongly once we start talking about causality!

Next Time

  • Not just single variables, but relationships!
  • What are those population relationships?
  • That’s the real juicy stuff