Lecture 2: Understanding Data

Nick Huntington-Klein

December 1, 2018

What’s the Point?

What are we actually trying to DO when we use data?

Contrary to popular opinion, the point isn't to make pretty graphs, make a point, or justify something you've done.

Those may be nice side effects!

Uncovering Truths

The cleanest way to think about data analysis is to remember that data comes from somewhere

There was some process that generated that data

Our goal, in all data analysis, is to get some idea of what that process is

Example

  • Imagine a basic coin flip
  • Every time we flip, we get heads half the time and tails half the time
  • The TRUE process that generates the data is that there’s a coin that’s heads half the time and tails half the time
  • If we analyze the data correctly, we should report back that the coin is heads half the time
  • Let’s try calculating the proportion of heads

Example

#Generate 500 heads and tails
data <- sample(c("Heads","Tails"),500,replace=TRUE)
#Calculate the proportion of heads
mean(data=="Heads")
## [1] 0.454

Example

  • Let’s try out that code in R a few times and see what happens
  • First, what do we want to happen? What should we see if our data analysis method is good?

How Good Was It?

  • Our data analysis consistently told us that the coin was generating heads about half the time - the true process!
  • Our data analysis lets us conclude the coin is fair
  • That is describing the true data generating process pretty well!
  • Let’s think - what other approaches could we have taken? What would the pros and cons be?
    • Counting the heads instead of taking the proportion?
    • Taking the mean and adding .1?
    • Just saying it’s 50%?
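As a sketch (my own illustration, not code from the lecture), we can try each alternative on one simulated sample of 500 flips and see how it compares to taking the proportion:

```r
#One simulated sample of 500 flips
data <- sample(c("Heads","Tails"),500,replace=TRUE)
#Counting heads: the answer depends on how many flips we did
sum(data=="Heads")
#Taking the mean and adding .1: systematically too high (biased)
mean(data=="Heads")+.1
#Just saying it's 50%: right for this coin, but ignores the data entirely
.5
#The proportion: tracks the true process, and gets better with more data
mean(data=="Heads")
```

The proportion is the only approach here that both uses the data and consistently lands near the true process.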

Another Example

  • People have different amounts of money in their wallet, from 0 to 10
  • We flip a coin and, if it’s heads, give them a dollar
  • What’s the data generating process here?
  • What should our data analysis uncover?

Another Example

#Load dplyr for the pipe, mutate, group_by, and summarize below
library(dplyr)
#Generate 1000 wallets and 1000 heads and tails
data <- data.frame(wallets=sample(0:10,1000,replace=TRUE))
data$coin <- sample(c("Heads","Tails"),1000,replace=TRUE)
#Give a dollar whenever it's a heads, then get average money by coin
data <- data %>% mutate(wallets = wallets + (coin=="Heads"))
data %>% group_by(coin) %>% summarize(wallets = mean(wallets))
## # A tibble: 2 x 2
##   coin  wallets
##   <chr>   <dbl>
## 1 Heads    5.93
## 2 Tails    4.84

Conclusions

  • What does our data analysis tell us?
  • We observe a difference of 1.09 between the heads and tails
  • Because we know that nothing would have caused this difference other than the coin flip, we can conclude that the coin flip is why there’s a difference of 1.09
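We can pull that difference out directly rather than eyeballing the table. A sketch, assuming the data and dplyr pipeline from the previous slide are still loaded:

```r
#Get average money by coin, then subtract Tails from Heads
means <- data %>% group_by(coin) %>% summarize(wallets = mean(wallets))
means$wallets[means$coin=="Heads"] - means$wallets[means$coin=="Tails"]
```

The exact number will wobble around 1 from sample to sample, since wallets started out random, but the coin flip is the only thing that could create a systematic gap.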

But What If?

  • So far we’ve been cheating
  • We know exactly what process generated that data
  • So really, our data analysis doesn’t matter
  • But what if we don’t know that?
  • We want to make sure our method is good, so that when we draw a conclusion from our data, it’s the right one

Example 3

  • We’re economists! So no big surprise we might be interested in demand curves
  • Demand curve says that as \(P\) goes up, \(Q_d\) goes down
  • Let’s gather data on \(P\) and \(Q_d\)
  • And determine things like the slope of demand, demand elasticity, etc.
  • Just so happens I have some data on 1879-1923 US food export prices and quantities!

Example 3

We can calculate the correlation between \(P\) and \(Q\):

cor(foodPQ$FoodPrice,foodPQ$FoodQuantity)
## [1] 0.2970008

Example 3 Conclusions

  • We observe a POSITIVE correlation of 0.3
  • But demand curves shouldn’t slope upwards… huh?
  • Does demand really slope up? Does this tell us about the process that generated this data? Why or why not?
  • Why do we see the data we do? Let’s try to think of some reasons.

Getting Difficult

  • We need to be more careful to figure out what’s actually going on
  • Plus, the more we know about the context and the underlying model, the more likely it is that we won’t miss something important

Getting Difficult

  • Our fake examples were easy because we knew perfectly where the data came from
  • But how much do you know about food prices in turn-of-the-century US?
  • Or for that matter how prices are set in the food industry?
  • There’s more work to do in uncovering the process that made the data

For the Record

  • Likely, one of the big problems with our food analysis was that we forgot to account for DEMAND shifting around, showing us what SUPPLY looks like. Supply does slope up!
  • Just because we loudly announced that we wanted the demand curve doesn’t force the data to give it to us!
  • Let’s imagine the process that might have generated this data by starting with the model itself and seeing how we can generate this data

But…

  • If we can figure out what methods work well when we do know the right answer
  • And apply them when we don’t…
  • We can figure out what those processes are
  • And that’s the goal!
  • This is what we’ll be figuring out how to do technically during the programming part of the class, and conceptually during causal inference

R

Finish us out

  • Now that we're all prepared, let's get a jump on homework
  • Go to the New York Times' The Upshot or FiveThirtyEight and find an article that uses data
  • See Homework 1
  • Together let’s start thinking about answers to these questions