Lecture 7: Exploratory Data Analysis

Nick Huntington-Klein

28 February, 2023

The Purpose of Analysis

What are we even doing with data?
We want to see what sorts of things are in the data, so that we know what we should be reporting on
This requires us to explore the data
Even if we come in with a strong idea of what we want to find, learning about what the data looks like is a good idea

The Purpose of Analysis

We want to:

Understand our data (read the docs, too!!!)
Detect mistakes
Get a sense of what our variables look like
Figure out what some of the relationships are

The Purpose of Analysis

Always be on the lookout (in EDA and in your code!)

Things that are different that shouldn’t be different (“why does this one person have a height 8x that of anyone else?”)
Things that are the same that shouldn’t be the same (“Why is the mean income for Americans and Canadians exactly the same to the 8th decimal place?”)
Relationships that are very surprising (“Why is age negatively correlated with height?”)
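A minimal sketch of the first two checks. The `heights` and `income` values are invented for illustration, and the cutoff of 5 robust deviations is an arbitrary assumption rather than a rule:

```r
# Made-up example data (hypothetical, for illustration only)
heights <- c(165, 172, 158, 180, 1400)   # one value is suspiciously large

# Distance from the median, in units of the median absolute deviation (MAD).
# The median and MAD are much less affected by an outlier than the mean and sd.
robust_z <- abs(heights - median(heights)) / mad(heights)
suspicious <- heights[robust_z > 5]      # the cutoff of 5 is an arbitrary choice
suspicious

# "Too similar" check: are two group means equal up to floating-point tolerance?
income <- data.frame(country = c("US", "US", "CA", "CA"),
                     income  = c(40000, 60000, 45000, 55000))
group_means <- tapply(income$income, income$country, mean)
same_means <- isTRUE(all.equal(group_means[["US"]], group_means[["CA"]]))
same_means
```

A flagged value is a reason to go look at the data and the docs, not proof of an error.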

Exploring Data

Univariate non-graphical
Univariate graphical
Multivariate non-graphical
Multivariate graphical

Univariate Exploratory Analysis

Looking for:

What kind of values we have
Are there massive outliers? Are there values that look incorrect?
What is the distribution? Is there skew? If it’s a factor, does one category dominate?
You’ll want to do this with nearly every variable you have
Getting every little presentation detail right is less important in EDA, but the graph should still be readable (labels, etc.). On that note, EDA isn’t always concise: some tables/graphs will slip off the edge of these slides.

Univariate Exploratory Analysis

Think about what features of the variable it makes sense to explore
Central tendencies (mean, median, etc.?)
Tails? Skew? Percentiles?
If it’s a factor, is showing all the categories informative? Do you need to collapse first? Or summarize?
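As a rough sketch of these features in base R (the particular percentiles and the use of `mtcars$cyl` as the categorical variable are just illustrative choices):

```r
x <- iris$Sepal.Length

# Central tendency: a mean far from the median hints at skew
c(mean = mean(x), median = median(x))

# Tails: percentiles out at the extremes
tail_pctiles <- quantile(x, probs = c(0.01, 0.05, 0.95, 0.99))
tail_pctiles

# Categorical variable: shares show whether one category dominates
cyl_shares <- prop.table(table(mtcars$cyl))
round(cyl_shares, 3)
```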

Univariate Exploratory Analysis Tools

summary() shows lots of good info about a variable’s distribution (also works to summarize other objects)

summary(iris$Sepal.Length)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   5.100   5.800   5.843   6.400   7.900

summary(iris$Species)

##     setosa versicolor  virginica 
##         50         50         50

Univariate Exploratory Analysis Tools

str() shows variable classes (important!) and some values

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Univariate Exploratory Analysis Tools

vtable() in vtable is a more readable and flexible version of str(). Set lush = TRUE for more info

library(vtable)
vtable(iris, lush = TRUE)

iris
Name	Class	Values	Summary
Sepal.Length	numeric	Num: 4.3 to 7.9	mean: 5.843, sd: 0.828, nuniq: 35
Sepal.Width	numeric	Num: 2 to 4.4	mean: 3.057, sd: 0.436, nuniq: 23
Petal.Length	numeric	Num: 1 to 6.9	mean: 3.758, sd: 1.765, nuniq: 43
Petal.Width	numeric	Num: 0.1 to 2.5	mean: 1.199, sd: 0.762, nuniq: 22
Species	factor	‘setosa’ ‘versicolor’ ‘virginica’	nuniq: 3

Univariate Exploratory Analysis Tools

sumtable() in vtable is a full table of summary statistics

sumtable(iris)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
Sepal.Length	150	5.843	0.828	4.3	5.1	6.4	7.9
Sepal.Width	150	3.057	0.436	2	2.8	3.3	4.4
Petal.Length	150	3.758	1.765	1	1.6	5.1	6.9
Petal.Width	150	1.199	0.762	0.1	0.3	1.8	2.5
Species	150
… setosa	50	33.3%
… versicolor	50	33.3%
… virginica	50	33.3%

Univariate Exploratory Analysis Tools

Moving on to graphical tools! In ggplot, geom_density() and geom_histogram() can both show full distributions (when might this not be useful?)

library(ggplot2)
ggplot(iris, aes(x = Sepal.Length)) + geom_density()
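The histogram version looks much the same. This is only a sketch: the `binwidth` of 0.25 is an arbitrary choice, and it is worth trying a few values since the picture can change with the bins.

```r
library(ggplot2)

# Histogram of the same variable; binwidth = 0.25 is an arbitrary choice
p_hist <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.25) +
  labs(x = "Sepal length (cm)", y = "Count")
p_hist
```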

Univariate Exploratory Analysis Tools

Boxplots aren’t for general-audience consumption but they show some easy summary statistics

ggplot(iris, aes(x = Sepal.Length)) + geom_boxplot() + coord_flip()
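To see roughly which numbers the boxplot is drawing, here is a sketch that computes them by hand, assuming the common convention that the box spans the quartiles and that points more than 1.5 times the interquartile range beyond the box are drawn as outliers:

```r
x <- iris$Sepal.Length

# Box: first quartile, median, third quartile
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Whiskers reach the most extreme points inside these fences
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Anything beyond the fences would be plotted as an individual point
outliers <- x[x < lower_fence | x > upper_fence]
q
outliers
```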

Univariate Exploratory Analysis

Load the gov_transfers data from the causaldata package
Read the docs! help(gov_transfers)
Explore the variables that are in there on a univariate basis
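One possible way to get started, assuming the causaldata package is installed. This sticks to base R's `str()` and `summary()`; `vtable()` or `sumtable()` from earlier could be swapped in.

```r
library(causaldata)

# Load the data set into the workspace
data("gov_transfers", package = "causaldata")

# Variable classes and a few values
str(gov_transfers)

# Distribution summaries for every variable
summary(gov_transfers)
```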

Multivariate Exploratory Analysis

In multivariate analysis, we want to know how variables are related
Again, fit the analysis to the context. Don’t treat a continuous variable as a factor, and if you treat a discrete variable as continuous, know that a scatterplot won’t work well, since the points pile up on top of each other.
Don’t just fire and forget. Think about whether the output is actually informative.
More than two variables are possible too! Just add more groups/aesthetics.
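For instance, a third variable can go in as a color aesthetic. A small sketch using iris:

```r
library(ggplot2)

# Two continuous variables on the axes, a factor mapped to color
p_three <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()
p_three
```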

Multivariate Exploratory Analysis Tools

For continuous vs. continuous, nothing wrong with starting with a linear correlation!
Bivariate OLS is just a rescaled correlation

cor(iris$Sepal.Length, iris$Sepal.Width)

## [1] -0.1175698

lm(Sepal.Length ~ Sepal.Width, data = iris)

## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
## 
## Coefficients:
## (Intercept)  Sepal.Width  
##      6.5262      -0.2234
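To check the "rescaled correlation" claim: the bivariate OLS slope equals the correlation times sd(y)/sd(x). A quick sketch with the same two variables:

```r
# Correlation between the outcome and the predictor
r <- cor(iris$Sepal.Length, iris$Sepal.Width)

# Rescale by sd(outcome) / sd(predictor)
slope_from_cor <- r * sd(iris$Sepal.Length) / sd(iris$Sepal.Width)

# Slope straight from the regression
slope_from_lm <- coef(lm(Sepal.Length ~ Sepal.Width, data = iris))[["Sepal.Width"]]

c(slope_from_cor = slope_from_cor, slope_from_lm = slope_from_lm)
```

The two numbers should agree up to floating-point error.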

Multivariate Exploratory Analysis Tools

If one variable is discrete or a factor, don’t underestimate plain ol’ group_by() %>% summarize()
In fact, many headaches of trying to get R to do something for you can be avoided by just doing group_by() %>% summarize() yourself
Don’t forget na.rm = TRUE as appropriate

library(dplyr)
iris %>%
  group_by(Species) %>%
  summarise(mean.SL = mean(Sepal.Length, na.rm = TRUE),
            sd.SL = sd(Sepal.Length, na.rm = TRUE))

## # A tibble: 3 × 3
##   Species    mean.SL sd.SL
##   <fct>        <dbl> <dbl>
## 1 setosa        5.01 0.352
## 2 versicolor    5.94 0.516
## 3 virginica     6.59 0.636

Multivariate Exploratory Analysis Tools

sumtable() has a group option for the same purpose (note that the summ option can customize the summary functions)

iris %>% sumtable(group = 'Species')

Summary Statistics
Species	setosa			versicolor			virginica
Variable	N	Mean	SD	N	Mean	SD	N	Mean	SD
Sepal.Length	50	5.006	0.352	50	5.936	0.516	50	6.588	0.636
Sepal.Width	50	3.428	0.379	50	2.77	0.314	50	2.974	0.322
Petal.Length	50	1.462	0.174	50	4.26	0.47	50	5.552	0.552
Petal.Width	50	0.246	0.105	50	1.326	0.198	50	2.026	0.275
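A sketch of customizing the statistics, assuming `summ` takes a character vector of expressions in `x` and `summ.names` supplies the column labels, as the vtable documentation describes. The choice of statistics here is arbitrary, and `out = 'return'` is used so the table comes back as a data frame.

```r
library(vtable)

# Count, median, and max by species instead of the default statistics
grouped_stats <- sumtable(iris, group = 'Species',
                          summ = c('notNA(x)', 'median(x)', 'max(x)'),
                          summ.names = c('N', 'Median', 'Max'),
                          out = 'return')
grouped_stats
```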

Multivariate Exploratory Analysis Tools

tabyl in janitor is great for discrete vs. discrete; use adorn functions to get percents instead of counts. Watch the denominator! It affects interpretation a LOT

library(janitor)
mtcars %>% tabyl(am, vs) %>%
  adorn_percentages(denominator = 'row') %>%
  adorn_pct_formatting() %>% adorn_title()

##        vs      
##  am     0     1
##   0 63.2% 36.8%
##   1 46.2% 53.8%
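To see how much the denominator matters, here is a base-R sketch of the same table with both row and column denominators (`prop.table()` is used here as a stand-in for the janitor pipeline):

```r
counts <- table(am = mtcars$am, vs = mtcars$vs)

# Row denominator: within each value of am, how is vs split?
row_shares <- prop.table(counts, margin = 1)

# Column denominator: within each value of vs, how is am split?
col_shares <- prop.table(counts, margin = 2)

round(100 * row_shares, 1)
round(100 * col_shares, 1)
```

Same counts, but the two tables answer different questions, so decide which one you are asking before reading off a percentage.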

Multivariate Exploratory Analysis

Moving on to graphs!
If one variable is a factor/discrete you can do pretty much any kind of graph you like, but grouped

mtcars %>% group_by(vs, am) %>%
  summarize(mean_mpg = mean(mpg)) %>%
  ggplot(aes(x = vs, fill = factor(am), y = mean_mpg)) + geom_col(position = 'dodge')

Multivariate Exploratory Analysis

When it comes to densities, typical geoms look cluttered. I recommend geom_density_ridges in ggridges, or geom_violin (or use facets)

library(ggridges); library(patchwork)
p1 <- ggplot(iris, aes(x = Sepal.Length, y = Species, fill = Species)) + geom_density_ridges()
p2 <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) + geom_violin()
p1 + p2
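The facet option mentioned above might look like this. It is a sketch; stacking the panels in one column is just one layout choice that keeps the x-axes aligned for comparison.

```r
library(ggplot2)

# One density per species, stacked vertically on a shared x-axis
p_facet <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_density() +
  facet_wrap(~ Species, ncol = 1)
p_facet
```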

Multivariate Exploratory Analysis

For two continuous variables, things can be tricky, but a scatterplot is a good place to start

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()

Automation

As a starting place, ggpairs in GGally will automatically pick univariate and multivariate comparisons to start

GGally::ggpairs(iris)

Multivariate Exploratory Analysis

Let’s do some of this in our gov_transfers data!

Going into Detail

So far, this has all been running with little input from us
Yes, we have to figure out which kind of analysis makes sense, but we’re just exploring
Going into more detail requires us to have a question to answer, which we can then look into
The question points us toward the variables/relationships to study, and how to explore them further
For example, we might check trivariate relationships, such as “in this subgroup, how are X and Y related?”
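A small sketch of that kind of subgroup question, using iris: how are sepal length and sepal width related overall, and how are they related within each species?

```r
# Relationship in the full data
overall <- cor(iris$Sepal.Length, iris$Sepal.Width)

# The same relationship within each subgroup
by_species <- sapply(split(iris, iris$Species),
                     function(d) cor(d$Sepal.Length, d$Sepal.Width))

overall
by_species
```

The overall correlation is negative, while the within-species correlations are positive, which is exactly the sort of thing a subgroup check is meant to catch.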

Careful!

Be aware of what EDA does not do:

Checking every relationship will tend to uncover non-real relationships by random chance
Consider doing EDA on a random subset of the data for this reason, so you can check whether something interesting is actually there in the other/full data and don’t trick yourself
EDA hardly ever establishes a causal relationship - avoid causal language unless you can really back it up!
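One way to set up the random-subset idea, as a sketch. The 50/50 split, the seed, and the variables being compared are all arbitrary choices for illustration.

```r
set.seed(1234)  # arbitrary seed, so the split is reproducible

# Randomly pick half of the rows to explore freely
n <- nrow(iris)
explore_rows <- sample(n, size = floor(n / 2))
explore_data <- iris[explore_rows, ]
holdout_data <- iris[-explore_rows, ]

# Find something interesting in the exploration half...
cor(explore_data$Petal.Length, explore_data$Petal.Width)

# ...then check whether it is still there in the half you have not looked at
cor(holdout_data$Petal.Length, holdout_data$Petal.Width)
```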

Let’s Do More (if there’s time)

https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Clothing.csv
Data description: https://rdrr.io/rforge/Stat2Data/man/Clothing.html
Explore! What do we find?