Lecture 7: Exploratory Data Analysis

Nick Huntington-Klein

28 February, 2023

The Purpose of Analysis

  • What are we even doing with data?
  • We want to see what sort of stuff is in the data, so that we know what we should be reporting on
  • This requires us to explore the data
  • Even if we come in with a strong idea of what we want to find, learning about what the data looks like is a good idea

The Purpose of Analysis

We want to:

  • Understand our data (read the docs, too!!!)
  • Detect mistakes
  • Get a sense of what our variables look like
  • Figure out what some of the relationships are

The Purpose of Analysis

Always be on the lookout (in EDA and in your code!) for:

  • Things that are different that shouldn’t be different (“Why does this one person have a height 8x that of anyone else?”)
  • Things that are the same that shouldn’t be the same (“Why is the mean income for Americans and Canadians precisely the same to the 8th decimal place?”)
  • Relationships that are very surprising (“Why is age negatively correlated with height?”)
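
As a rough sketch of what the first two checks could look like in code: the data frame `people`, its values, and the cutoff of 5 MADs are all made up for illustration, not taken from any real dataset.

```r
# Hypothetical toy data: one height looks like it was entered in the wrong unit
people <- data.frame(
  country = c("US", "US", "US", "CA", "CA", "CA"),
  height  = c(170, 165, 1720, 180, 158, 175),
  income  = c(40000, 52000, 61000, 40000, 52000, 61000),
  stringsAsFactors = FALSE
)

# Different that shouldn't be: values far from the median, relative to the MAD
center <- median(people$height)
spread <- mad(people$height)
suspicious <- people[abs(people$height - center) > 5 * spread, ]
print(suspicious)

# Same that shouldn't be: group means that match exactly
income_means <- tapply(people$income, people$country, mean)
identical_means <- length(unique(as.vector(income_means))) == 1
print(identical_means)
```

Neither check proves there is a mistake. They only point at rows and groups worth a closer look.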

Exploring Data

  • Univariate non-graphical
  • Univariate graphical
  • Multivariate non-graphical
  • Multivariate graphical

Univariate Exploratory Analysis

Looking for:

  • What kind of values we have
  • Are there massive outliers? Are there values that look incorrect?
  • What is the distribution? Is there skew? If it’s a factor, does one category dominate?
  • You’ll want to do this with nearly every variable you have
  • Getting every little presentation detail right is less important here, but the graph should still be readable (labels, etc.)
  • On that note, EDA isn’t always concise - some tables/graphs will slip off the edge of these slides

Univariate Exploratory Analysis

  • Think about what features of the variable it makes sense to explore
  • Central tendencies (mean, median, etc.?)
  • Tails? Skew? Percentiles?
  • If it’s a factor, is showing all the categories informative? Do you need to collapse first? Or summarize?
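
A minimal base R sketch of these checks, using the built-in mtcars data. The choice of variables and the cutoff of 5 observations for a "rare" category are arbitrary assumptions for illustration.

```r
# Central tendency and skew: a mean well above the median hints at right skew
hp_mean   <- mean(mtcars$hp)
hp_median <- median(mtcars$hp)
print(c(mean = hp_mean, median = hp_median))

# Tails: percentiles beyond the usual quartiles
hp_pctl <- quantile(mtcars$hp, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
print(hp_pctl)

# Categorical: count the categories, then collapse rare ones into "Other"
carb_counts <- table(mtcars$carb)
rare <- names(carb_counts)[carb_counts < 5]   # cutoff of 5 is arbitrary
carb_collapsed <- ifelse(as.character(mtcars$carb) %in% rare,
                         "Other", as.character(mtcars$carb))
print(table(carb_collapsed))
```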

Univariate Exploratory Analysis Tools

  • summary() shows lots of good info about a variable’s distribution (also works to summarize other objects)
summary(iris$Sepal.Length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   5.100   5.800   5.843   6.400   7.900
summary(iris$Species)
##     setosa versicolor  virginica 
##         50         50         50

Univariate Exploratory Analysis Tools

  • str() shows variable classes (important!) and some values
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
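
One reason classes matter, sketched with a made-up data frame (`survey` and its values are hypothetical): a stray comma is enough to make a numeric column load as text, and str() is where you would notice.

```r
# Hypothetical data: income was stored as text because of a stray comma
survey <- data.frame(income = c("52000", "61,500", "48000"),
                     stringsAsFactors = FALSE)
str(survey)   # income shows up as chr, not num

# Strip the comma, convert, and check the class again
survey$income_num <- as.numeric(gsub(",", "", survey$income))
str(survey)
mean(survey$income_num)
```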

Univariate Exploratory Analysis Tools

  • vtable() in the vtable package is a more readable and flexible version of str(). Set lush = TRUE for more info
library(vtable)
vtable(iris, lush = TRUE)
iris
Name          Class    Values                             Missing  Summary
Sepal.Length  numeric  Num: 4.3 to 7.9                    0        mean: 5.843, sd: 0.828, nuniq: 35
Sepal.Width   numeric  Num: 2 to 4.4                      0        mean: 3.057, sd: 0.436, nuniq: 23
Petal.Length  numeric  Num: 1 to 6.9                      0        mean: 3.758, sd: 1.765, nuniq: 43
Petal.Width   numeric  Num: 0.1 to 2.5                    0        mean: 1.199, sd: 0.762, nuniq: 22
Species       factor   ‘setosa’ ‘versicolor’ ‘virginica’  0        nuniq: 3

Univariate Exploratory Analysis Tools

  • sumtable() in vtable is a full table of summary statistics
sumtable(iris)
Summary Statistics
Variable       N    Mean   Std. Dev.  Min  Pctl. 25  Pctl. 75  Max
Sepal.Length   150  5.843  0.828      4.3  5.1       6.4       7.9
Sepal.Width    150  3.057  0.436      2    2.8       3.3       4.4
Petal.Length   150  3.758  1.765      1    1.6       5.1       6.9
Petal.Width    150  1.199  0.762      0.1  0.3       1.8       2.5
Species        150
… setosa       50   33.3%
… versicolor   50   33.3%
… virginica    50   33.3%

Univariate Exploratory Analysis Tools

  • Moving on to graphical tools! In ggplot, geom_density() and geom_histogram() can both show full distributions (when might this not be useful?)
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length)) + geom_density()
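
For comparison, a sketch of the histogram version, plus one case where a full-distribution plot arguably adds little: a discrete variable with only a few values, where a bar chart of counts is easier to read. The bin count of 20 and the choice of mtcars$cyl are assumptions for illustration.

```r
library(ggplot2)

# Histogram of the same variable; bins = 20 is an arbitrary starting point
p_hist <- ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(bins = 20)

# A discrete variable with three values: plain counts say more than a density
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()

p_hist
p_bar
```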

Univariate Exploratory Analysis Tools

  • Boxplots aren’t for general-audience consumption but they show some easy summary statistics
ggplot(iris, aes(x = Sepal.Length)) + geom_boxplot() + coord_flip()

Univariate Exploratory Analysis

  • Load the gov_transfers data from the causaldata package
  • Read the docs! help(gov_transfers)
  • Explore the variables that are in there on a univariate basis

Multivariate Exploratory Analysis

  • We want to know how variables are related in multivariate analysis
  • Again, fit the analysis to the context. Don’t treat a continuous variable as a factor, and if you treat a discrete variable as continuous, know that a scatterplot won’t work well!
  • Don’t just fire and forget. Think about whether the output is actually informative.
  • More than 2 variables are possible too! Just add more groups/aesthetics.

Multivariate Exploratory Analysis Tools

  • For continuous vs. continuous, nothing wrong with starting with a linear correlation!
  • Bivariate OLS is just a rescaled correlation
cor(iris$Sepal.Length, iris$Sepal.Width)
## [1] -0.1175698
lm(Sepal.Length ~ Sepal.Width, data = iris)
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
## 
## Coefficients:
## (Intercept)  Sepal.Width  
##      6.5262      -0.2234
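
To see the "rescaled correlation" point concretely: the bivariate OLS slope equals the correlation times sd(y) / sd(x). A short check on the same two variables (the object names here are just for illustration):

```r
# Correlation between the two variables
r <- cor(iris$Sepal.Length, iris$Sepal.Width)

# Rescale by the ratio of standard deviations (outcome over predictor)
slope_from_r <- r * sd(iris$Sepal.Length) / sd(iris$Sepal.Width)

# Slope straight from the regression
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)
slope_from_lm <- unname(coef(fit)["Sepal.Width"])

# The two agree up to floating-point error
all.equal(slope_from_r, slope_from_lm)
```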

Multivariate Exploratory Analysis Tools

  • If one variable is discrete or a factor, don’t underestimate plain ol’ group_by() %>% summarize()
  • In fact, many headaches of trying to get R to do something for you can be avoided by just doing group_by() %>% summarize() yourself
  • Don’t forget na.rm = TRUE as appropriate
library(dplyr)
iris %>%
  group_by(Species) %>%
  summarise(mean.SL = mean(Sepal.Length, na.rm = TRUE),
            sd.SL = sd(Sepal.Length, na.rm = TRUE))
## # A tibble: 3 × 3
##   Species    mean.SL sd.SL
##   <fct>        <dbl> <dbl>
## 1 setosa        5.01 0.352
## 2 versicolor    5.94 0.516
## 3 virginica     6.59 0.636

Multivariate Exploratory Analysis Tools

  • sumtable has a group option for the same purpose (note summ can customize the summary functions)
iris %>% sumtable(group = 'Species')
Summary Statistics
Species        setosa              versicolor          virginica
Variable       N   Mean   SD       N   Mean   SD       N   Mean   SD
Sepal.Length   50  5.006  0.352    50  5.936  0.516    50  6.588  0.636
Sepal.Width    50  3.428  0.379    50  2.77   0.314    50  2.974  0.322
Petal.Length   50  1.462  0.174    50  4.26   0.47     50  5.552  0.552
Petal.Width    50  0.246  0.105    50  1.326  0.198    50  2.026  0.275
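
A sketch of how the summ option might be used to swap in different statistics. The particular statistics and column labels are arbitrary choices, and out = 'return' is used here only so the result comes back as a data frame; check help(sumtable) for the exact options in your installed version.

```r
library(vtable)

# Each entry of summ is an expression in x; notNA() is vtable's non-missing count
grouped <- sumtable(iris,
                    group = 'Species',
                    summ = c('notNA(x)', 'median(x)', 'max(x)'),
                    summ.names = c('N', 'Median', 'Max'),
                    out = 'return')
grouped
```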

Multivariate Exploratory Analysis Tools

  • tabyl in janitor is great for discrete vs. discrete; use adorn functions to get percents instead of counts. Watch the denominator! It affects interpretation a LOT
library(janitor)
mtcars %>% tabyl(am, vs) %>%
  adorn_percentages(denominator = 'row') %>%
  adorn_pct_formatting() %>% adorn_title()
##        vs      
##  am     0     1
##   0 63.2% 36.8%
##   1 46.2% 53.8%
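
To illustrate how much the denominator matters, here is the same table computed both ways in base R with prop.table() (in janitor, the counterpart would be changing the denominator argument to 'col'). With row percentages, 63.2% of am = 0 cars have vs = 0. With column percentages, 66.7% of vs = 0 cars have am = 0. These are different statements about the same counts.

```r
counts <- table(am = mtcars$am, vs = mtcars$vs)

# Row denominator: within each am group, what share has each vs value?
row_share <- prop.table(counts, margin = 1)

# Column denominator: within each vs group, what share has each am value?
col_share <- prop.table(counts, margin = 2)

round(100 * row_share, 1)
round(100 * col_share, 1)
```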

Multivariate Exploratory Analysis

  • Moving on to graphs!
  • If one variable is a factor/discrete you can do pretty much any kind of graph you like, but grouped
mtcars %>% group_by(vs, am) %>%
  summarize(mean_mpg = mean(mpg)) %>%
  ggplot(aes(x = vs, fill = factor(am), y = mean_mpg)) + geom_col(position = 'dodge')

Multivariate Exploratory Analysis

  • When it comes to densities, typical geoms look cluttered. I recommend geom_density_ridges in ggridges, or geom_violin (or use facets)
library(ggridges); library(patchwork)
p1 <- ggplot(iris, aes(x = Sepal.Length, y = Species, fill = Species)) + geom_density_ridges()
p2 <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) + geom_violin()
p1 + p2
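
The facet route mentioned above could look something like this: one density panel per species, stacked in a single column so the shared x-axis makes the distributions easy to compare. The single-column layout is just one reasonable choice.

```r
library(ggplot2)

# One density panel per species, stacked so the x-axes line up
p_facet <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_density() +
  facet_wrap(~ Species, ncol = 1)
p_facet
```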

Multivariate Exploratory Analysis

  • For two continuous variables, things can be tricky, but a scatterplot is a good place to start
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
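
A sketch of bringing a third variable into the same scatterplot by mapping it to an aesthetic. Here Species goes to color, with a per-species linear fit added as one possible way to summarize each group.

```r
library(ggplot2)

# Color by species and add a separate linear fit for each group
p_scatter <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE)
p_scatter
```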

Automation

  • As a starting place, ggpairs in GGally will automatically pick univariate and multivariate comparisons to start
GGally::ggpairs(iris)

Multivariate Exploratory Analysis

  • Let’s do some of this in our gov_transfers data!

Going into Detail

  • Everything so far has run with little input from us
  • Yes, we have to figure out which kind of analysis makes sense, but we’re just exploring
  • Going into more detail requires us to have a question to answer, which we can then look into
  • The question points us towards the variables/relationships to study, and how to explore them further
  • For example, we might check trivariate relationships, such as “in this subgroup, how are X and Y related?”
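
A small example of the subgroup question, using iris: the overall correlation between sepal length and width is negative, but within each species it is positive. This is a base R sketch, and the object names are only for illustration.

```r
# Correlation across the whole sample
overall <- cor(iris$Sepal.Length, iris$Sepal.Width)

# The same correlation computed separately within each species
by_species <- sapply(split(iris, iris$Species),
                     function(d) cor(d$Sepal.Length, d$Sepal.Width))

overall
by_species
```

The sign flip is a reminder that a pooled relationship can hide what is going on inside the groups.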

Careful!

Be aware of what EDA does not do:

  • Checking every relationship will tend to uncover non-real relationships by random chance
  • Consider doing EDA on a random subset of the data for this reason, so you can check whether something interesting is actually there in the other/full data and don’t trick yourself
  • EDA hardly ever establishes a causal relationship - avoid causal language unless you can really back it up!
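
A simulated sketch of the first two points. Everything here is made up: the outcome and the 50 predictors are pure noise by construction, and the seed, sample size, and 50/50 split are arbitrary. The idea is to pick the "most interesting" relationship in one half, then see whether it holds up in the half that was not explored.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

n <- 200
outcome <- rnorm(n)
noise <- matrix(rnorm(n * 50), nrow = n)  # 50 predictors unrelated to outcome

# Split rows into an exploration half and an untouched holdout half
explore_rows <- sample(n, n / 2)
holdout_rows <- setdiff(seq_len(n), explore_rows)

# Explore: find the noise variable most correlated with the outcome
explore_cors <- apply(noise[explore_rows, ], 2, cor, y = outcome[explore_rows])
best <- which.max(abs(explore_cors))
explore_cors[best]

# Confirm: check that same variable in the holdout half
holdout_cor <- cor(noise[holdout_rows, best], outcome[holdout_rows])
holdout_cor
```

Because the strongest of 50 chance correlations was selected, the exploration value will typically look more impressive than the holdout value for the same variable.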

Let’s Do More (if there’s time)