We want to:
Always be on the lookout (in EDA and in your code!)
Looking for:
summary() shows lots of good info about a variable’s distribution (also works to summarize other objects)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   5.100   5.800   5.843   6.400   7.900
##     setosa versicolor  virginica 
##         50         50         50

str() shows variable classes (important!) and some values

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

vtable() in vtable is a more readable and flexible version of that. lush = TRUE for more info

| Name | Class | Values | Missing | Summary | 
|---|---|---|---|---|
| Sepal.Length | numeric | Num: 4.3 to 7.9 | 0 | mean: 5.843, sd: 0.828, nuniq: 35 | 
| Sepal.Width | numeric | Num: 2 to 4.4 | 0 | mean: 3.057, sd: 0.436, nuniq: 23 | 
| Petal.Length | numeric | Num: 1 to 6.9 | 0 | mean: 3.758, sd: 1.765, nuniq: 43 | 
| Petal.Width | numeric | Num: 0.1 to 2.5 | 0 | mean: 1.199, sd: 0.762, nuniq: 22 | 
| Species | factor | ‘setosa’ ‘versicolor’ ‘virginica’ | 0 | nuniq: 3 | 
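
The outputs above can be reproduced along these lines. This is a minimal sketch: it assumes the built-in iris data (which the printed values match) and, for the last step, that the vtable package is installed. `out = "return"` is used here only so the table comes back as a data frame rather than opening in the viewer.

```r
# Base R: min, quartiles, median, and mean for a numeric variable
summary(iris$Sepal.Length)

# For a factor, summary() returns the count in each level
summary(iris$Species)

# Variable classes plus the first few values of each column
str(iris)

# vtable() from the vtable package; lush = TRUE switches on the
# Values, Missing, and Summary columns in one go.
# Guarded so the base R lines above still run without the package.
if (requireNamespace("vtable", quietly = TRUE)) {
  vt <- vtable::vtable(iris, lush = TRUE, out = "return")
  print(vt)
}
```

Checking the class column first is worth the few seconds: a number stored as character or a category stored as numeric changes what every later summary means.
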
sumtable() in vtable is a full table of summary statistics

| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max | 
|---|---|---|---|---|---|---|---|
| Sepal.Length | 150 | 5.843 | 0.828 | 4.3 | 5.1 | 6.4 | 7.9 | 
| Sepal.Width | 150 | 3.057 | 0.436 | 2 | 2.8 | 3.3 | 4.4 | 
| Petal.Length | 150 | 3.758 | 1.765 | 1 | 1.6 | 5.1 | 6.9 | 
| Petal.Width | 150 | 1.199 | 0.762 | 0.1 | 0.3 | 1.8 | 2.5 | 
| Species | 150 | | | | | | |
| … setosa | 50 | 33.3% | | | | | |
| … versicolor | 50 | 33.3% | | | | | |
| … virginica | 50 | 33.3% | | | | | |
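
A table like the one above could come from a call such as the following. This is a sketch that assumes vtable is installed; the hand-computed `manual_row` is a hypothetical name added here just to show what one row of the table contains.

```r
# sumtable() with default settings; out = "return" hands back a data frame
# instead of opening the viewer. Guarded in case vtable is not installed.
if (requireNamespace("vtable", quietly = TRUE)) {
  st_iris <- vtable::sumtable(iris, out = "return")
  print(st_iris)
}

# The same numbers for one variable, computed by hand in base R
manual_row <- c(
  N    = sum(!is.na(iris$Sepal.Length)),
  Mean = mean(iris$Sepal.Length, na.rm = TRUE),
  SD   = sd(iris$Sepal.Length, na.rm = TRUE)
)
round(manual_row, 3)
```
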
geom_density() and geom_histogram() can both show full distributions (when might this not be useful?)

gov_transfers data from the causaldata package

help(gov_transfers)

## [1] -0.1175698
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
## 
## Coefficients:
## (Intercept)  Sepal.Width  
##      6.5262      -0.2234

group_by() %>% summarize()

group_by() %>% summarize() yourself

na.rm = TRUE as appropriate

iris %>%
  group_by(Species) %>%
  summarise(mean.SL = mean(Sepal.Length, na.rm = TRUE),
            sd.SL = sd(Sepal.Length, na.rm = TRUE))

## # A tibble: 3 × 3
##   Species    mean.SL sd.SL
##   <fct>        <dbl> <dbl>
## 1 setosa        5.01 0.352
## 2 versicolor    5.94 0.516
## 3 virginica     6.59 0.636

sumtable has a group option for the same purpose (note summ can customize the summary functions)

| Species | setosa | | | versicolor | | | virginica | | |
|---|---|---|---|---|---|---|---|---|---|
| Variable | N | Mean | SD | N | Mean | SD | N | Mean | SD | 
| Sepal.Length | 50 | 5.006 | 0.352 | 50 | 5.936 | 0.516 | 50 | 6.588 | 0.636 | 
| Sepal.Width | 50 | 3.428 | 0.379 | 50 | 2.77 | 0.314 | 50 | 2.974 | 0.322 | 
| Petal.Length | 50 | 1.462 | 0.174 | 50 | 4.26 | 0.47 | 50 | 5.552 | 0.552 | 
| Petal.Width | 50 | 0.246 | 0.105 | 50 | 1.326 | 0.198 | 50 | 2.026 | 0.275 | 
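
The relationship summaries in this part can be sketched as follows. The call behind the printed correlation is not shown on the slide, so `cor()` on Sepal.Length and Sepal.Width is an assumption (it does give -0.1175698). The `lm()` formula is taken from the printed output. The `summ` vector passed to sumtable is an illustrative choice, not the one used for the table above, and the vtable step assumes the package is installed.

```r
# Continuous vs. continuous: correlation and a simple regression
cor(iris$Sepal.Length, iris$Sepal.Width)
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)
print(fit)

# Discrete vs. continuous in base R: mean and sd within each group
grp <- aggregate(
  Sepal.Length ~ Species, data = iris,
  FUN = function(x) c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
)
print(grp)

# Same idea with sumtable(): group splits the columns by Species,
# summ swaps in whichever summary functions you want.
if (requireNamespace("vtable", quietly = TRUE)) {
  print(vtable::sumtable(iris, group = "Species",
                         summ = c("mean(x)", "median(x)", "sd(x)"),
                         out = "return"))
}
```
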
tabyl in janitor is great for discrete
vs. discrete; use adorn functions to get percents instead
of counts. Watch the denominator! It affects interpretation a LOT

library(janitor)
mtcars %>% tabyl(am, vs) %>%
  adorn_percentages(denominator = 'row') %>%
  adorn_pct_formatting() %>% adorn_title()

##        vs      
##  am     0     1
##   0 63.2% 36.8%
##   1 46.2% 53.8%

mtcars %>% group_by(vs, am) %>%
  summarize(mean_mpg = mean(mpg)) %>%
  ggplot(aes(x = vs, fill = factor(am), y = mean_mpg)) + geom_col(position = 'dodge')

geom_density_ridges in ggridges, or geom_violin (or use facets)

library(ggridges); library(patchwork)
p1 <- ggplot(iris, aes(x = Sepal.Length, y = Species, fill = Species)) + geom_density_ridges()
p2 <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) + geom_violin()
p1 + p2

ggpairs in GGally will automatically pick univariate and multivariate comparisons to start

gov_transfers data!

Be aware of what EDA does not do: