We want to:
Always be on the lookout (in EDA and in your code!)
Looking for:
summary() shows lots of good info about a variable’s distribution (also works to summarize other objects)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
## setosa versicolor virginica
## 50 50 50
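
The two outputs above are consistent with calling summary() on a numeric column and on a factor column of the built-in iris data. The exact calls are not shown here, so the following is a sketch of what they were likely to be, not a record of the original code.

```r
# Numeric vector: min, quartiles, median, mean, max
num_summary <- summary(iris$Sepal.Length)
print(num_summary)

# Factor: number of observations in each level
fct_summary <- summary(iris$Species)
print(fct_summary)

# Whole data frame: one summary per column
summary(iris)
```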
str() shows variable classes (important!) and some values
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
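
A small base-R sketch of why variable classes matter. The vector x is a made-up example (not from the slides) of numbers that were read in as text.

```r
# Compact overview: dimensions, column classes, first few values
str(iris)

# Just the classes, one per column
col_classes <- sapply(iris, class)
print(col_classes)

# Hypothetical example: numbers stored as text are not numeric
x <- c("1", "2", "3")
class(x)                      # "character"
x_mean <- mean(as.numeric(x)) # convert before doing arithmetic
print(x_mean)
```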
vtable() in the vtable package is a more readable and flexible version of that. Set lush = TRUE for more info.

| Name | Class | Values | Missing | Summary |
|---|---|---|---|---|
| Sepal.Length | numeric | Num: 4.3 to 7.9 | 0 | mean: 5.843, sd: 0.828, nuniq: 35 |
| Sepal.Width | numeric | Num: 2 to 4.4 | 0 | mean: 3.057, sd: 0.436, nuniq: 23 |
| Petal.Length | numeric | Num: 1 to 6.9 | 0 | mean: 3.758, sd: 1.765, nuniq: 43 |
| Petal.Width | numeric | Num: 0.1 to 2.5 | 0 | mean: 1.199, sd: 0.762, nuniq: 22 |
| Species | factor | ‘setosa’ ‘versicolor’ ‘virginica’ | 0 | nuniq: 3 |
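
A table like the one above can probably be produced as follows. This sketch assumes the vtable package is installed. It uses out = "return" so the result comes back as a data frame; leaving out the out argument displays the table instead.

```r
library(vtable)

# Basic variable table: name, class, and values for each variable
vt_basic <- vtable(iris, out = "return")

# lush = TRUE adds the missing-value count and a short summary column
vt_lush <- vtable(iris, lush = TRUE, out = "return")
print(vt_lush)
```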
sumtable() in vtable is a full table of summary statistics

| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| Sepal.Length | 150 | 5.843 | 0.828 | 4.3 | 5.1 | 6.4 | 7.9 |
| Sepal.Width | 150 | 3.057 | 0.436 | 2 | 2.8 | 3.3 | 4.4 |
| Petal.Length | 150 | 3.758 | 1.765 | 1 | 1.6 | 5.1 | 6.9 |
| Petal.Width | 150 | 1.199 | 0.762 | 0.1 | 0.3 | 1.8 | 2.5 |
| Species | 150 | | | | | | |
| … setosa | 50 | 33.3% | | | | | |
| … versicolor | 50 | 33.3% | | | | | |
| … virginica | 50 | 33.3% | | | | | |
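
A minimal sketch of the call behind a table like this, assuming the vtable package is installed. Again, out = "return" is used only so the result can be inspected as a data frame.

```r
library(vtable)

# Summary statistics for every variable in the data
st <- sumtable(iris, out = "return")
print(st)

# vars restricts the table to the variables you care about
st_sepal <- sumtable(iris, vars = c("Sepal.Length", "Sepal.Width"),
                     out = "return")
print(st_sepal)
```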
geom_density() and geom_histogram() can both show full distributions (when might this not be useful?)
gov_transfers data from the causaldata package
help(gov_transfers)
## [1] -0.1175698
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
##
## Coefficients:
## (Intercept) Sepal.Width
## 6.5262 -0.2234
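
The two outputs above match a correlation and a simple regression of Sepal.Length on Sepal.Width in iris. The lm() call appears in the output itself; the cor() call is inferred from the value, so treat it as an assumption.

```r
# Correlation between two continuous variables
r <- cor(iris$Sepal.Length, iris$Sepal.Width)
print(r)

# Simple linear regression: the slope has the same sign as the correlation
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)
print(fit)

# Intercept and slope as a named numeric vector
b <- coef(fit)
print(b)
```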
group_by() %>% summarize()
group_by() %>% summarize() yourself
na.rm = TRUE as appropriate
library(dplyr)
iris %>%
  group_by(Species) %>%
  summarise(mean.SL = mean(Sepal.Length, na.rm = TRUE),
            sd.SL = sd(Sepal.Length, na.rm = TRUE))
## # A tibble: 3 × 3
## Species mean.SL sd.SL
## <fct> <dbl> <dbl>
## 1 setosa 5.01 0.352
## 2 versicolor 5.94 0.516
## 3 virginica 6.59 0.636
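
A short base-R illustration of why na.rm = TRUE matters. The vector x is a made-up example, since iris has no missing values.

```r
# Hypothetical vector with one missing value
x <- c(5.1, NA, 4.7)

m_default <- mean(x)               # NA: a single missing value propagates
m_narm    <- mean(x, na.rm = TRUE) # mean of the non-missing values only

print(m_default)
print(m_narm)

# Worth checking how many values were dropped before trusting the result
n_missing <- sum(is.na(x))
print(n_missing)
```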
sumtable has a group option for the same purpose (note summ can customize the summary functions)

| Species | setosa | | | versicolor | | | virginica | | |
|---|---|---|---|---|---|---|---|---|---|
| Variable | N | Mean | SD | N | Mean | SD | N | Mean | SD |
| Sepal.Length | 50 | 5.006 | 0.352 | 50 | 5.936 | 0.516 | 50 | 6.588 | 0.636 |
| Sepal.Width | 50 | 3.428 | 0.379 | 50 | 2.77 | 0.314 | 50 | 2.974 | 0.322 |
| Petal.Length | 50 | 1.462 | 0.174 | 50 | 4.26 | 0.47 | 50 | 5.552 | 0.552 |
| Petal.Width | 50 | 0.246 | 0.105 | 50 | 1.326 | 0.198 | 50 | 2.026 | 0.275 |
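
A sketch of the grouped call, assuming the vtable package is installed. The particular functions passed to summ in the second call are an illustrative choice, not something taken from the slides.

```r
library(vtable)

# One block of columns per level of Species
by_species <- sumtable(iris, group = "Species", out = "return")
print(by_species)

# summ takes summary functions written as strings in terms of x
by_species_custom <- sumtable(iris, group = "Species",
                              summ = c("notNA(x)", "median(x)", "max(x)"),
                              out = "return")
print(by_species_custom)
```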
tabyl in janitor is great for discrete vs. discrete; use adorn functions to get percents instead of counts. Watch the denominator! It affects interpretation a LOT
library(janitor)
mtcars %>% tabyl(am, vs) %>%
  adorn_percentages(denominator = 'row') %>%
  adorn_pct_formatting() %>% adorn_title()
## vs
## am 0 1
## 0 63.2% 36.8%
## 1 46.2% 53.8%
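
To see how much the denominator changes the reading, here is a base-R cross-check using table() and prop.table() rather than the janitor functions above. In janitor, the column version should correspond to adorn_percentages(denominator = 'col').

```r
counts <- table(am = mtcars$am, vs = mtcars$vs)

# Row percentages: within each am group, the shares of vs (rows sum to 100)
row_pct <- round(100 * prop.table(counts, margin = 1), 1)

# Column percentages: within each vs group, the shares of am (columns sum to 100)
col_pct <- round(100 * prop.table(counts, margin = 2), 1)

print(row_pct)
print(col_pct)
```

The same cell answers two different questions: among am = 0 cars, 63.2% have vs = 0, while among vs = 0 cars, 66.7% have am = 0.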
mtcars %>% group_by(vs, am) %>%
  summarize(mean_mpg = mean(mpg)) %>%
  ggplot(aes(x = vs, fill = factor(am), y = mean_mpg)) + geom_col(position = 'dodge')
geom_density_ridges in ggridges, or geom_violin (or use facets)
library(ggridges); library(patchwork)
p1 <- ggplot(iris, aes(x = Sepal.Length, y = Species, fill = Species)) + geom_density_ridges()
p2 <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) + geom_violin()
p1 + p2
ggpairs in GGally will automatically pick univariate and multivariate comparisons to start
gov_transfers data!
Be aware of what EDA does not do: