We want to:
Always be on the lookout (in EDA and in your code!)
Looking for:
summary() shows lots of good info about a variable’s distribution (also works to summarize other objects)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   4.300   5.100   5.800   5.843   6.400   7.900
##     setosa versicolor  virginica
##         50         50         50
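The two outputs above look like summary() applied to a numeric column and to a factor. The calls themselves are not shown here, so the following is only a sketch, assuming the built-in iris data (the numbers match Sepal.Length and Species):

```r
# Assumption: the outputs above come from the built-in iris data
data(iris)

# Numeric vector: min, quartiles, median, mean, max
sl_summary <- summary(iris$Sepal.Length)
print(sl_summary)

# Factor: number of observations at each level
sp_summary <- summary(iris$Species)
print(sp_summary)

# summary() also works on other objects, e.g. a whole data frame
print(summary(iris))
```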
str() shows variable classes (important!) and some values
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
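Since classes are the important part, it can help to pull them out directly and to check them in your own code. A small sketch, again assuming the iris data; the stopifnot() check is an illustration, not something from the original slides:

```r
data(iris)

# Structure of the whole data frame: classes plus the first few values
str(iris)

# Just the class of each column, as a named character vector
var_classes <- sapply(iris, class)
print(var_classes)

# Illustrative guard: fail early if a column is not the class you expect
stopifnot(is.numeric(iris$Sepal.Length), is.factor(iris$Species))
```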
vtable() in vtable is a more readable and flexible version of that. lush = TRUE for more info

Name | Class | Values | Missing | Summary |
---|---|---|---|---|
Sepal.Length | numeric | Num: 4.3 to 7.9 | 0 | mean: 5.843, sd: 0.828, nuniq: 35 |
Sepal.Width | numeric | Num: 2 to 4.4 | 0 | mean: 3.057, sd: 0.436, nuniq: 23 |
Petal.Length | numeric | Num: 1 to 6.9 | 0 | mean: 3.758, sd: 1.765, nuniq: 43 |
Petal.Width | numeric | Num: 0.1 to 2.5 | 0 | mean: 1.199, sd: 0.762, nuniq: 22 |
Species | factor | ‘setosa’ ‘versicolor’ ‘virginica’ | 0 | nuniq: 3 |
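A table like the one above can be produced along these lines. This is a sketch assuming the iris data and an installed vtable package; out = 'return' is used here only so the result comes back as a data frame instead of opening in the Viewer:

```r
library(vtable)
data(iris)

# In an interactive session, vtable(iris, lush = TRUE) opens the table
# in the Viewer pane. Here we ask for a plain data frame instead.
iris_vt <- vtable(iris, lush = TRUE, out = 'return')
print(iris_vt)
```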
sumtable() in vtable is a full table of summary statistics

Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
---|---|---|---|---|---|---|---|
Sepal.Length | 150 | 5.843 | 0.828 | 4.3 | 5.1 | 6.4 | 7.9 |
Sepal.Width | 150 | 3.057 | 0.436 | 2 | 2.8 | 3.3 | 4.4 |
Petal.Length | 150 | 3.758 | 1.765 | 1 | 1.6 | 5.1 | 6.9 |
Petal.Width | 150 | 1.199 | 0.762 | 0.1 | 0.3 | 1.8 | 2.5 |
Species | 150 | ||||||
… setosa | 50 | 33.3% | |||||
… versicolor | 50 | 33.3% | |||||
… virginica | 50 | 33.3% | |||||
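The summary statistics table above can be sketched as follows, assuming the iris data. The variable names iris_st and iris_st_sepal are made up for this example, and the vars argument is shown only as one way to narrow the table:

```r
library(vtable)
data(iris)

# Full summary statistics table, returned as a data frame
iris_st <- sumtable(iris, out = 'return')
print(iris_st)

# Restrict the table to a couple of variables of interest
iris_st_sepal <- sumtable(iris,
                          vars = c('Sepal.Length', 'Sepal.Width'),
                          out = 'return')
print(iris_st_sepal)
```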
geom_density() and geom_histogram() can both show full distributions (when might this not be useful?)
gov_transfers data from the causaldata package
help(gov_transfers)
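Going back to geom_density() and geom_histogram(): a minimal sketch of both, using iris rather than gov_transfers so as not to give away the exercise. The choice of Sepal.Length and of 20 bins is arbitrary:

```r
library(ggplot2)
data(iris)

# Smoothed estimate of the distribution
p_density <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_density()

# Binned counts; the picture can change a lot with the number of bins
p_hist <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 20)

print(p_density)
print(p_hist)
```

Trying a few different values of bins is a quick way to see how much of the shape comes from the data and how much from the binning.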
## [1] -0.1175698
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
##
## Coefficients:
## (Intercept) Sepal.Width
## 6.5262 -0.2234
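The calls behind the two outputs above are not shown. The numbers are consistent with a correlation and a simple regression of Sepal.Length on Sepal.Width in iris, so a sketch under that assumption:

```r
data(iris)

# Linear correlation between two continuous variables
r <- cor(iris$Sepal.Length, iris$Sepal.Width)
print(r)

# Simple regression: the slope has the same sign as the correlation
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)
print(fit)
```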
group_by() %>% summarize() yourself, with na.rm = TRUE as appropriate
library(dplyr)  # provides %>%, group_by(), and summarise()
iris %>%
  group_by(Species) %>%
  summarise(mean.SL = mean(Sepal.Length, na.rm = TRUE),
            sd.SL = sd(Sepal.Length, na.rm = TRUE))
## # A tibble: 3 × 3
##   Species    mean.SL sd.SL
##   <fct>        <dbl> <dbl>
## 1 setosa        5.01 0.352
## 2 versicolor    5.94 0.516
## 3 virginica     6.59 0.636
sumtable has a group option for the same purpose (note summ can customize the summary functions)

Species | setosa | | | versicolor | | | virginica | | |
---|---|---|---|---|---|---|---|---|---|
Variable | N | Mean | SD | N | Mean | SD | N | Mean | SD |
Sepal.Length | 50 | 5.006 | 0.352 | 50 | 5.936 | 0.516 | 50 | 6.588 | 0.636 |
Sepal.Width | 50 | 3.428 | 0.379 | 50 | 2.77 | 0.314 | 50 | 2.974 | 0.322 |
Petal.Length | 50 | 1.462 | 0.174 | 50 | 4.26 | 0.47 | 50 | 5.552 | 0.552 |
Petal.Width | 50 | 0.246 | 0.105 | 50 | 1.326 | 0.198 | 50 | 2.026 | 0.275 |
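A grouped table like the one above, plus one with customized summary functions, might be produced as follows. This is a sketch assuming the iris data; the particular functions in summ and the labels in summ.names are illustrative choices, not the ones from the slides:

```r
library(vtable)
data(iris)

# One set of N / Mean / SD columns per species
by_species <- sumtable(iris, group = 'Species', out = 'return')
print(by_species)

# Same grouping, but with different summary functions (illustrative)
by_species_custom <- sumtable(iris, group = 'Species',
                              summ = c('notNA(x)', 'median(x)', 'max(x)'),
                              summ.names = c('N', 'Median', 'Max'),
                              out = 'return')
print(by_species_custom)
```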
tabyl in janitor is great for discrete vs. discrete; use adorn functions to get percents instead of counts. Watch the denominator! It affects interpretation a LOT.
library(janitor)
mtcars %>% tabyl(am, vs) %>%
  adorn_percentages(denominator = 'row') %>%
  adorn_pct_formatting() %>% adorn_title()
##     vs
##  am     0     1
##   0 63.2% 36.8%
##   1 46.2% 53.8%
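To see why the denominator matters, compare row and column percentages for the same table. A sketch using the same mtcars example; the object names row_pct and col_pct are made up here:

```r
library(janitor)  # also provides the %>% pipe
data(mtcars)

# Row denominator: within each value of am, what share has each value of vs?
row_pct <- mtcars %>% tabyl(am, vs) %>%
  adorn_percentages(denominator = 'row')

# Column denominator: within each value of vs, what share has each value of am?
col_pct <- mtcars %>% tabyl(am, vs) %>%
  adorn_percentages(denominator = 'col')

print(adorn_pct_formatting(row_pct))
print(adorn_pct_formatting(col_pct))
```

The top-left cell is the same 12 cars in both tables, but it reads as 12 of the 19 cars with am = 0 in the first and 12 of the 18 cars with vs = 0 in the second, which are answers to two different questions.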
geom_density_ridges in ggridges, or geom_violin (or use facets)
ggpairs in GGally will automatically pick univariate and multivariate comparisons to start
gov_transfers data!
Be aware of what EDA does not do: