Lecture 7: Summarizing Data Part 1

Nick Huntington-Klein

January 29, 2019

Recap

  • We can get data.frames and tibbles by making them with data.frame() or tibble(), or reading in data with data() or read.csv()
  • We can manipulate variables with $
  • We can pick out parts of the data to analyze with filter and select
  • Or iterate over things with a for loop

Today

  • We’re going to talk about different ways of summarizing and describing data
  • We’ll put into perspective and go into more detail on some of the functions we’ve already been using
  • And when certain summary measures should apply
  • As well as how to explore a variable

What is a Distribution?

  • What are we actually doing when we do something like take a mean or a median?
  • We’re trying to say something about the distribution of that variable
  • What’s a distribution?

What is a Distribution?

  • A distribution says how often given values occur when you randomly sample that variable over and over
  • So for example, the distribution of a coin toss is that half the time it gives you H and half the time it gives you T
  • The distribution of the minutes in the day is that it’s equally likely to be any minute from 0:00 to 23:59
  • The distribution of height looks like a bell-curve shape
  • The distribution of income has a lot of people near the bottom and very few with huge values

What is a Distribution?

  • Possibly the best way to show a distribution for a continuous variable is graphically
  • Values run along the x axis, and the y axis shows you how often each value came up

Distribution for categorical variables

  • If the variable is categorical (takes several discrete values, like Heads and Tails) rather than continuous, often the best way to describe its distribution is just to count the number (or fraction) of observations in each category
  • The table() command is super handy for this
table(data)
## data
## Heads Tails 
##   253   247
prop.table(table(data))
## data
## Heads Tails 
## 0.506 0.494
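  • The slide doesn't show where data came from. A minimal sketch, assuming it is a vector of 500 simulated coin flips (the name flips and the seed are made up for this example, so the counts won't match the ones above exactly):

```r
# Hypothetical stand-in for the `data` object used above
set.seed(1000)
flips <- sample(c('Heads', 'Tails'), 500, replace = TRUE)

# Number of observations in each category
table(flips)

# Fraction of observations in each category; the fractions sum to 1
prop.table(table(flips))
```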

What is a Distribution?

  • When the variable in question is continuous, we can’t exactly count the number of times each value comes up
  • So we smooth it out and look at the number of times it falls within a particular range of values
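  • One way to see this in R is to chop a continuous variable into ranges and count what lands in each. This is only a sketch: the heights variable is simulated, and the 10-unit and 5-unit bin widths are arbitrary choices.

```r
set.seed(7)
heights <- rnorm(1000, mean = 170, sd = 10)  # made-up height data

# Bin edges that cover every observation, rounded out to multiples of 10
lo <- floor(min(heights) / 10) * 10
hi <- ceiling(max(heights) / 10) * 10

# Count how many observations fall within each 10-unit range
bins <- cut(heights, breaks = seq(lo, hi, by = 10), include.lowest = TRUE)
table(bins)

# hist() does the same counting; narrower ranges give a finer, noisier picture
# (plot = FALSE returns the counts instead of drawing the histogram)
h <- hist(heights, breaks = seq(lo, hi, by = 5), plot = FALSE)
h$counts
```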

What is a Distribution?

  • When we calculate something like a mean, median, etc., what we are doing is describing the distribution in a condensed way
  • Means and medians are both ways of describing where the center of the distribution is
  • Percentiles describe where other parts not quite in the middle are
  • Standard deviations and variances describe how spread out the distribution is
  • That’s why we call these “summary statistics” - they’re providing a brief summary of what the distribution looks like

Different Summary Measures - the Mean

  • The one we’re most familiar with is the mean - we’ve even used it as an example before
  • The mean can be calculated by multiplying each value by the proportion of times it comes up, and adding it all together
  • Or in R, mean(x)
x <- c(1,2,2,3,4)
mean(x)
## [1] 2.4
1*(1/5)+2*(2/5)+3*(1/5)+4*(1/5)
## [1] 2.4
1*(1/5)+2*(1/5)+2*(1/5)+3*(1/5)+4*(1/5)
## [1] 2.4

The Mean

  • Nice things about the mean:
    • Easy to understand
    • The mean of x-mean(x) is 0
    • Good statistical properties
    • Makes sense with large or small samples, with discrete or continuous variables
    • Represents the “betting average” of the variable
  • Not so nice:
    • Sensitive to outliers - mean(c(1,2,3)) is 2, but mean(c(1,2,1001)) is 334.6666667
    • Sometimes easy to forget the rest of the distribution (the mean describes the distribution, but it doesn’t describe EVERYTHING about the distribution!)
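  • A quick check of two of these properties, using the same x as before:

```r
x <- c(1, 2, 2, 3, 4)

# Deviations from the mean average out to zero
# (possibly off by a tiny floating-point error)
mean(x - mean(x))

# One extreme value drags the mean a long way
mean(c(1, 2, 3))     # 2
mean(c(1, 2, 1001))  # 334.6667
```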

The Median

  • The median is where you line up all the observations from smallest to largest and pick the one in the middle
  • If there’s an even number of observations, take the mean of the two middle
x <- c(3,1,4,2,2)
median(x)
## [1] 2
#Middle position when the number of observations is odd
sort(x)[(length(x)+1)/2]
## [1] 2
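  • The even-number rule can be sketched by hand too. manual_median is a made-up helper for illustration; in practice just use median().

```r
# Hypothetical helper: median by sorting and picking the middle
manual_median <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    # Odd number of observations: take the single middle one
    s[(n + 1) / 2]
  } else {
    # Even number of observations: average the two middle ones
    mean(s[c(n / 2, n / 2 + 1)])
  }
}

manual_median(c(3, 1, 4, 2, 2))     # 2
manual_median(c(3, 1, 4, 2, 2, 6))  # sorted: 1 2 2 3 4 6, so mean of 2 and 3 = 2.5
```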

The Median

  • Nice things about the median:
    • Super easy to calculate (you can often do it by hand)
    • Represents the “typical” observation
    • Not sensitive to outliers - median(c(1,2,3)) is 2, and median(c(1,2,1001)) is 2
    • Generally carries through order-preserving transformations of the data (e.g. the median of log(x) is the log of the median of x)
  • Not so nice:
    • Insensitive to outliers means it can ignore real changes in the “tails”
    • Can ignore magnitudes generally
    • May be highly sensitive if there are big gaps between observations
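  • A few toy vectors illustrating these points. The numbers are made up, and log() is just one example of a transformation that keeps the observations in the same order.

```r
x <- c(1, 2, 3, 4, 100)
median(x)  # 3

# Transforming with log() and then taking the median lands in the same place
# as taking the median and then transforming
median(log(x))
log(median(x))

# A huge change in the tail is ignored entirely
median(c(1, 2, 3, 4, 100000))  # still 3

# With a big gap in the data, changing one observation can move the median a lot
median(c(1, 1, 10, 10, 10))  # 10
median(c(1, 1, 1, 10, 10))   # 1
```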

Mean and Median Together

Percentiles

  • A percentile is just like a median
  • Except that you don’t necessarily pick the MIDDLE
  • Line ’em up, and pick the (percentile)th person
  • Use the quantile() function, and list the percentiles you want
  • Percentiles can fully describe the distribution if you use enough!
quantile(c(0,1,2,3,4,5),c(.4,.5,1))
##  40%  50% 100% 
##  2.0  2.5  5.0
median(c(0,1,2,3,4,5))
## [1] 2.5

Percentiles

  • Note that exactly 10% of the observations are between each set of lines
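  • The same idea can be checked without a graph. This sketch uses simulated data (not the data behind the slide's figure): cut a variable at its deciles and count what falls between each pair of cutoffs.

```r
set.seed(42)
x <- rnorm(1000)  # made-up continuous variable

# The 0th, 10th, 20th, ..., 100th percentiles
deciles <- quantile(x, seq(0, 1, by = 0.1))
deciles

# Count the observations between each neighboring pair of deciles;
# with 1000 observations, each group holds 100 of them (10%)
counts <- table(cut(x, breaks = deciles, include.lowest = TRUE))
counts
```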

Min and Max

  • Also useful are the minimum and maximum of the variable
  • (a.k.a. the 0th and 100th percentiles)
  • Show you the range of values that the variable takes in the data
  • min() and max() work here, no surprises!
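  • A small check on a toy vector that min() and max() agree with the 0th and 100th percentiles:

```r
x <- c(3, 1, 4, 2, 2)

min(x)    # 1
max(x)    # 4
range(x)  # both at once: 1 4

# The same two numbers as percentiles
quantile(x, c(0, 1))
```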

Standard deviation and variance

  • These are standard ways of understanding how much the data varies around the mean
  • Variance = Standard deviation squared
  • The higher these values, the less good a description the mean is of the variable
  • and the more noise around the mean!

Standard deviation and variance

  • Start with data and subtract out the mean
  • The result is the residuals (left-over part, unexplained part)
  • Square the residuals
  • Average them, then multiply by N/(N-1), to get the variance
  • Square root of the variance is the standard deviation
  • Why this process rather than some other measure around the mean (i.e. why square it)? Good statistical reasons I promise

Standard deviation and variance

data <- c(1,1,1,1,2)
data <- data - mean(data)
data
## [1] -0.2 -0.2 -0.2 -0.2  0.8
#Variance, sd
c((5/4)*mean(data^2),var(c(1,1,1,1,2)),
  sqrt((5/4)*mean(data^2)),sd(c(1,1,1,1,2)))
## [1] 0.2000000 0.2000000 0.4472136 0.4472136
data2 <- c(100,0,-30,50,80)
data2 <- data2 - mean(data2)
#Variance, sd
c((5/4)*mean(data2^2),var(c(100,0,-30,50,80)),
  sqrt((5/4)*mean(data2^2)),sd(c(100,0,-30,50,80)))
## [1] 2950.0000 2950.0000   54.3139   54.3139
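  • One piece of intuition for the squaring step (not the full statistical argument): without it, the positive and negative residuals cancel out. The names values and resid are just for this sketch.

```r
values <- c(100, 0, -30, 50, 80)
resid <- values - mean(values)
resid  # 60 -40 -70 10 40

# Averaging the raw residuals tells us nothing: they always cancel to 0
mean(resid)       # 0

# Squaring makes everything positive before averaging
mean(resid^2)     # 2360; times 5/4 gives the 2950 from var()

# Absolute values would also avoid the cancelling, but that is a different measure
mean(abs(resid))  # 44
```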

Standard deviation and variance

  • Graphically, SD and variance tell you how “wide” the distribution is
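  • A numeric version of "wide" vs. "narrow", using two simulated variables with the same mean. The names narrow and wide, and the standard deviations 1 and 3, are arbitrary choices for this sketch.

```r
set.seed(123)
narrow <- rnorm(10000, mean = 0, sd = 1)
wide   <- rnorm(10000, mean = 0, sd = 3)

# The estimated SDs should come out close to 1 and 3
c(sd(narrow), sd(wide))

# Variance is the SD squared
c(var(narrow), var(wide))

# Share of observations within 2 units of the mean:
# much higher for the narrow distribution
c(mean(abs(narrow) < 2), mean(abs(wide) < 2))
```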

Summary statistics table

  • Something we will often want to do is display a bunch of summary statistics at once for the variables we have
  • This makes it easy to understand a variable’s distribution at a glance
  • We’ll be using the stargazer command for this
## 
## ===================================================================
## Statistic N    Mean    St. Dev.  Min   Pctl(25) Pctl(75)     Max   
## -------------------------------------------------------------------
## sr        50   9.671    4.480   0.600   6.970    12.617    21.100  
## pop15     50  35.090    9.152   21.440  26.215   44.065    47.640  
## pop75     50   2.293    1.291   0.560   1.125     3.325     4.700  
## dpi       50 1,106.758 990.869  88.940 288.207  1,795.622 4,001.890
## ddpi      50   3.758    2.870   0.220   2.002     4.477    16.710  
## -------------------------------------------------------------------

Packages

  • Like tidyverse, Stargazer isn’t a part of base R. It’s in a package, so we’ll need to install it
  • We can install packages using install.packages('nameofpackage')
install.packages('stargazer')
  • We can then check whether it’s installed in the Packages tab
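  • You can also check from the console instead of the Packages tab. is_installed is a made-up helper name; requireNamespace() is the base R function doing the work.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

is_installed('stats')  # TRUE: stats ships with R

# Remind ourselves to install stargazer only if it's missing
if (!is_installed('stargazer')) {
  message("stargazer not found; run install.packages('stargazer')")
}
```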

Stargazer

  • All we have to do once we’ve loaded stargazer is drop a data frame into it and it will give us basic summary statistics for all the variables in the data frame
  • (use select first if you don’t want all the variables)
data(LifeCycleSavings)
library(stargazer)
stargazer(LifeCycleSavings,type='text')
## 
## ===================================================================
## Statistic N    Mean    St. Dev.  Min   Pctl(25) Pctl(75)     Max   
## -------------------------------------------------------------------
## sr        50   9.671    4.480   0.600   6.970    12.617    21.100  
## pop15     50  35.090    9.152   21.440  26.215   44.065    47.640  
## pop75     50   2.293    1.291   0.560   1.125     3.325     4.700  
## dpi       50 1,106.758 990.869  88.940 288.207  1,795.622 4,001.890
## ddpi      50   3.758    2.870   0.220   2.002     4.477    16.710  
## -------------------------------------------------------------------

Stargazer

  • See help(stargazer) for other summary stats you may want to include, like the median or IQR (75th percentile minus 25th). There are many, many other options too
  • type='text' tells it to give us a basic text table.
  • Another handy one is type='html', especially if we want to output our table to a file
  • out='filename' will save our results
  • Note that, if desired, you can open up the HTML table and copy/paste it into Excel or Word
data(LifeCycleSavings)
library(stargazer)
stargazer(LifeCycleSavings,type='html',out='summarytable.html')
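  • A sketch of adding the median and IQR to the table. The median= and iqr= arguments are my reading of help(stargazer), so double-check them there. The base R lines give numbers to compare the new columns against for pop15.

```r
data(LifeCycleSavings)

# Base R versions, to compare against the table
pop15_median <- median(LifeCycleSavings$pop15)
pop15_iqr <- IQR(LifeCycleSavings$pop15)  # 75th percentile minus 25th
c(pop15_median, pop15_iqr)

# Only try the table if stargazer is installed
if (requireNamespace('stargazer', quietly = TRUE)) {
  stargazer::stargazer(LifeCycleSavings, type = 'text',
                       median = TRUE, iqr = TRUE)
}
```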

Note about Tibbles

  • Stargazer doesn’t do summary stats for tibbles, so if you have a tibble, just run it through as.data.frame() first
library(tidyverse)
tibbleLCS <- as_tibble(LifeCycleSavings)
stargazer(tibbleLCS,type='text')
## 
## ===================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## ===================================================
stargazer(as.data.frame(tibbleLCS),type='text')
## 
## ===================================================================
## Statistic N    Mean    St. Dev.  Min   Pctl(25) Pctl(75)     Max   
## -------------------------------------------------------------------
## sr        50   9.671    4.480   0.600   6.970    12.617    21.100  
## pop15     50  35.090    9.152   21.440  26.215   44.065    47.640  
## pop75     50   2.293    1.291   0.560   1.125     3.325     4.700  
## dpi       50 1,106.758 990.869  88.940 288.207  1,795.622 4,001.890
## ddpi      50   3.758    2.870   0.220   2.002     4.477    16.710  
## -------------------------------------------------------------------

Practice

  • Install and load in stargazer
  • Use data(LifeCycleSavings) to get the Life Cycle Savings data, and use help() and str() to look at it
  • Use stargazer() to get a text table of summary statistics for all the variables EXCEPT ddpi
  • Now make an HTML table for all the variables. Open the file and look at it in a browser.
  • For each of the statistics that the stargazer() table gives you, plus the median, calculate that statistic on your own for the pop15 variable using the appropriate R function
  • Calculate the max, min, and median in two ways - using their own respective functions, and as percentiles.

Practice answers

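#Install (only needed once) and load stargazer
install.packages('stargazer')
library(stargazer)
#select() comes from the tidyverse
library(tidyverse)
data(LifeCycleSavings)
help(LifeCycleSavings)
str(LifeCycleSavings)
#Text table for everything except ddpi
stargazer(select(LifeCycleSavings,-ddpi),type='text')
#HTML table for all the variables
stargazer(LifeCycleSavings,type='html',out='table.html')
LS <- LifeCycleSavings
#N, mean, SD, and min; then the 0/25/50/75/100th percentiles
#(min, Pctl(25), median, Pctl(75), max); then max and median from their own functions
c(length(LS$pop15),mean(LS$pop15),sd(LS$pop15),min(LS$pop15),
  quantile(LS$pop15,c(0,.25,.5,.75,1)),max(LS$pop15),median(LS$pop15))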