Lecture 7: Summarizing Data Part 1

Nick Huntington-Klein

January 29, 2019

Recap

  • We can get data.frames and tibbles by making them with data.frame() or tibble(), or by reading in data with data() or read.csv()
  • We can manipulate variables with $
  • We can pick out parts of the data to analyze with filter and select
  • Or iterate over things with a for loop

Today

  • We’re going to talk about different ways of summarizing and describing data
  • We’ll put some of the functions we’ve already been using into perspective, and go into more detail on them
  • And when certain summary measures should apply
  • As well as how to explore a variable

What is a Distribution?

  • What are we actually doing when we do something like take a mean or a median?
  • We’re trying to say something about the distribution of that variable
  • What’s a distribution?

What is a Distribution?

  • A distribution says how often given values occur when you randomly sample that variable over and over
  • So for example, the distribution of a coin toss is that half the time it gives you H and half the time it gives you T
  • The distribution of the minutes in the day is that it’s equally likely to be any minute from 0:00 to 23:59
  • The distribution of height looks like a bell-curve shape
  • The distribution of income has a lot of people near the bottom and very few with huge values
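
To make "randomly sample that variable over and over" concrete, here is a minimal sketch that simulates the four examples above. The seed, the sample size, and every distribution parameter (a mean height of 170 cm, a log-normal shape for income, and so on) are assumptions chosen for illustration, not values from the lecture.

```r
set.seed(7)  # arbitrary seed so the draws are reproducible
n <- 10000   # number of repeated draws (illustrative choice)

# Coin toss: H and T each come up about half the time
coin <- sample(c("H", "T"), size = n, replace = TRUE)
coin_share <- mean(coin == "H")

# Minute of the day: every minute from 0 (0:00) to 1439 (23:59) is equally likely
minute <- sample(0:1439, size = n, replace = TRUE)
minute_range <- range(minute)

# Height: a bell curve; mean 170 cm and sd 10 cm are made-up values
height <- rnorm(n, mean = 170, sd = 10)

# Income: many small values and a few huge ones; the log-normal shape
# and its parameters are assumptions for illustration
income <- rlnorm(n, meanlog = 10.5, sdlog = 1)

coin_share       # close to 0.5
minute_range     # stays within 0 and 1439
summary(height)  # mean and median nearly equal
summary(income)  # mean pulled well above the median by the huge values
```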

What is a Distribution?

  • Possibly the best way to show a distribution for a continuous variable is graphically
  • Values run along the x axis, and the y axis shows you how often each value came up
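
One common way to draw this kind of picture in base R is a histogram. The sketch below uses simulated heights; the data and the axis labels are assumptions for illustration.

```r
set.seed(7)
height <- rnorm(1000, mean = 170, sd = 10)  # made-up heights in cm

# Histogram: x axis = values, y axis = how many observations land in each bin
h <- hist(height, main = "Distribution of height", xlab = "Height (cm)")

# hist() also returns the bin edges and counts it used to draw the plot
h$breaks
h$counts
```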

Distribution for categorical variables

  • If the variable is categorical (takes several discrete values, like Heads and Tails) rather than continuous, often the best way to describe its distribution is just to count the number (or fraction) of observations in each category
  • The table() command is super handy for this
table(data)
## data
## Heads Tails 
##   253   247
prop.table(table(data))
## data
## Heads Tails 
## 0.506 0.494
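
The slide does not show how `data` was built. As a sketch, something like the following would produce a comparable vector; the seed and the use of sample() are assumptions, so the exact counts will differ from the ones printed above.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Hypothetical stand-in for `data`: 500 simulated fair coin flips
data <- sample(c("Heads", "Tails"), size = 500, replace = TRUE)

counts <- table(data)         # number of observations in each category
shares <- prop.table(counts)  # fraction of observations in each category

counts
shares
```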

What is a Distribution?

  • When the variable in question is continuous, we can’t exactly count the number of times each value comes up
  • So we smooth it out and look at the number of times it falls within a particular range of values
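
A rough sketch of that idea: cut() groups a continuous variable into ranges that table() can count, and density() smooths the result into a curve. The simulated heights and the 10 cm bin width are assumptions for illustration.

```r
set.seed(7)
height <- rnorm(1000, mean = 170, sd = 10)  # made-up heights in cm

# Almost every value is unique, so counting exact values tells us little
length(unique(height))

# Count how many observations fall in each 10 cm range instead
bins <- cut(height, breaks = seq(120, 220, by = 10))
bin_counts <- table(bins)
bin_counts

# density() smooths those counts into a continuous curve
d <- density(height)
plot(d, main = "Smoothed distribution of height", xlab = "Height (cm)")
```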

What is a Distribution?

  • When we calculate something like a mean, median, etc., what we are doing is describing the distribution in a condensed way
  • Means and medians are both ways of describing where the center of the distribution is
  • Percentiles describe where other parts not quite in the middle are
  • Standard deviations and variances describe how spread out the distribution is
  • That’s why we call these “summary statistics” - they’re providing a brief summary of what the distribution looks like
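
Each of these summaries has a base R function. A short sketch on simulated, right-skewed data (the log-normal incomes are made up for illustration):

```r
set.seed(7)
income <- rlnorm(1000, meanlog = 10.5, sdlog = 1)  # made-up incomes

# Center of the distribution
mean(income)
median(income)

# Other parts of the distribution: 10th, 50th, and 90th percentiles
pct <- quantile(income, probs = c(0.1, 0.5, 0.9))
pct

# Spread of the distribution
sd(income)
var(income)  # the variance is the standard deviation squared
```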

Different Summary Measures - the Mean

  • The one we’re most familiar with is the mean - we’ve even used it as an example before
  • The mean can be calculated by multiplying each value by the proportion of times it comes up, and adding it all together
  • Or in R, mean(x)
x <- c(1,2,2,3,4)
mean(x)
## [1] 2.4
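# Weight each distinct value by the proportion of times it comes up
1*(1/5)+2*(2/5)+3*(1/5)+4*(1/5)
## [1] 2.4
# Equivalently, weight each observation by 1/n
1*(1/5)+2*(1/5)+2*(1/5)+3*(1/5)+4*(1/5)
## [1] 2.4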
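
The "multiply each value by its proportion" recipe can also be done without typing the fractions by hand. This is a sketch that reuses table() and prop.table(); the variable names are just for illustration.

```r
x <- c(1, 2, 2, 3, 4)

# Proportion of times each distinct value comes up
props <- prop.table(table(x))

# The distinct values themselves (table() stores them as character names)
values <- as.numeric(names(props))

# Multiply each value by its proportion and add it all together
weighted_sum <- sum(values * props)
weighted_sum
mean(x)
```

Both of the last two lines should give the same number.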

The Mean

  • Nice things about the mean:
    • Easy to understand
    • The mean of x-mean(x) is 0
    • Good statistical properties
    • Makes sense with large or small samples, with discrete or continuous variables
    • Represents the “betting average” of the variable
  • Not so nice:
    • Sensitive to outliers - mean(c(1,2,3)) is 2, but mean(c(1,2,1001)) is 334.6666667
    • Sometimes easy to forget the rest of the distribution (the mean describes the distribution, but it doesn’t describe EVERYTHING about the distribution!)
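
A few of these properties are easy to check directly. The vectors `a` and `b` below are made-up examples of two different distributions that share a mean.

```r
x <- c(1, 2, 3)
x_outlier <- c(1, 2, 1001)

# Deviations from the mean average out to zero
# (up to floating-point rounding)
mean(x_outlier - mean(x_outlier))

# One extreme value drags the mean a long way
mean(x)
mean(x_outlier)

# Same mean, very different spread: the mean alone
# does not describe everything about the distribution
a <- c(49, 50, 51)
b <- c(0, 50, 100)
mean(a) == mean(b)
sd(a)
sd(b)
```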

The Median

  • The median is where you line up all the observations from smallest to largest and pick the one in the middle
  • If there’s an even number of observations, take the mean of the two middle observations
x <- c(3,1,4,2,2)
median(x)
## [1] 2
sort(x)[ceiling(length(x)/2)]  # middle position when length(x) is odd
## [1] 2
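
The one-line version above only handles an odd number of observations. Here is a sketch that covers both cases; `manual_median()` is a hypothetical helper written for illustration, not a base R function.

```r
# manual_median() is a hypothetical helper, not part of base R
manual_median <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    # Odd number of observations: take the single middle one
    s[(n + 1) / 2]
  } else {
    # Even number of observations: average the two middle ones
    mean(s[c(n / 2, n / 2 + 1)])
  }
}

odd_result  <- manual_median(c(3, 1, 4, 2, 2))      # odd length
even_result <- manual_median(c(3, 1, 4, 2, 2, 10))  # even length
odd_result
even_result
```

In practice median() does all of this already; the helper is only there to show the steps.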

The Median

  • Nice things about the median:
    • Super easy to calculate (you can often do it by hand)
    • Represents the “typical” observation
    • Not sensitive to outliers - median(c(1,2,3)) is 2, and median(c(1,2,1001)) is 2
    • Plays well with transformations that preserve order - with an odd number of observations, the median of log(x) is the log of the median of x
  • Not so nice:
    • Insensitive to outliers means it can ignore real changes in the “tails”
    • Can ignore magnitudes generally
    • May be highly sensitive if there are big gaps between observations
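
These points can be seen with a few small vectors. All of the numbers below are made up for illustration.

```r
# One extreme value does not move the median
median(c(1, 2, 3))
median(c(1, 2, 1001))

# An order-preserving transformation such as log() carries the median along
x <- c(1, 10, 1000)
median(log(x))
log(median(x))

# Magnitudes can be ignored: the top value grows 100-fold, the median stays put
median(c(1, 2, 10))
median(c(1, 2, 1000))

# Big gaps: moving one observation across the gap makes the median jump
low_heavy  <- median(c(1, 2, 3, 100, 101))
high_heavy <- median(c(1, 2, 99, 100, 101))
low_heavy
high_heavy
```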

Mean and Median Together