- We can get `data.frame`s and `tibble`s by making them with `data.frame()` or `tibble()`, or reading in data with `data()` or `read.csv()`
- We can manipulate variables with `$`
- We can pick out parts of the data to analyze with `filter` and `select`
- Or iterate over things with a `for` loop

- We’re going to talk about different ways of summarizing and describing data
- We’ll put into perspective and go into more detail on some of the functions we’ve already been using
- And when certain summary measures should apply
- As well as how to explore a variable

- What are we actually doing when we do something like take a mean or a median?
- We’re trying to say something about the *distribution* of that variable
- What’s a distribution?

- A distribution says *how often* given values occur when you randomly sample that variable over and over
- So for example, the *distribution* of a coin toss is that half the time it gives you H and half the time it gives you T
- The *distribution* of the minutes in the day is that it’s equally likely to be any minute from 0:00 to 23:59
- The *distribution* of height looks like a bell-curve shape
- The *distribution* of income has a lot of people near the bottom and very few with huge values
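To make "randomly sample that variable over and over" concrete, here is a minimal sketch of the coin-toss example. The seed, the number of tosses, and the variable names are arbitrary choices for illustration, not anything from the slides.

```r
# Simulate a fair coin; the seed is arbitrary and only makes the run repeatable
set.seed(1000)
tosses <- sample(c("H", "T"), size = 10000, replace = TRUE)

# Share of tosses that came up each way; both should land close to 0.5
share_heads <- mean(tosses == "H")
share_tails <- mean(tosses == "T")
share_heads
share_tails
```

The more tosses you simulate, the closer these shares should get to the half-and-half distribution described above.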

- Possibly the best way to show a distribution for a continuous variable is graphically
- Values run along the x axis, and the y axis shows you how often each value came up

- If the variable is categorical (takes several discrete values, like Heads and Tails) rather than continuous, often the best way to describe its distribution is just to count the number (or fraction) of observations in each category
- The `table()` command is super handy for this

`table(data)`

```
## data
## Heads Tails
## 253 247
```

`prop.table(table(data))`

```
## data
## Heads Tails
## 0.506 0.494
```

- When the variable in question is *continuous*, we can’t exactly count the number of times *each* value comes up
- So we smooth it out and look at the number of times it falls within a particular range of values
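One way to picture this smoothing is to cut the variable into ranges and count how many observations land in each. The sketch below uses a made-up variable, `heights`, drawn from a normal distribution; the bin width and the variable itself are assumptions for illustration.

```r
# Hypothetical continuous variable: 1000 draws from a normal distribution
set.seed(1000)
heights <- rnorm(1000, mean = 170, sd = 10)

# Count how many observations fall in each 10-unit-wide range
bins <- cut(heights, breaks = seq(100, 240, by = 10))
bin_counts <- table(bins)
bin_counts

# density() produces a smoothed curve built on the same idea
smoothed <- density(heights)
```

Calling `hist(heights)` or `plot(smoothed)` would draw these as a histogram or a density curve, with values on the x axis and how often they occur on the y axis.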

- When we calculate something like a mean, median, etc., what we are doing is *describing the distribution* in a condensed way
- Means and medians are both ways of describing where the *center* of the distribution is
- Percentiles describe where other parts not quite in the middle are
- Standard deviations and variances describe how *spread out* the distribution is
- That’s why we call these “summary statistics”: they’re providing a brief *summary* of what the distribution looks like
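As a quick illustration, each of these summaries has a base R function. The vector below is made up so that the numbers come out round.

```r
# A small made-up sample
x <- c(2, 4, 4, 5, 10)

# Center of the distribution
center_mean <- mean(x)      # 5
center_median <- median(x)  # 4

# Other parts of the distribution: 25th and 75th percentiles
# (4 and 5 under quantile()'s default method)
quartiles <- quantile(x, c(0.25, 0.75))

# Spread of the distribution
spread_var <- var(x)  # 9
spread_sd <- sd(x)    # 3
```

Note that `quantile()` supports several calculation methods through its `type` argument, so percentiles from other software may differ slightly in small samples.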

- The one we’re most familiar with is the *mean*; we’ve even used it as an example before
- The mean can be calculated by multiplying each value by the proportion of times it comes up, and adding it all together
- Or in R, `mean(x)`

```
x <- c(1,2,2,3,4)
mean(x)
```

`## [1] 2.4`

```
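# Each distinct value times the proportion of times it comes up
1*(1/5)+2*(2/5)+3*(1/5)+4*(1/5)
# Equivalently, every single observation weighted by 1/5
1*(1/5)+2*(1/5)+2*(1/5)+3*(1/5)+4*(1/5)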
```

`## [1] 2.4`
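The "value times proportion" calculation can also be done without typing the fractions by hand. This is just one possible sketch, reusing `table()` and `prop.table()` from earlier; the names `props`, `values`, and `weighted` are made up for illustration.

```r
x <- c(1, 2, 2, 3, 4)

# Proportion of times each distinct value comes up
props <- prop.table(table(x))

# The distinct values are stored as the names of the table
values <- as.numeric(names(props))

# Multiply each value by its proportion and add it all together
weighted <- sum(values * as.numeric(props))
weighted
mean(x)
```

Both lines should print the same number, since this is the same calculation `mean()` is doing.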

- Nice things about the mean:
- Easy to understand
- The mean of `x-mean(x)` is 0
- Good statistical properties
- Makes sense with large or small samples, with discrete or continuous variables
- Represents the “betting average” of the variable
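The claim that `x-mean(x)` has a mean of 0 is easy to check on the example vector. A small sketch, where `deviations` is a name introduced here for illustration:

```r
x <- c(1, 2, 2, 3, 4)

# How far each observation sits from the mean
deviations <- x - mean(x)
deviations

# Positive and negative deviations cancel out, so this is 0
# (possibly off by a tiny floating-point rounding error)
mean(deviations)
```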

- Not so nice:
- Sensitive to outliers: `mean(c(1,2,3))` is `2`, but `mean(c(1,2,1001))` is 334.6666667
- Sometimes easy to forget the rest of the distribution (the mean describes the distribution, but it doesn’t describe EVERYTHING about the distribution!)

- The median is what you get when you line up all the observations from smallest to largest and pick the one in the middle
- If there’s an even number of observations, take the mean of the two middle ones

```
x <- c(3,1,4,2,2)
median(x)
```

`## [1] 2`

`sort(x)[ceiling(length(x)/2)]`

`## [1] 2`
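The one-liner above only covers an odd number of observations. Here is a sketch of the full rule, including the even case; `median_by_hand` is a hypothetical helper written for illustration, not a built-in function.

```r
# Hypothetical helper: the "line them up and pick the middle" rule
median_by_hand <- function(x) {
  sorted <- sort(x)
  n <- length(sorted)
  if (n %% 2 == 1) {
    # Odd number of observations: take the single middle one
    sorted[(n + 1) / 2]
  } else {
    # Even number of observations: take the mean of the two middle ones
    mean(sorted[c(n / 2, n / 2 + 1)])
  }
}

odd_result <- median_by_hand(c(3, 1, 4, 2, 2))  # middle of 1,2,2,3,4
even_result <- median_by_hand(c(3, 1, 4, 2))    # mean of 2 and 3
odd_result
even_result
```

In practice you would just call `median(x)`, which should give the same answers.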

- Nice things about the median:
- Super easy to calculate (you can often do it by hand)
- Represents the “typical” observation
- Not sensitive to outliers: `median(c(1,2,3))` is `2`, and `median(c(1,2,1001))` is 2
- Generally not affected by transforming the data

- Not so nice:
- Insensitivity to outliers means it can ignore real changes in the “tails”
- Can ignore magnitudes generally
- May be highly sensitive if there are big gaps between observations
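These downsides can be seen with a couple of made-up vectors. The sketch below first changes only the tail, then moves a single observation across a big gap; all of the numbers are invented for illustration.

```r
# A real change in the tail: the median does not notice, the mean does
small_tail <- c(1, 2, 3)
big_tail <- c(1, 2, 1001)
median(small_tail)  # 2
median(big_tail)    # 2
mean(small_tail)    # 2
mean(big_tail)      # about 334.67

# A big gap between observations: moving one observation across the gap
# makes the median jump all the way from 1 to 100
before <- c(1, 1, 1, 100, 100)
after <- c(1, 1, 100, 100, 100)
median(before)  # 1
median(after)   # 100
mean(before)    # 40.6
mean(after)     # 60.4
```

In the second pair, the mean moves by about 20 while the median moves by 99, even though only one observation changed.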