# Lecture 12: Midterm Review

## Recap

• We’ve been covering how to work with data in R
• Building up multiple variables (numeric, character, logical, factor) into vectors
• Joining vectors together into data.frames or tibbles
• Manipulating (with dplyr), summarizing, and plotting that data
• Looking at relationships between variables

## Working with Objects

• Create a vector with `c()` or `1:4` or `sample()` or `numeric()` etc.
• Create logicals to check conditions on a vector, i.e. `a < 5 & a > 1` or `c('A','B') %in% c('A','C','D')`
• Check vector type with `is.` functions or change them with `as.`
• Use `help()` to figure out how to use new functions you don’t know yet!

## Working with Objects Practice

• Use `sample()` to generate a vector of 1000 names from Jack, Jill, and Mary.
• Use `%in%` to count how many are Jill or Mary.
• Use `help()` to figure out how to use the `substr` function to get the first letter of the name. Then, use that to count how many names are Jack or Jill.
• Change the vector to a factor.
• Create a vector of all integers from 63 to 302. Then, count how many are below 99 or above 266.

``````names <- sample(c('Jack','Jill','Mary'),1000,replace=T)
sum(names %in% c('Jill','Mary'))
firstletter <- substr(names,1,1)
sum(firstletter == "J")
names <- factor(names)

numbers <- 63:302
sum(numbers < 99 | numbers > 266)``````

## Working with Data

• Get data with `read.csv()` or `data()`, or create it with `data.frame()` or `tibble()`
• Use dplyr to manipulate it:
• `filter()` to pick a subset of observations
• `select()` to pick a subset of variables
• `rename()` to rename variables
• `mutate()` to create new variables
• `%>%` to chain together commands
• Automate things with a `for (i = 1:10) {}` loop

## Working with Data Practice

• Load the `Ecdat` library and get the `Computers` data set
• In one chain of commands,
• create a logical `bigHD` if the `hd` is above median
• remove the `ads` and `trend` variables
• limit the data to only premium computers
• Use a `for` loop to print out the median price for each level of `ram`
• Loop over a vector, sometimes useful to use `unique()`

``````library(Ecdat)
data(Computers)

Computers <- Computers %>%
mutate(bigHD = hd > median(hd)) %>%

for (i in unique(Computers\$ram)) {
print(median(filter(Computers,ram==i)\$price))
}``````

## Summarizing Single Variables

• Variables have a distribution and we are interested in describing that distribution
• `table()`, `mean()`, `sd()`, `quantile()` and functions for 0, 50, 100% percentiles `min()` `median()` `max()`
• `stargazer()` to get a bunch of summary stats at once
• Plotting: `plot(density(x))`, `hist()`, `barplot(table())`
• Adding to plots with `points()`, `lines()`, `abline()`

## Summarizing Single Variables Practice

• Create a text stargazer table of Computers
• Use `table` to look at the distribution of `ram`, then make a `barplot` of it
• Create a density plot of price, and use a single `abline(v=)` to overlay the 0, 10, 20, …, 100% percentiles on it as blue vertical lines

``````library(stargazer)
stargazer(Computers,type='text')

table(Computers\$ram)
barplot(table(Computers\$ram))

plot(density(Computers\$price),xlab='Price',main='Distribution of Computer Price')
abline(v=quantile(Computers\$price,0:10/10),col='blue')``````

## Relationships Between Variables

• Looking at the distribution of one variable at a given value of another variable
• Check for dependence with `prop.table(table(x,y),margin=)`
• Correlation: are they large/small together? `cor()`
• `group_by(x) %>% summarize(mean(y))` to get mean of y within values of x
• `cut(x,breaks=10)` to put x into “bins” to explain y with
• Mean of y within values of x gives part of y explained by x
• Proportion of variance explained `1-var(residuals)/var(y)`
• `plot(x,y)` or overlaid density plots

## Relationships Practice

• Use `prop.table` with both margins to see if cd and multi look dependent
• Use `cut` to make 10 bins of `hd`
• Get average price by bin of `hd`, and residuals
• Calculate proportion of variance in price explained by `hd`, and calculate correlation
• Plot `price` (y-axis) against `hd` (x-axis)

``````prop.table(table(Computers\$cd,Computers\$multi),margin=1)
prop.table(table(Computers\$cd,Computers\$multi),margin=2)

Computers <- Computers %>%
mutate(hdbins = cut(hd,breaks=10)) %>%
group_by(hdbins) %>%
mutate(priceav = mean(price)) %>%
mutate(res = price - priceav)

#variance explained
1 - var(Computers\$res)/var(Computers\$price)

plot(Computers\$hd,Computers\$price,xlab="Size of Hard Drive",ylab='Price')``````

## Simulation

• There are true models that we can’t see, but which generate data for us
• We want to use methods that can work backwards to uncover true models
• We can randomly generate data using a true model we decide, and see if our method uncovers it
• `rnorm()`, `runif()`, `sample()`
• Create a blank vector, then a `for` loop to make data and analyze. Store result in the vector
• Analyze the vector to see what the results look like

## Simulation Practice

• Create a for loop that creates 500 obs of `At.War` (logical, equal to 1 10% of the time)
• And `Net.Exports` (uniform, min -1, max 1, then subtract 3*At.War)
• And `GDP.Growth` (normal, mean 0, sd 3, then add + Net.Exports + At.War)
• Explains GDP.Growth with At.War and takes the residual
• Calculates `cor()` between GDP.Growth and Net.Exports, and between GDP.Growth and residual
• Stores the correlations in two separate vectors, and then compares their distributions after 1000 loops

``````library(stargazer)

GDPcor <- c()
rescor <- c()

for (i in 1:1000) {
df <- tibble(At.War = sample(0:1,500,replace=T,prob=c(.9,.1))) %>%
mutate(Net.Exports = runif(500,-1,1)-3*At.War) %>%
mutate(GDP.Growth = rnorm(500,0,3)+Net.Exports+At.War) %>%
group_by(At.War) %>%
mutate(residual = GDP.Growth - mean(GDP.Growth))

GDPcor[i] <- cor(df\$GDP.Growth,df\$Net.Exports)
rescor[i] <- cor(df\$residual,df\$Net.Exports)
}

stargazer(data.frame(rescor,GDPcor),type='text')``````

## Midterm

Reminders:

• Midterm will allow the use of the R help files but not the rest of the internet
• You will also have access to lecture slides. I do not recommend relying on them as this would take you a lot of time
• Anything we’ve covered is fair game
• There will be one question that requires you to learn about, and use, a function we haven’t used yet
• The answer key to the midterm contains 35 lines of code, many of which are dplyr