- Up to now, we’ve been looking at how to manipulate objects in R
- And how to analyze data in R
- Looking at the distributions of those variables
- As well as how two variables relate to each other - dependence, correlation, explanation

- What we’ll be moving into soon is how to use these tools so that we can take data and actually *learn things*
- Today we’re going to get ready for that
- How?
- We’re going to *learn some fake things*
- Via simulation, i.e. creating fake data

- Why do this? Why not just use real data?
- Because with real data, *we don’t know what the right answer is*
- So if we do some method, and it gives us an answer, how do we know we did the method right?
- Simulation lets us *know the right answer* so we can check whether or not we *get* that right answer
- And if we do, we know the method works (at least in our fake scenario), so now we can apply it to some real data

- When it comes down to it, *what is the purpose of data analysis*? Why do we work with data?
- When we work with data, we have this idea that there exists a *true model*
- The *true model* is the way the world actually works!
- But we don’t know what that true model is

- So that’s where the data comes in
- The true model is what *generated the data* (the ‘data generating process’)
- So by looking at the data, and incorporating what we *think* we know about the true model, we’re trying to work backwards to figure out what it is that *generated* that data - and that’s the true model!
- With simulation, *we know* what generated the data and what the true model is. So we know the right answer and can check how close we get.

- We’ve covered this example before
- Let’s generate 501 coin flips
- The *true model* will be that this generates heads half the time and tails half the time

`coins <- sample(c("Heads","Tails"), 501, replace = TRUE)`

- Now let’s take that data as given and analyze it in our standard way!
- The proportion of heads is `mean(coins=='Heads')` = 0.493014
- And we can look at the distribution, as we would:

`barplot(prop.table(table(coins)))`

- So what’s our conclusion?
- We’d come to the conclusion that the *true model* generates heads 0.493014 of the time
- .500 is correct, so pretty close! But not exact. Did this whole thing work or not?
- What if it *always* errs on the same side? Then it’s not a good method at all!
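As a quick sketch of what "always errs on the same side" could look like, here's a deliberately biased method next to our actual one (the `+ 0.02` nudge is invented purely for illustration):

```r
#Flip the coins again
coins2 <- sample(c("Heads","Tails"), 501, replace = TRUE)
#Our actual method: the raw proportion of heads
mean(coins2=="Heads")
#A deliberately biased method: the same proportion plus a constant nudge
#No matter how much data we collect, this one comes out too high on average
mean(coins2=="Heads") + 0.02
```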

- We can go a step further by doing this simulation *over and over again* in a loop!
- This will let us tell whether our method gets it right on average
- And, when it’s wrong, how wrong it is!

```
#Load stargazer for the summary table (install.packages("stargazer") if you don't have it)
library(stargazer)
#A blank vector to hold our results
propHeads <- c()
#Let's run this simulation 2000 times
for (i in 1:2000) {
  #Re-create data using the true model
  coinsdraw <- sample(c("Heads","Tails"), 501, replace = TRUE)
  #Re-perform our analysis
  result <- mean(coinsdraw=="Heads")
  #And store the result
  propHeads[i] <- result
}
#Let's see what we get on average
stargazer(as.data.frame(propHeads), type='text')
```

```
##
## ============================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## ------------------------------------------------------------
## propHeads 2,000 0.500 0.022 0.427 0.485 0.515 0.589
## ------------------------------------------------------------
```

```
#And let's look at the distribution of our findings
plot(density(propHeads),xlab="Proportion Heads",main="Mean of 501 Coin Flips over 2000 Samples")
abline(v=mean(propHeads),col='red')
```

- Now that’s pretty exact!
- What are we learning here?
- The method that we used (simply taking the proportion of heads) will, on average, give us the right answer (.5). Good! Now, if we’re interested in a similar case in the real world, we can apply this method.
- Keeping in mind that in any given sample that we actually observe, it might be a little off.
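As a side note, the same simulation can be written more compactly with base R's `replicate()`, which runs an expression many times and collects the results - a sketch doing the same thing as the loop above:

```r
#Run the coin-flip analysis 2000 times without writing an explicit loop
propHeads2 <- replicate(2000, {
  coinsdraw <- sample(c("Heads","Tails"), 501, replace = TRUE)
  mean(coinsdraw=="Heads")
})
#On average we should recover the true model's .5
mean(propHeads2)
```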

- Imagine we *didn’t* know the answer was .5. What would have happened?
- Wonder “what proportion of the time will a coin be heads?” (what is the true model?)
- Collect data on coin flips
- Perform our analysis method - take the proportion of heads, and get 0.493014
- Come to the conclusion that the *true model* is that the coin produces heads 0.493014 of the time
- We wouldn’t be dead on, but on average we’d be right!

- We can use `sample()` as before to pick random data from different categories like Heads and Tails, or integers (say, `1:10`)
- R also has a whole buncha functions to generate random data from other kinds of distributions. See `help(Distributions)`.
- We will focus on two:
- The *uniform* distribution assigns equal probability to each value in its range (default 0 to 1)
- The *normal* distribution is a bell curve - observations near the mean are very common, observations far away from the mean very rare
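For instance, `sample()` handles integers just as easily as category labels (a quick sketch):

```r
#Draw 10 values from the integers 1 through 10, with replacement
intdraws <- sample(1:10, 10, replace = TRUE)
intdraws
```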

- `runif(thismanyobs,min,max)` will draw `thismanyobs` observations from the range of `min` to `max`
- `runif(thismanyobs)` will assume `min=0` and `max=1`
```
uniformdata <- runif(5)
uniformdata
```

`## [1] 0.1267818 0.8207126 0.2619525 0.4291249 0.1961915`

```
uniformdata <- runif(2000)
hist(uniformdata,xlab="Random Value",main="Distribution of Random Data from Uniform Distribution",probability=TRUE)
```

- Good for variables that should be bounded - “percent male” can only be 0-1
- Gives even probability of getting each value
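A sketch of that bounded-variable use - here generating a fake "percent male" variable (the name `pctmale` and the 100 observations are made up for illustration):

```r
#Simulate "percent male" for 100 fake observations - values must stay in 0 to 1
pctmale <- runif(100, min = 0, max = 1)
summary(pctmale)
```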

- `rnorm(thismanyobs,mean,sd)` will draw `thismanyobs` observations from a normal distribution with mean `mean` and standard deviation `sd`
- `rnorm(thismanyobs)` will assume `mean=0` and `sd=1`

```
normaldata <- rnorm(5)
normaldata
```

`## [1] -2.2990547 0.1119890 -0.3365734 2.2590853 -0.7877652`

```
normaldata <- rnorm(2000)
hist(normaldata,xlab="Random Value",main="Distribution of Random Data from Normal Distribution",probability=TRUE)
```
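And a quick sketch of `rnorm()` with the mean and sd arguments set explicitly (the values 100 and 15 are arbitrary choices for illustration):

```r
#Draw 2000 observations from a normal with mean 100 and sd 15
normaldata2 <- rnorm(2000, mean = 100, sd = 15)
#The sample mean and sd should come out close to 100 and 15
mean(normaldata2)
sd(normaldata2)
```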