Statistical Power

Statistics is an area where the lessons of children’s television are more or less true: if you try hard enough, anything is possible.

It’s also an area where the lessons of violent video games are more or less true: if you want to solve a really tough problem, you need to bring a whole lot of firepower (plus, a faster computer can really help matters).

Once we have our study design down, there are a number of things that can turn statistical analysis into a fairly weak tool and make us less likely to find the truth:

1. Really tiny effects are really hard to find (good luck seeing an electron without a super-powered microscope)
2. Most statistical analyses are about looking at variation. If there’s little variation in the data, we won’t have much to go on
3. If there’s a lot of stuff going on other than the effect we’re looking for, it will be hard to pull the signal from the noise
4. If we have really high standards for what counts as evidence, then a lot of good evidence is going to be ignored so we need more evidence to make up for it

Conveniently, all of these problems can be solved by increasing our firepower, by which I mean sample size. Power analysis is our way of figuring out exactly how much firepower we need to bring. If it’s more than we’re willing to provide, we might want to turn around and go back home.

Power analysis can be a great idea no matter what kind of study you’re running. However, it’s especially helpful in two cases:

1. If you’re looking for an effect that you think is probably not that central or important to what’s going on, i.e. it’s a small effect, or a part of a system where a lot of other stuff is going on (roughly you can think of this as “a small $$R^2$$”) a power analysis can be a great idea - the sample size required to learn something useful about a small effect is often much bigger than you expect, and it’s good to learn that now rather than after you’ve already done all the work
2. If you’re running a randomized experiment, you actually have some control over your sample size - you can choose how much data to collect, and how to randomize the treatment. A power analysis is vital before running an experiment so you don’t get to the end and realize “oh dang, I probably should have done it with a few hundred more people… too late now!”

In this document we’ll talk about power analysis in general and how it’s done, and then we’ll go into how to perform a power analysis using simulation in R, making use of tools from the tidyverse.

What Power Analysis Does

Using $$X$$ as shorthand for the treatment and $$Y$$ as shorthand for the outcome, assuming we’re doing a power analysis for the a study of the relationship between $$X$$ and $$Y$$, power analysis balances five things:

1. The size of the effect (coefficient in a regression, a correlation, a Cohen’s $$d$$, etc., etc.)
2. The amount of variation in the treatment (the variance of $$X$$, say)
3. The amount of other variation in $$Y$$ (the $$R^2$$, or the variation from the residual after explaining $$Y$$ with $$X$$, or just the variation in $$Y$$)
4. Statistical precision (the standard error of the estimate, statistical power, i.e. the true-positive rate)
5. The sample size

A power analysis holds four of these constant and tells you what the fifth can be. So, for example, it might say “if we think the effect is probably A, and there’s B variation in $$X$$, and there’s C variation in $$Y$$ unrelated to $$X$$, and you want to have at least a D% chance of finding an effect if it’s really there, then you’ll need a sample size of at least E.” This tells us the minimum sample size necessary to get sufficient statistical power.

Or we can go in other directions. “If you’re going to have a sample size of A, and there’s B variation in $$X$$, and there’s C variation in $$Y$$ unrelated to $$X$$, and you want to have at least a D% chance of finding an effect if it’s really there, then the effect must be at least as large as E.” This tells us the minimum detectable effect, i.e. the smallest effect you can hope to have a chance of reasonably measuring given your sample size.

How about that “statistical precision” option? Usually, you have a target level of statistical power (thus the name “power analysis”). Statistical power is the true-positive rate. That is, if there’s truly an effect there, and sampling variation means that you have an 80% chance of rejecting the null of no-effect in a given sample, then you have 80% statistical power. Statistical power is dependent on the kind of test you’re running, too - if you are doing a hypothesis test at the 95% confidence level, you’re more likely to reject the null (and thus will have higher statistical power) than if you’re doing a hypothesis test at the 99% confidence level. The more careful you are about false positives, the more likely you are to get a false negative. So there’s a tradeoff there.

Power analyses don’t have to be run with statistical power in mind. In fact, you don’t necessarily need to think about things in terms of “the ability to reject the null”, which is what statistical power is all about. You could run your power analysis with any sort of statistical precision as a goal, like standard errors. Given A, B, and C, what sample size D do you need to make your standard errors E or smaller?

(as an aside, simulation is also great for checking stuff other than power - how about try using it to check whether your estimate is biased on average? Barely any different from the instructions on this page, just store the difference between the estimate and the truth and see what that looks like over a bunch of interations!)

Where Do Those Numbers Come From?

In order to do power analysis, you need to be able to fill in the values for four of those five pieces, so that power analysis can tell you the fifth one. How do we know those things?

We have to make the best guesses we can. We can use previous research to try to get a sense of how big the effect is likely to be, or how much variation there’s likely to be in $$X$$. If there’s no prior research, do what you can - think about what is likely to be true, make educated guesses. Power analysis at a certain level requires some guesswork.

Other pieces aren’t about guesses but about standards. How much statistical power do you want? The higher your statistical power, the less chance there is of missing a true effect. But that means your sample size needs to go up a lot. Often people use a rule of thumb here. In the past, a goal of 80% statistical power has been standard. These days I see 90% a lot more often.

In practical terms, power analysis isn’t a homework assignment, it’s guidance. It doesn’t need to be exact. A little guesswork (although ideally as little as possible) is necessary. After all, even getting the minimum sample size necessary doesn’t guarantee your analysis is good, it just gives you a pretty good chance of finding a result if it’s there. Often, people take the results of their power analysis as a baseline and then make sure to overshoot the mark, under the assumption they’ve been too conservative. So don’t worry about being accurate, just try to make the numbers in the analysis close enough to be useful.

Doing Power Analysis with a Calculator

If the analysis you’re planning to do is a standard one, you can use one of many, many available “power analysis calculators” to do the work for you. Two that I happen to like are PowerandSampleSize (which I use when I want to just get a quick rough answer) or G*Power (which I use when I’m working on something serious). There are also plenty of functions for doing power analysis calculations in R directly, like power.t.test() or the powerMediation package (or just Google “power analysis R”).

This can be great - as long as you can get your assumptions in terms the calculator understands (if you’re an economist, translating your regression coefficients into something G*Power understands can be a bit of a headache), it will tell you the results.

On the downside, there are a lot of different statistical analyses, and a power calculator needs to be calibrated differently for every different kind of analysis. So it’s likely that, unless you’re doing something pretty standard (like a basic two-group randomized trial), the calculator might not cover your particular case.

Thankfully, we have another completely flexible tool available to us, that also requires less translation: simulation!

Doing Power Analysis with Simulation

Maybe it’s just because I’m really familiar with running simulations in general, but I find simulation to be a much easier way to do power analysis than working with a calculator a lot of the time.

What do I mean by doing power analysis with simulation? Well, the idea of a power analysis is that (a) for data with certain properties (variance of $$X$$, size of the effect, etc.), (b) and a certain analytic method (regression, etc.), (c) we are making some claim about the sampling variation of the estimator (statistical power, size of standard errors, etc.).

What better way to make sure our data has the properties we think it does, perform the exact analysis we plan to perform, and see the sampling variation than to make up our own data with those properties, perform the analysis, and do it a bunch of times so we can see the sampling variation? This is how simulated power analysis works.

Step 1: Make Up Data With The Properties We Want

There are a lot of functions you can use to generate random data in R (see help(Distributions)). I generally make use of only a few: rnorm() to generate normally-distributed data, runif() to generate uniformly-distributed data, and sample() to generate categorical or binary data.

rnorm() takes three arguments: the sample size, the mean, and the standard deviation. rnorm(100, 0, 3) produces 100 random normal draws from a normal distribution with a mean of 0 and a standard deviation of 3.

runif() takes three arguments: sample size, the start, and the end of the distribution. runif(200, 1, 3) produces 200 random uniform draws from a uniform distribution that goes from 1 to 3.

sample() takes a bunch of arguments, but the important ones are the set to sample from, the sample size, replace, and prob. sample(c('Treated','Untreated'), 500, replace = TRUE, prob = c(.2,.8)) produces 500 random observations that are either 'Treated' or 'Untreated'. Since prob = c(.2,.8), there’s a 20% chance of each draw being the first one ('Treated') and an 80% chance it’s the second ('Untreated'). You pretty much always want replace = TRUE since that says that you can get the same thing more than once (i.e. more than one 'Treated' observation).

You can then design whatever sort of data you like - the data generating process is in your hands! You can use variables generated using random data to create other variables as well - now you have a causal link!

For example, let’s construct a uniformly-distributed treatment variable from 0 to 1 $$X$$ (perhaps $$X$$ is “the share of students in this classroom that are left-handed), and let’s say that a one-unit change in $$X$$ causes a $$.2$$ change in $$Y$$ (perhaps”the number of left-handed scissors in the room”).

library(tidyverse)

# Make a tibble (or data.frame) to contain our data
tib <- tibble(
# Start by creating the variables that don't depend on anything else
# the 1000 is how many observations we're going to have. THe 0 and 1 are the start and end of the uniform distribution
X = runif(1000, 0, 1)
) %>%
# Then mutate in any variables that depend on earlier variables
# The *true effect* of X on Y is .2
mutate(Y = .2*X + rnorm(1000, mean = 0, sd = 3))

Just from this construction we have set: (a) the true effect of $$X$$ on $$Y$$, (b) the amount of variation in $$X$$ (by the characteristics of the uniform distribution from 0 to 1, 1/12), (c) the amount of variation in $$Y$$ not coming from $$X$$ (it’s a normal distribution with a standard deviation of 3), and (d) the sample size (1000).

Step 2: Perform the Analysis We’re Going to Perform

We can take this data and perform whatever analysis we like on it. Let’s say we plan to regress $$Y$$ on $$X$$ using robust standard errors, so let’s use those.

library(fixest)
model <- feols(Y ~ X, data = tib, se = 'hetero')

Step 3: Pull out the Results we Want

Now that we have our analysis performed, we’re going to want to pull the results out so we can store them elsewhere. How can we pull out the results we want?

First, we want to think about what it is we want to pull out. If we’re interested in whether or not the result is significant, we should pull out the p-value. Maybe also the coefficient itself so we can see its distribution. Coefficients can usually be found using the coef() function - try it in the Console to see which position your coefficient of interest is at, and then use [] to get it out. Here we would want coef(model)[2].

But where is the p-value in our model object? You can try looking for what you need by typing model$ and seeing what RStudio autocompletes for you. Or try s <- summary(model) and then s$ to see what it autocompletes.

You could also use tidy() in the broom package, which turns your regression into a data frame you can easily pull things from.

library(broom)

tidy(model)
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  -0.0963     0.190    -0.507   0.612
## 2 X             0.349      0.333     1.05    0.295
# Here we go!
tidy(model)$p.value[2] ## [1] 0.294948 # And if we're just interested in significance, say at the 95% level... sig <- tidy(model)$p.value[2] <= .05
sig
## [1] FALSE

Step 4: Repeat!

Of course, this is only one generated data set. That doesn’t tell us much of anything about the sampling variation! So we need to do this all a bunch of times, maybe a few thousand, and see what we get.

While an R purist would probably opt for doing this with one of the apply() functions, or map() in the purrr package, for simplicity we’re just going to use the good ol’ for() loop.

for (iterator in range) { code } will run the code code a bunch of times, each time setting the iterator variable to a different value in range, until it’s tried all of them, like this:

for (i in 1:5) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

We’re going to take all of the steps we’ve done so far and put them inside those curly braces {}. Then we’re going to repeat the whole process a bunch of times!

for (i in 1:2000) {
# Have to re-create the data EVERY TIME or it will just be the same data over and over
tib <- tibble(
X = runif(1000, 0, 1)
) %>%
mutate(Y = .2*X + rnorm(1000, mean = 0, sd = 3))

# Run the analysis
model <- feols(Y ~ X, data = tib, se = 'hetero')

# Get the results
coef_on_X <- coef(model)[2]
print(coef_on_X)
sig <- tidy(model)$p.value[2] <= .05 print(sig) } If we run this code, it will generate our data 2000 times (i in 1:2000) and run the analysis on each of those data sets. Then it will get the coefficient on X and whether it’s significant at the 95% level and print those to screen. Of course, this will just leave us with 2000 results printed to screen. Not that useful! Step 5: Store the Results So instead of printing to screen, we’re going to save all of the results so we can look at them afterwards. We’ll start by creating some blank vectors to store our data in, like results <- c(). Then, as we do the analysis over and over, instead of printing the results to screen, we can store them in results[i], which will put it in the ith position of the vector, adding on new elements as i grows higher and higher with our for() loop. coef_results <- c() sig_results <- c() for (i in 1:2000) { # Have to re-create the data EVERY TIME or it will just be the same data over and over tib <- tibble( X = runif(1000, 0, 1) ) %>% mutate(Y = .2*X + rnorm(1000, mean = 0, sd = 3)) # Run the analysis model <- feols(Y ~ X, data = tib, se = 'hetero') # Get the results coef_results[i] <- coef(model)[2] sig_results[i] <- tidy(model)$p.value[2] <= .05
}

Step 6: Examine the Results

Our estimate of statistical power is the proportion of the results that are significant:

mean(sig_results)
## [1] 0.09

So we have statistical power of 9.00%. We might also want to look at the distribution of the coefficient itself. The standard deviation of the coefficient across all the simulated runs gives you a good idea of what the standard error of the coefficient will be (sd(coef_results), which gives us $_{} =$ 0.33).

We should probably also check the distribution directly with geom_density() in ggplot2 to make sure it looks appropriate and there aren’t weird outliers implying a super sensitive analysis. Right now, our coefficient is stored as a vector, not a data.frame or tibble, so we can’t use it with ggplot as we normally would. We either need to put it in the aes() argument, unlike our normal approach to ggplot2:

ggplot(mapping = aes(coef_results)) +
geom_density()

or just put it in a tibble or data.frame first:

results_tibble <- tibble(coef = coef_results,
sig = sig_results)
ggplot(results_tibble, aes(x = coef)) +
geom_density() +
# Prettify!
theme_minimal() +
labs(x = 'Coefficient', y = 'Density')

ggplot(results_tibble, aes(x = sig)) +
geom_bar() +
# Prettify!
theme_minimal() +
labs(x = 'Coefficient', y = 'Count') +
scale_x_discrete(labels = c('Insignificant','Significant'))

Step 7: Fiddle!

The goal of power analysis isn’t usually to just take one set of data-generating characteristics and generate a single power estimate, it’s to do things like calculate the minimum detectable effect or smallest sample size for a given power level.

How can we do that here? By trying different values of effect size and sample size and seeing what we get.

To do this, we’re first going to take everything we’ve done so far and put it inside a function that we can call.

my_power_function <- function(x) {
coef_results <- c()
sig_results <- c()

for (i in 1:2000) {
# Have to re-create the data EVERY TIME or it will just be the same data over and over
tib <- tibble(
X = runif(1000, 0, 1)
) %>%
mutate(Y = .2*X + rnorm(1000, mean = 0, sd = 3))

# Run the analysis
model <- feols(Y ~ X, data = tib, se = 'hetero')

# Get the results
coef_results[i] <- coef(model)[2]
sig_results[i] <- tidy(model)$p.value[2] <= .05 } } Now we can ask what parts of the simulation we want to try fiddling with. Perhaps the effect size or the sample size (although we could do this with any part of it). So let’s make those arguments of the function that we can change whenever we call it. That will happen in the function() call (and notice how throughout we replace references to the effect size or the sample size with the argument names): my_power_function <- function(effect, sample_size) { coef_results <- c() sig_results <- c() for (i in 1:500) { # Have to re-create the data EVERY TIME or it will just be the same data over and over tib <- tibble( X = runif(sample_size, 0, 1) ) %>% mutate(Y = effect*X + rnorm(sample_size, mean = 0, sd = 3)) # Run the analysis model <- feols(Y ~ X, data = tib, se = 'hetero') # Get the results coef_results[i] <- coef(model)[2] sig_results[i] <- tidy(model)$p.value[2] <= .05
}
}

(also notice I dropped down the number of iterations from 2000 to 500 to speed things up, since we’ll be doing this multiple times)

Next we can ask what result we really want to get back. I want to know what the power is for different combinations of effect and sample size, so I just want to get back the proportion that are significant. So I’ll return() that.

my_power_function <- function(effect, sample_size) {
sig_results <- c()

for (i in 1:500) {
# Have to re-create the data EVERY TIME or it will just be the same data over and over
tib <- tibble(
X = runif(sample_size, 0, 1)
) %>%
mutate(Y = effect*X + rnorm(sample_size, mean = 0, sd = 3))

# Run the analysis
model <- feols(Y ~ X, data = tib, se = 'hetero')

# Get the results
sig_results[i] <- tidy(model)$p.value[2] <= .05 } sig_results %>% mean() %>% return() } Now we can just call the function, setting effect and sample_size to whatever we want, and get the power back! Let’s check it with the values we had before and make sure we’re in the same range: my_power_function(.2, 1000) ## [1] 0.1 Seems good! Now let’s say we really are stuck with a sample size of 1000 and we want to know the minimum detectable effect size we can get a power of .8 with. To do this, we can just run our function with sample_size = 1000 and a bunch of different effect values until we get back a power above .8! power_levels <- c() effects_to_try <- c(.4, .8, 1.2, 1.6, 2) for (i in 1:5) { power_levels[i] <- my_power_function(effects_to_try[i], 1000) } # Where do we cross 80%? power_results <- tibble(effect = effects_to_try, power = power_levels) power_results ## # A tibble: 5 x 2 ## effect power ## <dbl> <dbl> ## 1 0.4 0.216 ## 2 0.8 0.688 ## 3 1.2 0.94 ## 4 1.6 0.994 ## 5 2 1 ggplot(power_results, aes(x = effect, y = power)) + geom_line(color = 'red', size = 1.5) + # add a horizontal line at 90% geom_hline(aes(yintercept = .8), linetype = 'dashed') + # Prettify! theme_minimal() + scale_y_continuous(labels = scales::percent) + labs(x = 'Linear Effect Size', y = 'Power') So it looks like we need an effect somewhere between .8 and 1.2 to have an 80% chance of finding a significant result. If we don’t think the effect is actually likely to be that large, then we need to figure out something else to do - find a bigger sample, use a more precise estimation method, something! Or else we should probably walk away. An Example Let’s walk quickly through an example that might commonly pop up but could be tricky to do with a power calculator. We will have: 1. Treatment assigned at the group level rather than the individual level 2. A back-door/confounder variable $$W$$ that we will control for in analysis. The average $$W$$ in your group affects your probability of receiving treatment. First, we generate data. Because of the group-level stuff this will be a bit trickier! tib <- tibble( # Let's make the confounder uniformly distributed # 0 and 1 start/end points are default W = runif(1000), # We can assign groups from 1 to 10, a categorical variable, using sample() # if we don't specify prob it defaults to equal probabilities group = sample(1:10, 1000, replace = TRUE) ) # Now for the group-level assignment. First let's get group-average W since that will affect assignment groupdata <- tib %>% group_by(group) %>% summarize(mean_W = mean(W)) %>% # Your group is treated with a probability of .5 + mean_W/10, which we can get # using runif() (X > runif() occurs with a probability of X) # notice for the sample size here we use nrow(.), which refers to the number of # rows in the data we're working with, in case by random chance not all 10 groups got assigned mutate(treated = .5 + mean_W/10 >= runif(nrow(.))) # Now let's bring the group-level treatment back into the data tib <- tib %>% left_join(groupdata) %>% # and make an outcome based on treatment and W and group # True effect of treatment 1 mutate(Y = 1*treated + W + group/10 + rnorm(1000)) Next, we run the analysis and pull out our result of interest. Here we’re again going to go for power, so we’ll pull significance, this time from an lm(), and testing at the 99% confidence level. m <- lm(Y~treated + W, data = tib) sig <- tidy(m)$p.value[2] <= .01

Now we wrap the whole thing in a loop.

sig_results <- c()

for (i in 1:500) {
tib <- tibble(
# Let's make the confounder uniformly distributed
# 0 and 1 start/end points are default
W = runif(1000),
# We can assign groups from 1 to 10, a categorical variable, using sample()
# if we don't specify prob it defaults to equal probabilities
group = sample(1:10, 1000, replace = TRUE)
)

# Now for the group-level assignment. First let's get group-average W since that will affect assignment
groupdata <- tib %>%
group_by(group) %>%
summarize(mean_W = mean(W)) %>%
# Your group is treated with a probability of .5 + mean_W/10, which we can get
# using runif() (X > runif() occurs with a probability of X)
# notice for the sample size here we use nrow(.), which refers to the number of
# rows in the data we're working with, in case by random chance not all 10 groups got assigned
mutate(treated = .5 + mean_W/10 >= runif(nrow(.)))

# Now let's bring the group-level treatment back into the data
tib <- tib %>%
left_join(groupdata) %>%
# and make an outcome based on treatment and W and group
# True effect of treatment 1
mutate(Y = 1*treated + W + group/10 + rnorm(1000))

m <- lm(Y~treated + W, data = tib)

sig_results[i] <- tidy(m)$p.value[2] <= .01 } And put that inside a function so we can manipulate things. my_power_function <- function(effect, sample_size) { sig_results <- c() for (i in 1:500) { tib <- tibble( # Let's make the confounder uniformly distributed # 0 and 1 start/end points are default W = runif(sample_size), # We can assign groups from 1 to 10, a categorical variable, using sample() # if we don't specify prob it defaults to equal probabilities group = sample(1:10, sample_size, replace = TRUE) ) # Now for the group-level assignment. First let's get group-average W since that will affect assignment groupdata <- tib %>% group_by(group) %>% summarize(mean_W = mean(W)) %>% # Your group is treated with a probability of .5 + mean_W/10, which we can get # using runif() (X > runif() occurs with a probability of X) # notice for the sample size here we use nrow(.), which refers to the number of # rows in the data we're working with, in case by random chance not all 10 groups got assigned mutate(treated = .5 + mean_W/10 >= runif(nrow(.))) # Now let's bring the group-level treatment back into the data tib <- tib %>% left_join(groupdata) %>% # and make an outcome based on treatment and W and group # True effect of treatment 1 mutate(Y = effect*treated + W + group/10 + rnorm(sample_size)) m <- lm(Y~treated + W, data = tib) sig_results[i] <- tidy(m)$p.value[2] <= .01
}

sig_results %>%
# Tack a na.rm = TRUE on here in case one of the regressions fails so it doesn't taint the whole thing!)
mean(na.rm = TRUE) %>%
return()
}

How often are we significant here? Running my_power_function(1, 1000) from the original values gives us significance 100% of the time! 100% power. Maybe with an effect so large we dont even need that many observations.

(although note that this estimate might be biased… we could change this function slightly to store and then return bias_results = coef(m)[2] - effect to see whether the estimate is on average accurate)

And finally, we can see how big of a sample size we need to get power of 90%.

power_levels <- c()

sample_sizes_to_try <- c(50, 100, 200, 300, 400, 500)

for (i in 1:6) {
power_levels[i] <- my_power_function(1, sample_sizes_to_try[i])
}

# Where do we cross 80%?
power_results <- tibble(sample = sample_sizes_to_try,
power = power_levels)
power_results
## # A tibble: 6 x 2
##   sample power
##    <dbl> <dbl>
## 1     50 0.628
## 2    100 0.91
## 3    200 0.986
## 4    300 0.998
## 5    400 0.996
## 6    500 0.998
ggplot(power_results,
aes(x = sample, y = power)) +
geom_line(color = 'red', size = 1.5) +
# add a horizontal line at 90%
geom_hline(aes(yintercept = .9), linetype = 'dashed') +
# Prettify!
theme_minimal() +
scale_y_continuous(labels = scales::percent) +
labs(x = 'Sample Size', y = 'Power')

Looks like we pass power of 90% with somewhere between 50 and 100 observations! That’s a big effect, requiring not a whole lot of sample to unearth.

Conclusion

So there we have it! The tricky part here is figuring out how to generate data where the variables have the properties and relationships you want them to (and things like treatment are assigned how you want). But that just takes a little practice - give it a whirl! I find simulation to answer questions about power or bias to be strangely addictive, and fun to do.