class: center, middle, inverse, title-slide

.title[
# Applying Hypothesis Testing
]
.subtitle[
## Part 2: How to Do It
]
.date[
### Updated 2023-02-28
]

---

# Recap

- We are trying to characterize the *uncertainty* in our estimate that comes from sampling variation
- One way to do this is to characterize the sampling distribution: show the standard error and construct a confidence interval
- Another is if we have a *null hypothesis* of interest - for parameters giving relationships like slopes, usually that the parameter is 0 (no relationship)
- Then we can look at the sampling distribution *assuming the null is true* and see if our actual result is too weird to believe; if it is, reject the null!

---

# The Null Distribution

- Hypothesis testing centers around the concept of the sampling distribution and, further, the null distribution
- So far we've talked about estimates whose sampling distribution is normal
- That is a key assumption to have made
- We have to have an idea of what the null distribution of our estimate *is* in order to figure out *how weird* our result is

---

# The Normal Null

- Let's talk about the normal distribution some more to get more intuition on what's happening, then branch out
- When we have an estimate, like `\(\hat{\beta}_1\)`, that follows a normal distribution, we know that the sampling distribution:
  - Is symmetric
  - Can be wider or narrower depending on the standard error
  - Can be transformed into a "standard normal" (mean 0 and standard deviation 1) by subtracting the mean and dividing by the standard deviation

---

# The Normal Null

- Because of that last point, when evaluating an estimate like `\(\hat{\beta}_1\)` we will transform it into a *Z-score*, which is

`$$\frac{\hat{\beta}_1 - Null}{s.e.(\hat{\beta}_1)}$$`

- That is, we subtract our null hypothesis value and divide by the standard error
- This Z-score is our *test statistic*
- The thing about test statistics in general (Z-scores, t-scores, F statistics, chi-square statistics, etc.) and why we want them: *we know their sampling distribution very well*

---

# The Normal Null

- Why do this transformation? Because the *sampling distribution of `\(\hat{\beta}_1\)` assuming the null is true* is a normal with mean `\(Null\)` and standard deviation `\(s.e.(\hat{\beta}_1)\)`
- So by subtracting the null (the mean of that sampling distribution) and dividing by `\(s.e.(\hat{\beta}_1)\)` (its standard deviation), the distribution under the null becomes a normal with mean 0 and standard deviation 1, the standard normal. Very easy to work with!
- The sampling distribution *of `\(\hat{\beta}_1\)`* under the null was the original normal distribution we started with, but the sampling distribution *of the Z-score* under the null is a normal with mean 0 and s.d. 1
- So if 2.5% of the area of the distribution is above our original estimate of `\(4.92\)` in the original null distribution with mean `\(1\)` and s.e. `\(2\)`, then 2.5% of the area will also be above our Z-score of `\((4.92-1)/2 = 1.96\)` in the standard normal

---

# The Normal Null

- This brings us to the concept of *critical values*
- When we pick an `\(\alpha\)`, we will reject the null if the p-value is below `\(\alpha\)`
- This happens if `\(\hat{\beta}_1\)` is a certain distance away from the null or farther
- If we use the *same distribution every time* (a standardized null distribution), that distance will always be the same!
- So instead of checking the percentage under the distribution every time, we just figure out the critical value once and see if we're farther from the null than that. Much easier! (See the quick sketch below)
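To make the critical-value logic concrete, here is a minimal sketch in R using the numbers from the example above (an estimate of 4.92, a null of 1, and a standard error of 2):

```r
# The example from above: estimate 4.92, null hypothesis value 1, s.e. 2
estimate <- 4.92
null_value <- 1
std_error <- 2

# Z-score: subtract the null and divide by the standard error
z_score <- (estimate - null_value) / std_error

# Two-sided test at alpha = .05: reject if we're beyond the critical value
crit <- qnorm(.975)   # 1.959964
abs(z_score) >= crit  # TRUE - we just barely reject!
```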
---

# The Normal Null

- For the standard normal, 5% of the area is outside the bounds of `\(Z = -1.96\)` and `\(Z = 1.96\)` (2.5% on each side - a two-sided test!). Remember this number?
- For 10% it's 1.65, and for 1% it's 2.58. How much area is in each shaded part?

![](data:image/png;base64,#Week_03_Slides_2_Applying_Hypothesis_Testing_files/figure-html/unnamed-chunk-1-1.png)<!-- -->

---

# The Normal Null

- This critical-value approach means that we can check results much more easily
- Look at this regression table regressing the number of house listings on "months it would take to sell this inventory". We can eyeball a test with a null of 0 at the 95% confidence level by calculating the test statistic in our heads and comparing it to 1.96 (or 2, to be even easier) - when the null is 0, the Z-score is just `\(\hat{\beta}_1/s.e.(\hat{\beta}_1)\)`

```
##                       housingmodel
## Dependent Var.:           listings
##                                   
## Constant        4,164.3*** (130.3)
## inventory        -129.9*** (15.28)
## _______________ __________________
## S.E. type                      IID
## Observations                 7,135
## R2                         0.01003
## Adj. R2                    0.00989
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

# Concept Checks

- Why do we transform `\(\hat{\beta}_1\)` into a Z-score?
- When we transform `\(\hat{\beta}_1\)`, why do we have to use the null hypothesis value in our calculation?
- Why do we have to use the standard error?
- Is the coefficient on `x` in the below table significant at the 95% confidence level? Eyeball it!

```
##                 feols(y ~ x, da..
## Dependent Var.:                 y
##                                  
## Constant         0.2764. (0.1386)
## x               1.284*** (0.2159)
## _______________ _________________
## S.E. type                     IID
## Observations                   20
## R2                        0.66279
## Adj. R2                   0.64406
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

# The Normal Null

- So those are the features of the normal null
- When do we use the normal null?
- Any time we have an *average* over *lots of observations*, the normal distribution pops up
  - (it's not obvious, but OLS coefficients are averages)
- So for OLS coefficients, we just need "a lot of observations" to use a normal

---

# When Not to Use the Normal

- Small samples (how small is "small"? 0-30 is definitely small, 100+ is probably not small, 30-100 is a gray area)
- Things that aren't means, like ratios
- In these cases, what can we do instead?

---

# The t-distribution

- The t-distribution is very similar to a normal distribution, except that it applies to means of smaller sample sizes, and instead of a mean and s.d. it has a *degrees of freedom* parameter that determines how wide it is
- When you have fewer observations, you're more likely to get a mean that's far from the true mean, i.e. "fatter tails"

![](data:image/png;base64,#Week_03_Slides_2_Applying_Hypothesis_Testing_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

# The t-distribution

- So, for smaller samples, it's a good idea to use the critical value from a t-distribution instead of a normal
- We do the same subtract-the-null-and-divide-by-s.e. steps, just now we call the result a t-score rather than a Z-score
- How can we find the critical values of the t-distribution? Don't bother memorizing them, as they change depending on your sample size. Instead, use the function that measures the proportion of the distribution weirder than a value (`pt()`), and take its inverse (`qt()`) to get the critical value! (A full worked example follows below)

```r
qt(.025, df = 28)
```

```
## [1] -2.048407
```

- (psst... we could have also done that with the normal: `qnorm(.025) = ` -1.959964)
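Here is a minimal sketch of a full small-sample test done by hand, using made-up data (the names and numbers are hypothetical, not from the slides' data):

```r
# Made-up small sample: N = 29 observations of some variable x
set.seed(1000)
x <- rnorm(29, mean = .5)

# t-score: subtract the null hypothesis value (here, a null of 0)
# and divide by the standard error of the mean
t_score <- (mean(x) - 0) / (sd(x) / sqrt(length(x)))

# Reject at alpha = .05 if we're beyond the critical value with N - 1 = 28 df
abs(t_score) > abs(qt(.025, df = 28))

# Or compute the two-sided p-value directly from the t-distribution
2 * pt(-abs(t_score), df = 28)
```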
---

# Small Samples

- Using a null t-distribution when dealing with a small sample is important
- But it's not the only issue with small samples (even samples large enough to use a normal!)
- The practice of hypothesis testing in general makes small samples more perilous!

---

# Small Samples

- Small samples reduce *power* considerably - it becomes difficult to reject a false null
- Also, the estimate in general will be much noisier!
- As samples get small, the number of *true rejections* drops as power drops, but the number of *super extreme noisy results* goes up!
- Think back to that animation where we had `\(N = 2\)`...
- So the smaller your sample, the better the chance that a given rejection of a null is a false positive rather than a true positive

---

# Small Samples

- Small samples also make it harder to detect small effects
- Power, false positive rates, and false negative rates rely not just on sample size and `\(var(X)\)`, but also on *how big the true effect is that you are trying to find*
- Big effects are easy to find - `\(N = 2\)` could tell you whether a parachute saves you from death
- Tiny effects need big samples to have power and estimate effects precisely - to have a 90% chance to reject the null of 0 effect for a pill that truly increases your IQ by .003 points, you'd need a sample of about 8 billion

---

# Small Samples

- So in general:
- Don't pay *too* much attention to studies with small samples
- If a result is truly wild and unexpected, check whether the sample is small - good chance it's just noise
- If you do have to work with a small sample, maybe avoid hypothesis tests
- But if you do use a test, use a t-distribution null

---

# F distribution

- Another null distribution that comes up a lot is the *F distribution*
- The F distribution is the distribution of *the ratio of two squared normal variables* (or the ratio of two sums of squared normal variables), a.k.a. the ratio of two `\(\chi^2\)`-distributed variables (each divided by its degrees of freedom)

![](data:image/png;base64,#Week_03_Slides_2_Applying_Hypothesis_Testing_files/figure-html/unnamed-chunk-6-1.png)<!-- -->

---

# F distribution

- Why does this come up? Because it's useful for *comparing models*
- If we use a squared normal variable to measure some quality of a model (hint: OLS tries to minimize the sum of **squared** residuals; we can turn that into a measure of model fit - we'll get to that later)
- Then we can compare models by dividing one measure by the other
- If we surpass the critical value, then we can reject that the two models are equally good!

---

# F distribution

- Instead of a mean and a standard deviation, the F distribution is defined by two *degrees of freedom*, for the number of squared normals in the numerator and denominator, respectively
- When we're doing a comparison of models, these degrees of freedom will be based on *how many parameters* are being compared and the sample size
- We'll do more of that when we get to multivariate regression
- For now, we have one degree of freedom up top, and `\(N-2\)` degrees of freedom on the bottom (the sample size minus the two parameters, `\(\beta_0\)` and `\(\beta_1\)`, that we estimate)

---

# F distribution

- When we use the F distribution to do a test of a single regression coefficient, we're comparing the model with the variable included against the model without it included (i.e. `\(Y = \beta_0 + \beta_1X\)` vs. `\(Y = \beta_0\)`)
- The p-value will be exactly the same as from the two-sided t-test with a null hypothesis value of 0 (and with a big sample, that's the same as the normal-null test we did before)
- But, as mentioned, F distributions will be more handy when we get to multivariate regression
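To see that equivalence in action, here is a minimal sketch on made-up data. It uses base R's `lm()` and `anova()` for transparency rather than **fixest**, and all the data and names are hypothetical:

```r
# Made-up data for a simple regression
set.seed(500)
dat <- data.frame(x = rnorm(100))
dat$y <- 1 + .2 * dat$x + rnorm(100)

full_model     <- lm(y ~ x, data = dat)  # Y = b0 + b1*X
intercept_only <- lm(y ~ 1, data = dat)  # Y = b0

# The F statistic comparing the two models...
anova(intercept_only, full_model)$F[2]

# ...is exactly the squared t statistic on the x coefficient
# (and the p-values match, too)
summary(full_model)$coefficients['x', 't value']^2
```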
---

# F distribution

```
## OLS estimation, Dep. Var.: inventory
## Observations: 496
## Standard-errors: IID
##                Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept)  5.66400485 0.22251450 25.45454 < 2.2e-16 ***
## median      -0.00000176 0.00000197 -0.89239   0.37262    
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.952578   Adj. R2: -4.116e-4
```

```
## [1] "F statistic for the coefficient on median price is 0.80 p-value = 0.37"
```

---

# F distribution

- Testing one parameter with `\(496\)` observations, so the numerator df is `\(1\)` and the denominator df is `\(496 - 2 = 494\)`

![](data:image/png;base64,#Week_03_Slides_2_Applying_Hypothesis_Testing_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---

# Concept Checks

- Why do small samples increase the false negative rate (low power) but not increase the false positive rate?
- You want to test a regression coefficient estimated from an `\(N = 2000\)` sample with a null hypothesis value of 1/2. What null distribution should you use?
- You want to test whether removing some variables from a regression model makes it worse. What null distribution should you use?
- You want to test whether the mean of `\(X\)` is equal to `\(3\)` or not, from a sample of 25 observations. What test should you use, and how should you calculate the test statistic?

---

# Hypothesis Testing in R

- By default, regression results will perform tests using the t-distribution
- Rather than switching to a normal null for big samples, it will keep using t, but with large samples that's basically normal anyway
- `feols()` will show us the t-statistic, standard error, and p-value compared to 0 (and can adjust for heteroskedasticity and clustering if you want!)
- `etable()` will default to showing standard errors, but `etable(coefstat = 'confint')` can give you confidence intervals
- Both will report "stars" - if the p-value is below a certain `\(\alpha\)`, you get stars! Be careful - the defaults in `feols()` (* = .05) are different from what's standard in economics papers (** = .05)

---

# Hypothesis Testing in R

```r
data(SLID, package = 'carData')
model <- feols(wages ~ education, data = SLID)
print(model)
```

```
## OLS estimation, Dep. Var.: wages
## Observations: 4,014
## Standard-errors: IID
##             Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept) 4.937527   0.532847  9.26631 < 2.2e-16 ***
## education   0.794603   0.038941 20.40544 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 7.49195   Adj. R2: 0.0938
```

---

# Hypothesis Testing in R

```r
etable(model)
```

```
##                              model
## Dependent Var.:              wages
##                                   
## Constant         4.938*** (0.5328)
## education       0.7946*** (0.0389)
## _______________ __________________
## S.E. type                      IID
## Observations                 4,014
## R2                         0.09403
## Adj. R2                    0.09380
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

# Hypothesis Testing in R

- We can also see confidence intervals for our coefficients with `coefplot()`

```r
coefplot(model)
```

![](data:image/png;base64,#Week_03_Slides_2_Applying_Hypothesis_Testing_files/figure-html/unnamed-chunk-11-1.png)<!-- -->
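Relatedly, as mentioned a couple of slides back, `etable()` can report the confidence intervals behind that plot in table form. A minimal call (output not shown here):

```r
# Swap the standard errors in the table for 95% confidence intervals
etable(model, coefstat = 'confint')
```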
---

# Hypothesis Testing in R

- How about tests against non-0 nulls, or F tests?
- We can calculate a Z-score ourselves and do a non-0 null that way, or we can use `wald()` from **fixest** to do F-tests comparing multiple coefficients to 0 (note this takes a "regular expression", so you can easily test many coefficients with little typing)
- Or `glht()` from **multcomp** for more complex stuff (the calls behind the output below are sketched on the final slide)

```
## Wald test, H0: nullity of education
## stat = 416.4, p-value < 2.2e-16, on 1 and 4,012 DoF, VCOV: IID.
```

```
## 
## 	 Simultaneous Tests for General Linear Hypotheses
## 
## Fit: feols(fml = wages ~ education, data = SLID)
## 
## Linear Hypotheses:
##                   Estimate Std. Error t value Pr(>|t|)
## education == 0.75  0.79460    0.03894   1.145    0.252
## (Adjusted p values reported -- single-step method)
```

```
## 
## 	 Simultaneous Tests for General Linear Hypotheses
## 
## Fit: feols(fml = wages ~ education, data = SLID)
## 
## Linear Hypotheses:
##                                  Estimate Std. Error t value Pr(>|t|)
## 3 * education + (Intercept) == 7   7.3213     0.4197   0.766    0.444
## (Adjusted p values reported -- single-step method)
```

---

# Swirl Practice

Now on to the Hypothesis Testing Swirl lesson!
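---

# Hypothesis Testing in R

For reference, the `wald()` and `glht()` output shown earlier can be produced by calls along these lines. This is a sketch rather than the original code: the `glht()` hypothesis strings are assumptions reconstructed from the printed hypotheses.

```r
library(fixest)    # for feols() and wald()
library(multcomp)  # for glht()

# F test that the education coefficient is 0
# (the second argument is a regular expression selecting coefficients)
wald(model, 'education')

# Test against a non-0 null: education == 0.75 (assumed hypothesis string)
summary(glht(model, 'education = 0.75'))

# A more complex linear combination of coefficients (assumed hypothesis string)
summary(glht(model, '3 * education + (Intercept) = 7'))
```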