class: center, middle, inverse, title-slide

.title[
# Within Variation and Fixed Effects
]
.subtitle[
## i.e. one thing to do when measurement eludes you
]
.date[
### Updated 2023-03-07
]

---

# Check-in

- So far we've been learning about how to set up, run, and interpret an ordinary least squares regression
- This is a key skill for anyone doing anything with data - even if you never run a regular ol' linear regression again, pretty much everything else in applied stats builds off of it in some way
- Another thing we've been doing is thinking about how to design and add controls to that regression to *identify* our effect of interest by closing back doors

---

# The Measurement Problem...

- And this has led us to some issues that have already popped up!
- For this approach to work, we need not only to *figure out* what we need to control for, using our diagram, but also to *actually control for it*
- A lot of the time we don't have that data!
- And thus all the skeptical comments we had about the designs we came up with

---

# A Pickle

- So obviously this is a problem, and it's not one we can reason or trick our way out of
- If we don't have the variable we need to control for, we don't have it
- ... or do we?

---

# The Rest of the Term

- Much of the rest of the term is going to be focused on *finding ways to control for stuff that we can't measure*
- Seems impossible! But it is possible, at least in some circumstances
- Today, we will be talking about *within variation* and *between variation*, and the ability to control for all *between variation* using *fixed effects*

---

# Panel Data

- We are working now in the domain of *panel data*
- Panel data is when you observe the same individual over multiple time periods
- "Individual" could be a person, or a company, or a state, or a country, etc. There are `\(N\)` individuals in the panel data
- "Time period" could be a year, a month, a day, etc.
There are `\(T\)` time periods in the data
- For now we'll assume we observe each individual the same number of times, i.e. a *balanced* panel (so we have `\(N\times T\)` observations)
- You can use this stuff with unbalanced panels too, it just gets a little more complex

---

# Panel Data

- Here's what (a few rows from) a panel data set looks like - a variable for individual (county), a variable for time (year), and then the data

<table>
 <thead>
  <tr> <th style="text-align:right;"> County </th> <th style="text-align:right;"> Year </th> <th style="text-align:right;"> CrimeRate </th> <th style="text-align:right;"> ProbofArrest </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 81 </td> <td style="text-align:right;"> 0.0398849 </td> <td style="text-align:right;"> 0.289696 </td> </tr>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 82 </td> <td style="text-align:right;"> 0.0383449 </td> <td style="text-align:right;"> 0.338111 </td> </tr>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 83 </td> <td style="text-align:right;"> 0.0303048 </td> <td style="text-align:right;"> 0.330449 </td> </tr>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 84 </td> <td style="text-align:right;"> 0.0347259 </td> <td style="text-align:right;"> 0.362525 </td> </tr>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 85 </td> <td style="text-align:right;"> 0.0365730 </td> <td style="text-align:right;"> 0.325395 </td> </tr>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 86 </td> <td style="text-align:right;"> 0.0347524 </td> <td style="text-align:right;"> 0.326062 </td> </tr>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 87 </td> <td style="text-align:right;"> 0.0356036 </td> <td style="text-align:right;"> 0.298270 </td> </tr>
  <tr> <td style="text-align:right;"> 3 </td> <td
style="text-align:right;"> 81 </td> <td style="text-align:right;"> 0.0163921 </td> <td style="text-align:right;"> 0.202899 </td> </tr>
  <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 82 </td> <td style="text-align:right;"> 0.0190651 </td> <td style="text-align:right;"> 0.162218 </td> </tr>
</tbody>
<tfoot>
<tr> <td style = 'padding: 0; border:0;' colspan='100%'><sup></sup> 9 rows out of 630. "Prob. of Arrest" is the estimated probability of being arrested when you commit a crime</td> </tr>
</tfoot>
</table>

---

# Between and Within

- Let's pick a few counties and graph this out

![](data:image/png;base64,#Week_06_1_Within_Variation_and_Fixed_Effects_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

# Between and Within

- If we look at the overall variation, just pretending this is all together, we get this

![](data:image/png;base64,#Week_06_1_Within_Variation_and_Fixed_Effects_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

# Between and Within

- BETWEEN variation is what we get if we look at the relationship between the *means of each county*

![](data:image/png;base64,#Week_06_1_Within_Variation_and_Fixed_Effects_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

# Between and Within

- And I mean it! Only look at those means! The individual year-to-year variation within county doesn't matter.

![](data:image/png;base64,#Week_06_1_Within_Variation_and_Fixed_Effects_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

# Between and Within

- Within variation goes the other way - it treats those orange crosses as their own individualized sets of axes and looks at variation *within* county from year-to-year only!
- We basically slide the crosses over on top of each other and then analyze *that* data

![](data:image/png;base64,#Week_06_1_Within_Variation_and_Fixed_Effects_files/figure-html/unnamed-chunk-6-1.gif)<!-- -->

---

# Between and Within

- We can clearly see that *between counties* there's a strong positive relationship
- But if you look *within* a given county, the relationship isn't that strong, and actually seems to be negative
- Which would make sense - if you think your chances of getting arrested are high, that should be a deterrent to crime
- But what are we actually doing here? Let's think about the causal diagram / data-generating process!
- What goes into the probability of arrest and the crime rate? Lots of stuff!

---

# The Crime Rate

- "LocalStuff" is just all the things unique to that area
- "LawAndOrder" is how committed local politicians are to "Law and Order Politics"

![](data:image/png;base64,#Week_06_1_Within_Variation_and_Fixed_Effects_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

---

# Between and Within

- For each of these variables we can ask if they vary *between groups* and/or *within groups*
- LocalStuff is all the stuff unique to that county - geography, landmarks, the quality of the schools. Almost by definition this only varies *between groups*. It's not like the things that make your county unique are different each year (or at least not very different)
- Whether the county has LawAndOrder and how many CivilRights you have might change a bit year to year, but in general, political climates like that change pretty slowly.
At a bit of a stretch we can call that something that only varies between groups too
- Police budgets (and thus number of police on the streets) and Poverty (which varies with the economy) vary not only between counties but also *within* counties from year to year
- Variables with between variation only (by our assumption): LocalStuff, LawAndOrder, CivilRights
- Variables with both between and within variation: Police, Poverty

---

# Between and Within

- Let's simplify our graph!
- Some of the variables only vary *between counties*
- So, we can replace those variables on the graph with the variable County
- Right? That's where all the variation is anyway

---

# The Crime Rate

- "LocalStuff" is just all the things unique to that area
- "LawAndOrder" is how committed local politicians are to "Law and Order Politics"

![](data:image/png;base64,#Week_06_1_Within_Variation_and_Fixed_Effects_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---

# Between and Within

- Now the task of identifying ProbArrest `\(\rightarrow\)` CrimeRate becomes much simpler!
- If we control for County, that will close a lot of back doors for us
- (based on the diagram, all we need to control for is County and Poverty!)
- Conveniently, we can control for County just like it was any other variable!
- And when we do, we automatically *control for all variables that only have between variation*, whatever they are, even if we can't measure them directly or didn't think about them
- *All that's left is the within variation*

---

# Concept Checks

- For each of these variables, would we expect them to have within variation, between variation, or both?
- (Individual = person) How a child's height changes as they age.
- (Individual = person) In a data set tracking many people over many years, the variation in the number of children a person has over their lifetime.
- (Individual = city) Overall, Paris, France has more restaurants than Paris, Texas.
- (Individual = genre) The average pop music album sells more copies than the average jazz album.
- (Individual = genre) Miles Davis' *Kind of Blue* sold very well *for a jazz album*.
- (Individual = genre) Michael Jackson's *Thriller*, a pop album, sold many more copies than *Kind of Blue*, a jazz album.

---

# Removing Between Variation

- Okay so that's the concept
- Remove all the between variation so that all that's left is within variation
- And in the process control for any variables that are made up only of between variation
- How can we actually do this? And what's really going on?
- Let's first talk about the regression model itself that this implies
- Then let's actually do the thing. There are two main ways: *de-meaning* and *binary variables* (they give the same result, for balanced panels anyway)

---

# Estimation vs. Design

- To be clear, this is *exactly 0% different* from what we've done before in terms of controlling for stuff
- And in fact we're about to do the *exact same thing we did before* by just adding a categorical control variable for `county` or whatever
- (and in fact the "within" thing holds with other categorical controls - a categorical control for education isolates variation "within education levels")
- The difference is the *reason we're doing it*. It's fixed effects because *a categorical control for individual controls for a lot of stuff*, and we think it closes a *lot* of back doors for us, not just one, and not just ones we can measure

---

# The Model

The `\(it\)` subscript says this variable varies over individual `\(i\)` and time `\(t\)`

`$$Y_{it} = \beta_0 + \beta_1 X_{it} + \varepsilon_{it}$$`

- What if there are individual-level components in the error term causing omitted variable bias?
- `\(X_{it}\)` is related to LocalStuff which is not in the model and thus in the error term!
- Regular ol' omitted variable bias.
If we don't adjust for the individual effect, we get a biased `\(\hat{\beta}_1\)`
- (this bias is called "pooling bias" although it's really just a form of omitted variable bias)
- We really have this then:

`$$Y_{it} = \beta_0 + \beta_1 X_{it} + (\alpha_i + \varepsilon_{it})$$`

---

# De-meaning

- Let's do de-meaning first, since it's most closely and obviously related to the "removing between variation" explanation we've been going for
- The process here is simple!

1. For each variable `\(X_{it}\)`, `\(Y_{it}\)`, etc., get the mean value of that variable for each individual `\(\bar{X}_i, \bar{Y}_i\)`
2. Subtract out that mean to get residuals `\((X_{it} - \bar{X}_i), (Y_{it} - \bar{Y}_i)\)`
3. Work with those residuals

- That's it!

---

# How does this work?

- That `\(\alpha_i\)` term gets absorbed
- The residuals are, by construction, no longer related to the `\(\alpha_i\)`, so it no longer goes in the error term!

`$$(Y_{it} - \bar{Y}_i) = \beta_0 + \beta_1(X_{it} - \bar{X}_i) + \varepsilon_{it}$$`

---

# Let's do it!

- We can use `group_by` to get means-within-groups and subtract them out

```r
library(tidyverse)  # for %>%, filter(), group_by(), and mutate()

data(crime4, package = 'wooldridge')
crime4 <- crime4 %>%
  # Filter to the data points from our graph
  filter(county %in% c(1,3,7, 23),
         prbarr < .5) %>%
  group_by(county) %>%
  # Mean of each variable within each county
  mutate(mean_crime = mean(crmrte),
         mean_prob = mean(prbarr)) %>%
  # Subtract the county mean to leave only within variation
  mutate(demeaned_crime = crmrte - mean_crime,
         demeaned_prob = prbarr - mean_prob)
```

---

# And Regress!

```r
library(fixest)  # for feols() and etable()

orig_data <- feols(crmrte ~ prbarr, data = crime4)
de_mean <- feols(demeaned_crime ~ demeaned_prob, data = crime4)
etable(orig_data, de_mean)
```

```
##                         orig_data           de_mean
## Dependent Var.:            crmrte    demeaned_crime
##
## Constant         0.0118* (0.0050) 1.41e-18 (0.0004)
## prbarr          0.0486** (0.0167)
## demeaned_prob                     -0.0305* (0.0117)
## _______________ _________________ _________________
## S.E. type                     IID               IID
## Observations                   27                27
## R2                        0.25308           0.21445
## Adj. R2                   0.22321           0.18303
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1
```

---

# Interpreting a Within Relationship

- How can we interpret that slope of `-0.03`?
- This is all *within variation* so our interpretation must be *within-county*
- So, "comparing a county in year A where its arrest probability is 1 (100 percentage points) higher than it is in year B, we expect the number of crimes per person to drop by .03"
- Or if we think we've causally identified it (and want to work on a more realistic scale), "raising the arrest probability by 1 percentage point in a county reduces the number of crimes per person in that county by .0003".
- We're basically "controlling for county" (and will do that explicitly in a moment)
- So your interpretation should think of it in that way - *holding county constant* i.e. *comparing two observations with the same value of county* i.e. *comparing a county to itself at a different point in time*

---

# Concept Checks

- Why does subtracting the within-individual mean of each variable "control for individual"?
- In a sentence, interpret the slope coefficient in the estimated model `\((Y_{it} - \bar{Y}_i) = 2 + 3(X_{it} - \bar{X}_i)\)` where `\(Y\)` is "blood pressure", `\(X\)` is "stress at work", and `\(i\)` is an individual person

---

# The Least Squares Dummy Variable Approach

- De-meaning the data isn't the only way to do it!
- You can also use the least squares dummy variable method ("dummy variable" is another term for "binary variable")
- We just treat "individual" like the categorical variable it is and add it as a control! Again, the regression approach is exactly the same as with any categorical control, but the *research design* reason for doing it is different

---

# Let's do it!
```r
lsdv <- feols(crmrte ~ prbarr + factor(county), data = crime4)
etable(orig_data, de_mean, lsdv, keep = c('prbarr', 'demeaned_prob'))
```

```
##                         orig_data           de_mean              lsdv
## Dependent Var.:            crmrte    demeaned_crime            crmrte
##
## prbarr          0.0486** (0.0167)                   -0.0305* (0.0124)
## demeaned_prob                     -0.0305* (0.0117)
## _______________ _________________ _________________ _________________
## S.E. type                     IID               IID               IID
## Observations                   27                27                27
## R2                        0.25308           0.21445           0.94114
## Adj. R2                   0.22321           0.18303           0.93044
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

# The same!

- The slope is the same, as it should be (the standard errors differ a little because de-meaning by hand doesn't account for the degrees of freedom used up by estimating the county means)
- Except for that `\(R^2\)`
- What is that "within R2"?
- Because de-meaning takes out the part explained by the fixed effects ( `\(\alpha_i\)` ) *before* running the regression, while LSDV does it *in* the regression
- So the .94 is the portion of `crmrte` explained by `prbarr` *and* `county`, whereas the .21 is the "within `\(R^2\)`" - the portion of *the within variation* that's explained by `prbarr`
- Neither is wrong (and the .94 isn't "better"), they're just measuring different things

---

# Why LSDV?

- A benefit of the LSDV approach is that it calculates the fixed effects `\(\alpha_i\)` for you
- We left those out of the table with the `keep` argument of `etable` (we rarely want them) but here they are:

```
## OLS estimation, Dep. Var.: crmrte
## Observations: 27
## Standard-errors: IID
##                   Estimate Std. Error   t value   Pr(>|t|)
## (Intercept)       0.045631   0.004116  11.08640 1.7906e-10 ***
## prbarr           -0.030491   0.012442  -2.45068 2.2674e-02 *
## factor(county)3  -0.025308   0.002165 -11.68996 6.5614e-11 ***
## factor(county)7  -0.009870   0.001418  -6.96313 5.4542e-07 ***
## factor(county)23 -0.008587   0.001258  -6.82651 7.3887e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.001933 Adj.
R2: 0.930441
```

- Interpretation is exactly the same as with a categorical variable - we have an omitted county, and these show the difference relative to that omitted county

---

# Why LSDV?

- This also makes clear another element of what's happening! Just like with a categorical var, the line is moving *up and down* to meet the counties
- Graphically, de-meaning moves all the points together in the middle to draw a line, while LSDV moves the line up and down to meet the points

![](data:image/png;base64,#Week_06_1_Within_Variation_and_Fixed_Effects_files/figure-html/unnamed-chunk-13-1.png)<!-- -->

---

# Why Not LSDV?

- LSDV is computationally expensive
- If there are a lot of individuals, or big data, or if you have many sets of fixed effects (yes you can do more than just individual - we'll get to that next time!), it can be very slow
- Most professionally made fixed-effects commands use de-meaning, but then adjust the standard errors properly
- (They also leave the fixed effects coefficients off the regression table by default)

---

# Going Professional

- Applied researchers rarely do either of these, and rather will use a command specifically designed for fixed effects
- Like good ol' `feols()`! (what did you think the "fe" part stood for?)
- Note there are also functions in **fixest** that do fixed effects in non-linear models like logit, probit, or Poisson regression (`feglm()` and `fepois()`)
- Plus, `feols()` clusters the standard errors by the first fixed effect by default, which we usually want!

---

# Going Professional

```r
library(fixest)

pro <- feols(crmrte ~ prbarr | county, data = crime4)
etable(de_mean, pro)
```

```
##                           de_mean               pro
## Dependent Var.:    demeaned_crime            crmrte
##
## Constant        1.41e-18 (0.0004)
## demeaned_prob   -0.0305* (0.0117)
## prbarr                            -0.0305* (0.0064)
## Fixed-Effects:  ----------------- -----------------
## county                         No               Yes
## _______________ _________________ _________________
## S.E.
type                     IID        by: county
## Observations                   27                27
## R2                        0.21445           0.94114
## Within R2                      --           0.21445
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

# Limits to Fixed Effects

- Okay! At this point we have the concept behind fixed effects, can execute them, and know what they're good for
- What aren't they good for?

1. They don't control for anything that has within variation
2. They control away *everything* that's between-only, so we can't see the effect of anything that's between-only ("effect of geography on crime rate?" Nope!)
3. Anything with only a *little* within variation will have most of its variation washed out too ("effect of population density on crime rate?" probably not)
4. The estimate pays the most attention to individuals with *lots of variation in treatment*

- 2 and 3 can be addressed by using "random effects" instead but we aren't covering that in this class (see the chapter on Fixed Effects in *The Effect* for more)

---

# Concept Checks

- Why can't we use individual-person fixed effects to study the impact of race on traffic stops?
- The within `\(R^2\)` is .3, and the overall `\(R^2\)` is .5. Interpret these two numbers in sentences
- In a sentence, interpret the slope coefficient in the estimated model `\((Y_{it} - \bar{Y}_i) = 1 + .5(X_{it} - \bar{X}_i)\)` where `\(Y\)` is "school funding per child", `\(X\)` is "population growth", and `\(i\)` is city

---

# Swirl

- Open up the Fixed Effects Swirl and let's do it!
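
---

# Checking the Equivalence Yourself

- A minimal sketch, separate from the crime analysis: simulated data where an unmeasured individual effect opens a back door, so we can watch pooled OLS go wrong and de-meaning and LSDV agree
- Everything here is an assumption made up for illustration (the variable names, the 50 individuals and 8 periods, the "true" within slope of -0.5), and it sticks to base R (`lm()` and `ave()`) rather than **fixest**

```r
# Simulated balanced panel; every name and number below is hypothetical
set.seed(1234)
n_ind  <- 50   # N individuals
n_time <- 8    # T time periods
id     <- rep(seq_len(n_ind), each = n_time)

# alpha: an unmeasured individual effect with between variation only
alpha <- rep(rnorm(n_ind, sd = 2), each = n_time)

# x is related to alpha (the back door); the true within slope is -0.5
x <- alpha + rnorm(n_ind * n_time)
y <- 1 - 0.5 * x + 2 * alpha + rnorm(n_ind * n_time)

# Pooled OLS leaves alpha in the error term
pooled <- coef(lm(y ~ x))[["x"]]

# De-meaning: ave() returns each individual's own mean, which we subtract out
x_within <- x - ave(x, id)
y_within <- y - ave(y, id)
demeaned <- coef(lm(y_within ~ x_within))[["x_within"]]

# LSDV: add a binary variable for each individual
lsdv <- coef(lm(y ~ x + factor(id)))[["x"]]

print(c(pooled = pooled, demeaned = demeaned, lsdv = lsdv))
```

- The pooled slope should come out positive even though the within effect we built in is negative, while the de-meaned and LSDV slopes should match each other and land near -0.5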