class: center, middle, inverse, title-slide

# Practice
### Updated 2022-03-16

---

# Check-in

- In the previous lecture we covered a lot of stuff about how the right-hand-side can be used
- But honestly for a lot of that, it only makes sense once you start practicing with it
- That's what we will do today!
- These examples will also give us some brief recaps of prior material in preparation for the midterm
- We will first work with a data set of stops by the police in Minneapolis in 2017. What effect does race have on whether someone is searched by police?

---

# Police Stops

<table class="table" style="margin-left: auto; margin-right: auto;">
<caption>MplsStops</caption>
<thead>
<tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Class </th> <th style="text-align:left;"> Values </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> problem </td> <td style="text-align:left;"> factor </td> <td style="text-align:left;"> 'suspicious' 'traffic' </td> </tr>
<tr> <td style="text-align:left;"> citationIssued </td> <td style="text-align:left;"> factor </td> <td style="text-align:left;"> 'NO' 'YES' </td> </tr>
<tr> <td style="text-align:left;"> personSearch </td> <td style="text-align:left;"> factor </td> <td style="text-align:left;"> 'NO' 'YES' </td> </tr>
<tr> <td style="text-align:left;"> vehicleSearch </td> <td style="text-align:left;"> factor </td> <td style="text-align:left;"> 'NO' 'YES' </td> </tr>
<tr> <td style="text-align:left;"> preRace </td> <td style="text-align:left;"> factor </td> <td style="text-align:left;"> 'Black' 'White' 'Unknown' 'East African' 'Latino' and more </td> </tr>
<tr> <td style="text-align:left;"> race </td> <td style="text-align:left;"> factor </td> <td style="text-align:left;"> 'Black' 'White' 'Unknown' 'East African' 'Latino' and more </td> </tr>
<tr> <td style="text-align:left;"> gender </td> <td style="text-align:left;"> factor </td> <td style="text-align:left;"> 'Female' 'Male' 'Unknown' </td> </tr>
<tr>
<td style="text-align:left;"> lat </td> <td style="text-align:left;"> numeric </td> <td style="text-align:left;"> Num: 44.89 to 45.051 </td> </tr>
<tr> <td style="text-align:left;"> long </td> <td style="text-align:left;"> numeric </td> <td style="text-align:left;"> Num: -93.329 to -93.199 </td> </tr>
<tr> <td style="text-align:left;"> policePrecinct </td> <td style="text-align:left;"> integer </td> <td style="text-align:left;"> Num: 1 to 5 </td> </tr>
<tr> <td style="text-align:left;"> neighborhood </td> <td style="text-align:left;"> factor </td> <td style="text-align:left;"> 'Armatage' 'Audubon Park' 'Bancroft' 'Beltrami' 'Bottineau' and more </td> </tr>
</tbody>
</table>

---

# Police Stops

- Let's say our research question is "is a police officer more likely to do a bodily search (`personSearch`) on someone who is Black than an otherwise similar person who is non-Black?"

Questions:

1. What is the causal effect we are trying to identify?
2. What do we mean by "an otherwise similar person" and why might that be part of the research question?
3. What might the causal diagram look like? (Note: this data only includes people *who were stopped* - that might be relevant!)
4. What needs to be controlled for or not controlled for? Can we identify the effect?
5. What *kinds of variables* are we working with?
6. What should the regression look like?
7. What is the result? (what tests should we look at?)

---

# Discuss

- (1) What is the causal effect we are trying to identify?

---

# 1. What is the Causal Effect?

- We want to know if race affects whether you will be searched.
- What does it mean exactly for race to *affect* something? It's not like we can reach in and change someone's race
- Really we're asking if the *officer's decision to search* looks at race or just at other factors
- So the treatment variable really is "a Black person *is stopped* vs. a non-Black person *is stopped*", not "this particular person is Black vs.
non-Black"

---

# Discuss

- (2) What do we mean by "an otherwise similar person" and why might that be part of the research question?

---

# 2. "An Otherwise Similar Person"

- What does the phrase "otherwise similar" imply here?
- It means that we want to know if two people *who are the same other than their race* would be treated differently
- It implies that we're looking for a causal effect of the race of the stopped person

---

# Discuss

- (3) What might the causal diagram look like? (Note: this data only includes people *who were stopped* - that might be relevant!)

---

# 3. The Causal Diagram

- This is a bit of a tricky one, because "is stopped by the police" should be a part of this!
- (but what if that itself is affected by the race of the person the police officer sees? Hmm...)
- Other things that might be relevant - location, what the person is doing (which could include lots of things like the kind of car they're driving, or committing a traffic violation)

---

# 3. The Causal Diagram

A simple version (Black means "the person an officer sees and has the opportunity to stop is Black"):

![](data:image/png;base64,#Week_05_2_Functional_Form_Practice_and_Midterm_Prep_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

# Discuss

- (4) What needs to be controlled for or not controlled for? Can we identify the effect?

---

# 4. Controls

- We want to isolate Black `\(\rightarrow\)` Searched and Black `\(\rightarrow\)` Stopped `\(\rightarrow\)` Searched. Other paths include:
- Black `\(\leftarrow\)` Location `\(\rightarrow\)` Searched
- Black `\(\leftarrow\)` Location `\(\rightarrow\)` Stopped `\(\rightarrow\)` Searched
- Black `\(\leftarrow\)` Location `\(\rightarrow\)` Stopped `\(\leftarrow\)` WhatDoing `\(\rightarrow\)` Searched
- Black `\(\rightarrow\)` Stopped `\(\leftarrow\)` Location `\(\rightarrow\)` Searched
- Black `\(\rightarrow\)` Stopped `\(\leftarrow\)` WhatDoing `\(\rightarrow\)` Searched

---

# 4. Controls

- Uh-oh.
Our data set only includes people who were stopped. That means we're controlling for it!
- So we're necessarily shutting down the Black `\(\rightarrow\)` Stopped `\(\rightarrow\)` Searched path
- And also opening up paths on which Stopped is a collider
- So we'd need to control for Location and also WhatDoing, and even then we are necessarily going to have to ignore the Black `\(\rightarrow\)` Stopped `\(\rightarrow\)` Searched path
- (Perhaps we could include non-stopped people by looking at neighborhood demographics - the `MplsDemo` data! But that's a bit too much work for right now)

---

# Discuss

- (5) What *kinds of variables* are we working with?
- (do the variables we have cover the kinds of things we need to control?)

---

# 5. Kinds of Variables

<table class="table" style="margin-left: auto; margin-right: auto;">
<caption>MplsStops</caption>
<thead>
<tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Class </th> <th style="text-align:left;"> Values </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> problem </td> <td style="text-align:left;"> factor </td> <td style="text-align:left;"> 'suspicious' 'traffic' </td> </tr>
<tr> <td style="text-align:left;"> citationIssued </td> <td style="text-align:left;"> factor </td> <td style="text-align:left;"> 'NO' 'YES' </td> </tr>
<tr> <td style="text-align:left;"> personSearch </td> <td style="text-align:left;"> factor </td> <td style="text-align:left;"> 'NO' 'YES' </td> </tr>
<tr> <td style="text-align:left;"> vehicleSearch </td> <td style="text-align:left;"> factor </td> <td style="text-align:left;"> 'NO' 'YES' </td> </tr>
<tr> <td style="text-align:left;"> preRace </td> <td style="text-align:left;"> factor </td> <td style="text-align:left;"> 'Black' 'White' 'Unknown' 'East African' 'Latino' and more </td> </tr>
<tr> <td style="text-align:left;"> race </td> <td style="text-align:left;"> factor </td> <td style="text-align:left;"> 'Black' 'White' 'Unknown' 'East African' 'Latino' and
more </td> </tr>
<tr> <td style="text-align:left;"> gender </td> <td style="text-align:left;"> factor </td> <td style="text-align:left;"> 'Female' 'Male' 'Unknown' </td> </tr>
<tr> <td style="text-align:left;"> lat </td> <td style="text-align:left;"> numeric </td> <td style="text-align:left;"> Num: 44.89 to 45.051 </td> </tr>
<tr> <td style="text-align:left;"> long </td> <td style="text-align:left;"> numeric </td> <td style="text-align:left;"> Num: -93.329 to -93.199 </td> </tr>
<tr> <td style="text-align:left;"> policePrecinct </td> <td style="text-align:left;"> integer </td> <td style="text-align:left;"> Num: 1 to 5 </td> </tr>
<tr> <td style="text-align:left;"> neighborhood </td> <td style="text-align:left;"> factor </td> <td style="text-align:left;"> 'Armatage' 'Audubon Park' 'Bancroft' 'Beltrami' 'Bottineau' and more </td> </tr>
</tbody>
</table>

---

# 5. Kinds of Variables

- Pretty much everything is binary or categorical, so we're going to need to pull out our binary-variable-interpretation skills!
- Looking at the variables, we might want to ask whether the racial effect varies by whether it's an on-foot or in-car stop (which we don't have data for!), by the kind of `problem`, or perhaps by `gender`
- Those would be interaction effects - we need to be careful about digging around for interaction effects, but we can think about this
- (also maybe they should have been on our diagram?)
- Can we control for what we need to?
- `neighborhood` seems like a good control for location (or perhaps `lat` and `long`)
- WhatDoing we're kind of hopeless on, though - we have `problem` but not what the *non*-stopped people were doing
- Note that the search outcomes are binary, so the interpretation will be changes in the probability of search

---

# Discuss

- (6) What should the regression look like?
- What kinds of functional form checks should we do to see if we want polynomials or logs?
- How should we construct the regression?

---

# 6.
The Regression

- We might want to regress `I(personSearch == 'YES' | vehicleSearch == 'YES')` on `I(race == 'Black')`, `neighborhood`, and `problem`
- Let's first look at a basic comparison of means just to see what we're looking at.
- Keep in mind that the result will still be biased! We haven't really controlled for WhatDoing, and we've shut off the Black `\(\rightarrow\)` Stopped `\(\rightarrow\)` Searched pathway
- (plus, should we have gone back and put Gender in there and figured out whether to control for that too?)
- These are all binary variables, so we can't really do logs or polynomials of them
- Unfortunately, going through the work of laying out what regression we should run doesn't always give us a regression that's feasible! Often what we can run still has problems
- We can run another analysis with an interaction between Black and Problem

---

# 6. The Regression

```r
library(dplyr)
library(fixest)

# Note: despite its name, `Stopped` is TRUE when the person or vehicle was *searched*
MplsStops <- MplsStops %>%
  mutate(Stopped = personSearch == 'YES' | vehicleSearch == 'YES',
         Black = race == 'Black')
# m1: raw comparison of means; m2: adds controls; m3: adds the Black x problem interaction
m1 <- feols(Stopped ~ Black, data = MplsStops)
m2 <- feols(Stopped ~ Black + problem + neighborhood, data = MplsStops)
m3 <- feols(Stopped ~ Black*problem + neighborhood, data = MplsStops)
```

---

# Discuss

- (7) What is the result? (what tests should we look at?)
- Before we look at the result, what results might we see and how would we interpret them?
- What checks would we want to do?
- What tests should we run?
- How should we think about whether we want to use robust standard errors?

---

# 7. The Result (neighb. coefs omitted)

- Interpret these coefficients (keep in mind Neighborhood is not shown)!
- Do we think that the coefficient on Black is likely to be biased up or down?

```
##                                            m1                  m2
## Dependent Var.:                       Stopped             Stopped
##
## BlackTRUE                  0.1357*** (0.0034)  0.1293*** (0.0036)
## problemtraffic                                -0.0829*** (0.0034)
## BlackTRUE x problemtraffic
## __________________________ __________________ ___________________
## S.E.
type                                 IID                 IID
## Observations                           43,699              43,699
## R2                                    0.03617             0.06981
## Adj. R2                               0.03614             0.06793
##                                             m3
## Dependent Var.:                        Stopped
##
## BlackTRUE                   0.2030*** (0.0053)
## problemtraffic             -0.0394*** (0.0041)
## BlackTRUE x problemtraffic -0.1287*** (0.0069)
## __________________________ ___________________
## S.E. type                                  IID
## Observations                            43,699
## R2                                     0.07718
## Adj. R2                                0.07530
```

---

# R Squared

- How about that R squared? Should we be concerned? What does it mean exactly?

```r
fitstat(m3, 'r2')
```

```
## R2: 0.07718
```

---

# The Result

- How is that heteroskedasticity looking? Let's look at Model 2

```r
MplsStops %>%
  mutate(resid = resid(m2)) %>%
  group_by(Black) %>%
  summarize(variance_of_resid = var(resid))
```

```
## # A tibble: 2 x 2
##   Black variance_of_resid
##   <lgl>             <dbl>
## 1 FALSE            0.0768
## 2 TRUE             0.165
```

- Pretty different! Let's use robust SEs (note we can do these after the fact in `etable()` directly)
- (tip: with a binary dependent variable you pretty much ALWAYS want robust SEs)

---

# The Result with Robust SEs

```
##                                            m1                  m2
## Dependent Var.:                       Stopped             Stopped
##
## BlackTRUE                  0.1357*** (0.0038)  0.1293*** (0.0039)
## problemtraffic                                -0.0829*** (0.0036)
## BlackTRUE x problemtraffic
## __________________________ __________________ ___________________
## S.E. type                  Heteroskedas.-rob. Heteroskedast.-rob.
## Observations                           43,699              43,699
## R2                                    0.03617             0.06981
## Adj. R2                               0.03614             0.06793
##                                             m3
## Dependent Var.:                        Stopped
##
## BlackTRUE                   0.2030*** (0.0069)
## problemtraffic             -0.0394*** (0.0036)
## BlackTRUE x problemtraffic -0.1287*** (0.0081)
## __________________________ ___________________
## S.E. type                  Heteroskedast.-rob.
## Observations                            43,699
## R2                                     0.07718
## Adj. R2                                0.07530
```

---

# Other Tests

- We see that officers choose to search stopped Black people much more often than stopped non-Black people, controlling for neighborhood and the type of problem - about 13 percentage points more!
- It also seems like the effect is much stronger for stops the police reported as "suspicious" (20 percentage points) rather than "traffic" stops (20 - 13 = 7 percentage points). The difference (13 percentage points) is statistically significant at the .1% level
- Is the effect of Black still significant for Traffic stops? What test would we run?

---

# Other Tests

```r
library(multcomp)
# Effect of Black for traffic stops = main effect + interaction term
glht(m3, 'BlackTRUE + BlackTRUE:problemtraffic = 0') %>% summary()
```

```
##
##   Simultaneous Tests for General Linear Hypotheses
##
## Fit: feols(fml = Stopped ~ Black * problem + neighborhood, data = MplsStops)
##
## Linear Hypotheses:
##                                           Estimate Std. Error t value Pr(>|t|)
## BlackTRUE + BlackTRUE:problemtraffic == 0 0.074229   0.004621   16.06   <2e-16
##
## BlackTRUE + BlackTRUE:problemtraffic == 0 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
```

- Yep, still significant at the .1% level! A 7 percentage point gap is pretty big, too.

---

# Now You

- Now we'll have you walk through a similar exercise on your own, which should take you all the way through the course up to this point!
- You know how when you donate to a charity, they just send you nonstop mail afterwards? Why do they do that? Does it actually increase future contributions, or does it annoy people away?
- What is the effect of frequency of charity mailings on the size of donations?
- We'll use the `charity` data set from the **wooldridge** package, which looks at some Dutch donations data: `library(wooldridge); data(charity)` (after `install.packages('wooldridge')` if necessary)
- The next slide will have the questions to answer, and the following slides will have useful information for answering them (you can also do some coding on your own)

---

# Questions

1. What is the causal effect we are trying to identify?
2. Explain in words why we might have endogeneity
3. What might the causal diagram look like?
Remember that the diagram might well include variables not in the data, and that this diagram only needs to apply to people *who have already donated once in the past*
4. What needs to be controlled for or not controlled for? Can we identify the effect?
5. What *kinds of variables* are we working with?
6. What should the regression look like?
7. What is the result? (what tests should we look at?)
8. What might make us still skeptical of our result?
9. Explain in words how *sampling variation* affects our results. How much uncertainty is there in our results?

---

# Variable Descriptions (from `help(charity)`)

- respond: = 1 if responded with gift
- gift: amount of gift, Dutch guilders
- resplast: = 1 if responded to most recent mailing
- weekslast: number of weeks since last response
- propresp: response rate to mailings
- mailsyear: number of mailings per year
- giftlast: amount of most recent gift
- avggift: average of past gifts

---

# Variable Content

<table class="table" style="margin-left: auto; margin-right: auto;">
<caption>charity</caption>
<thead>
<tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Class </th> <th style="text-align:left;"> Values </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> respond </td> <td style="text-align:left;"> integer </td> <td style="text-align:left;"> Num: 0 to 1 </td> </tr>
<tr> <td style="text-align:left;"> gift </td> <td style="text-align:left;"> integer </td> <td style="text-align:left;"> Num: 0 to 250 </td> </tr>
<tr> <td style="text-align:left;"> resplast </td> <td style="text-align:left;"> integer </td> <td style="text-align:left;"> Num: 0 to 1 </td> </tr>
<tr> <td style="text-align:left;"> weekslast </td> <td style="text-align:left;"> numeric </td> <td style="text-align:left;"> Num: 13.143 to 195 </td> </tr>
<tr> <td style="text-align:left;"> propresp </td> <td style="text-align:left;"> numeric </td> <td style="text-align:left;"> Num: 0.091 to 1 </td> </tr>
<tr> <td
style="text-align:left;"> mailsyear </td> <td style="text-align:left;"> numeric </td> <td style="text-align:left;"> Num: 0.25 to 3.5 </td> </tr>
<tr> <td style="text-align:left;"> giftlast </td> <td style="text-align:left;"> integer </td> <td style="text-align:left;"> Num: 1 to 10000 </td> </tr>
<tr> <td style="text-align:left;"> avggift </td> <td style="text-align:left;"> numeric </td> <td style="text-align:left;"> Num: 1 to 5005 </td> </tr>
</tbody>
</table>

---

# Raw Data with mean-by-value

![](data:image/png;base64,#Week_05_2_Functional_Form_Practice_and_Midterm_Prep_files/figure-html/unnamed-chunk-11-1.png)<!-- -->

---

# Raw Data with mean-by-value, logged gift+1

- Remember, `\(\log(x+1)\)` is a hack
- (why is that mean so low?)

![](data:image/png;base64,#Week_05_2_Functional_Form_Practice_and_Midterm_Prep_files/figure-html/unnamed-chunk-12-1.png)<!-- -->

---

# Regressions

```
##                                m1                 m2                  m3
## Dependent Var.:              gift               gift                gift
##
## (Intercept)      2.014** (0.7395) -4.552*** (0.8030)    -0.4199 (0.7036)
## mailsyear       2.650*** (0.3431)  2.166*** (0.3319)   1.733*** (0.3253)
## propresp                           15.36*** (0.8745)
## giftlast                          0.0059*** (0.0014) -0.2525*** (0.0112)
## avggift                                               0.5096*** (0.0220)
## _______________ _________________ __________________ ___________________
## S.E. type                     IID                IID                 IID
## Observations                4,268              4,268               4,268
## R2                        0.01379            0.08336             0.12685
## Adj. R2                   0.01356            0.08271             0.12624
```

---

# Residual Plot from Model 3

![](data:image/png;base64,#Week_05_2_Functional_Form_Practice_and_Midterm_Prep_files/figure-html/unnamed-chunk-14-1.png)<!-- -->
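
---

# Residual Plot: A Code Sketch

- If you want to build a residual plot like the one for Model 3 yourself, here is one rough way to do it
- This is only a sketch. The formula for Model 3 is an assumption, guessed from which rows are filled in for `m3` in the regression table, and the name `m3_sketch` is made up for this example
- Plotting residuals against fitted values is one common choice; the figure on the previous slide may use a different x-axis

```r
library(wooldridge)  # provides the charity data
library(fixest)      # feols()
library(ggplot2)

data(charity)

# ASSUMPTION: Model 3 is gift ~ mailsyear + giftlast + avggift
# (inferred from the regression table, not confirmed by the slides)
m3_sketch <- feols(gift ~ mailsyear + giftlast + avggift, data = charity)

# Pair each observation's fitted value with its residual
resid_df <- data.frame(fitted = fitted(m3_sketch),
                       resid  = resid(m3_sketch))

# Residuals against fitted values; a fan or funnel shape would
# suggest heteroskedasticity, and a curve would suggest functional form trouble
ggplot(resid_df, aes(x = fitted, y = resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = 'dashed') +
  labs(x = 'Fitted gift', y = 'Residual')
```

- Things to look for: does the spread of the residuals change across the plot, and do a few very large gifts dominate the picture?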