class: center, middle, inverse, title-slide

# Ordinary Least Squares
## Part 1: What It Is
### Updated 2022-02-23

---

# What is Regression?

- In statistics, regression is the practice of *line-fitting*
- We want to *use one variable to predict another*
- Let's say using `\(X\)` to predict `\(Y\)`
- We'd refer to `\(X\)` as the "independent variable" and `\(Y\)` as the "dependent variable" (dependent on `\(X\)`, that is)
- Regression is the idea that we should characterize the relationship between `\(X\)` and `\(Y\)` as a *line*, and use that line to predict `\(Y\)`

---

# `\(X\)` and `\(Y\)`

- Here we have `\(X\)` and `\(Y\)`

![](Week_02_Slides_1_What_is_Regression_files/figure-html/unnamed-chunk-1-1.png)<!-- -->

---

# `\(X\)` and `\(Y\)`

- I have an `\(X\)` value of 2.5 and want to predict what `\(Y\)` will be. What can I do?

![](Week_02_Slides_1_What_is_Regression_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

# `\(X\)` and `\(Y\)`

- I can't just say "predict whatever values of `\(Y\)` we see for `\(X = 2.5\)`" because there are multiple of those!
- Plus, what if we want to predict for a value we DON'T have any actual observations of, like `\(X = 4.3\)`?

![](Week_02_Slides_1_What_is_Regression_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

# Data is Granular

- If I try to fit *every point*, I'll get a mess that won't really tell me the relationship between `\(X\)` and `\(Y\)`
- So, we *simplify* the relationship into a *shape*: a line! The line smooths out those three points around 2.5 and fills in that gap around 4.3

![](Week_02_Slides_1_What_is_Regression_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

# Isn't This Worse?

- By adding a line, we are necessarily *simplifying* our presentation of the data. We're tossing out information!
- Our prediction of the *data we have* will be less accurate than if we just make predictions point-by-point
- However, we'll do a better job predicting *other* data (avoiding "overfitting")
- And, since a *shape* is something we can interpret (unlike a long list of point-by-point predictions), the line will do a better job of telling us about the *true underlying relationship*

---

# The Line Does a Few Things:

- We can get a *prediction* of `\(Y\)` for a given value of `\(X\)` (if we follow `\(X = 2.5\)` up to our line, we get `\(Y = 7.6\)`)
- We see the *relationship*: the line slopes up, telling us that "more `\(X\)` means more `\(Y\)` too!"

![](Week_02_Slides_1_What_is_Regression_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

# Lines

- That line we get is the *fit* of our model
- A model "fit" means we've taken a *shape* (our line) and picked the one that best fits our data
- All forms of regression do this
- Ordinary least squares specifically uses a *straight line* as its shape
- The resulting fit can also be written out as an actual line, i.e.

$$ Y = intercept + slope*X $$
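
---

# Lines

- A quick sketch (not from the original slides, with made-up numbers): in R, a fitted line is just a rule we can write down and evaluate

```r
# Made-up intercept and slope, just for illustration
intercept <- 3
slope <- 4

# Plugging a value of X into the line gives a prediction for Y
intercept + slope * 3.2
```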

---

# Lines

- We can use that line as... a line!
- If we plug in a value of `\(X\)`, we get a prediction for `\(Y\)`
- Because these `\(Y\)` values are predictions, we'll give them a hat: `\(\hat{Y}\)`

$$ Y = 3 + 4*X $$
$$ \hat{Y} = 3 + 4*(3.2) $$
$$ \hat{Y} = 15.8 $$

---

# Lines

- We can also use it to explain the relationship
- Whatever the intercept is, that's what we predict for `\(Y\)` when `\(X = 0\)`

$$ Y = 3 + 4*X $$
$$ \hat{Y} = 3 + 4*0 $$
$$ \hat{Y} = 3 $$

---

# Lines

- And as `\(X\)` increases, we know how much we expect `\(Y\)` to increase because of the slope

$$ Y = 3 + 4*X $$
$$ \hat{Y} = 3 + 4*3 = 15 $$
$$ \hat{Y} = 3 + 4*4 = 19 $$

- When `\(X\)` increases by `\(1\)`, `\(Y\)` increases by the slope (which is `\(4\)` here)

---

# Ordinary Least Squares

SO!

- Regression fits a *shape* to the data
- Ordinary least squares specifically fits a *straight line* to the data
- The straight line is described using an `\(intercept\)` and a `\(slope\)`
- When we plug an `\(X\)` into the line, we get a prediction for `\(Y\)`, which we call `\(\hat{Y}\)`
- When `\(X = 0\)`, we predict `\(\hat{Y} = intercept\)`
- When `\(X\)` increases by `\(1\)`, our prediction of `\(Y\)` increases by the `\(slope\)`
- If `\(slope > 0\)`, `\(X\)` and `\(Y\)` are positively related/correlated
- If `\(slope < 0\)`, `\(X\)` and `\(Y\)` are negatively related/correlated

---

# Concept Checks

- How does producing a *line* let us use `\(X\)` to predict `\(Y\)`?
- If our line is `\(Y = 5 - 2*X\)`, explain what the `\(-2\)` means in a sentence
- Not all of the points are exactly on the line, meaning some of our predictions will be wrong! Should we be concerned? Why or why not?

---

# How?

- We know that regression fits a line
- But how does it do that exactly?
- It picks the line that produces the *smallest squares*
- Thus, "ordinary least squares"
- Wait, huh?

---

# Predictions and Residuals

- Whenever you make a prediction of any kind, you rarely get it *exactly right*
- The difference between your prediction and the actual data is the *residual*

$$ Y = 3 + 4*X $$

If we have a data point where `\(X = 4\)` and `\(Y = 18\)`, then

$$ \hat{Y} = 3 + 4*4 = 19 $$

Then the *residual* is `\(Y - \hat{Y} = 18 - 19 = -1\)`.

---

# Predictions and Residuals

So really, our relationship doesn't look like this...

$$ Y = intercept + slope*X $$

Instead, it's...

$$ Y = intercept + slope*X + residual $$

We still use `\(intercept + slope*X\)` to predict `\(Y\)` though, so this is also

$$ Y = \hat{Y} + residual $$
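
---

# Predictions and Residuals

- The same arithmetic as a quick R sketch (using the made-up line and data point from the example above):

```r
# The (made-up) line Y = 3 + 4*X and one data point: X = 4, Y = 18
x <- 4
y <- 18

y_hat <- 3 + 4 * x  # the prediction: 19
y - y_hat           # the residual: 18 - 19 = -1
```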

---

# Ordinary Least Squares

- As you'd guess, a good prediction should make the residuals as small as possible
- We want to pick a line to do that
- And in particular, we're going to *square* those residuals, so the really-big residuals count even more. We really don't want to have points that are super far away from the line!
- Then, we pick the line that minimizes those squared residuals ("least squares")

---

# Ordinary Least Squares

- Start with our data

![](Week_02_Slides_1_What_is_Regression_files/figure-html/unnamed-chunk-6-1.png)<!-- -->

---

# Ordinary Least Squares

- Let's just pick a line at random, not necessarily from OLS

![](Week_02_Slides_1_What_is_Regression_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

---

# Ordinary Least Squares

- The vertical distance from point to line is the residual

![](Week_02_Slides_1_What_is_Regression_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---

# Ordinary Least Squares

- Now square those residuals

![](Week_02_Slides_1_What_is_Regression_files/figure-html/unnamed-chunk-9-1.png)<!-- -->

---

# Ordinary Least Squares

- Can we get the total area in the squares smaller with a different line?

![](Week_02_Slides_1_What_is_Regression_files/figure-html/unnamed-chunk-10-1.png)<!-- -->

---

# Ordinary Least Squares

- Ordinary Least Squares, I can promise you, gets it the smallest

![](Week_02_Slides_1_What_is_Regression_files/figure-html/unnamed-chunk-11-1.png)<!-- -->

---

# Ordinary Least Squares

- How does it figure out which line makes the smallest squares?
- There's a mathematical formula for that!
- First, instead of thinking of `\(intercept\)` and `\(slope\)`, we reframe the line as having *parameters* we can pick

$$ Y = intercept + slope*X + residual $$
$$ Y = \beta_0 + \beta_1X + \varepsilon $$

---

# Terminology Sidenote

$$ Y = \beta_0 + \beta_1X + \varepsilon $$

- In statistics and econometrics, Greek letters represent "the truth": in *the true process by which the data is generated*, a one-unit increase in `\(X\)` is related to a `\(\beta_1\)` increase in `\(Y\)`
- When we put a hat on anything, that is our *prediction* or *estimate* of that true thing. `\(\hat{Y}\)` is our prediction of `\(Y\)`, and `\(\hat{\beta}_1\)` is our estimate of what we think the true `\(\beta_1\)` is
- Note the switch from "residual" to `\(\varepsilon\)`: residuals are what's actually left over from our prediction in the real data, but the *error* `\(\varepsilon\)` is the *true* difference between our line and `\(Y\)`. Even if we get the correct line with `\(\beta_0\)` and `\(\beta_1\)`, there are still points off the line!

---

# Ordinary Least Squares

- Now that we have our line in parametric terms, we can pick our *estimates* of `\(\beta_0\)` and `\(\beta_1\)` in order to make the squared residuals as small as possible
- Pick `\(\hat{\beta}_0\)` and `\(\hat{\beta}_1\)` to minimize:

$$ \sum_i (residual_i^2) $$
$$ \sum_i ((Y_i - \hat{Y}_i)^2) $$
$$ \sum_i ((Y_i - \hat{\beta}_0 - \hat{\beta}_1X_i)^2) $$

where the `\(_i\)` refers to a particular observation, and `\(\sum_i\)` means "sum this up over all the observations"

(Conveniently, you can pick `\(\hat{\beta}_0\)` and `\(\hat{\beta}_1\)` to minimize that expression with basic calculus)
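
---

# Ordinary Least Squares

- A rough sketch of that idea (not from the original slides, with made-up data): search numerically for the `\(\hat{\beta}_0\)` and `\(\hat{\beta}_1\)` that make the sum of squared residuals as small as possible

```r
# Made-up data where the "true" line is Y = 3 + 4*X plus some noise
set.seed(123)
x <- runif(100)
y <- 3 + 4 * x + rnorm(100)

# Sum of squared residuals for a candidate (intercept, slope) pair
ssr <- function(b) sum((y - b[1] - b[2] * x)^2)

# Numerically search for the pair that makes the squares smallest;
# the result should be close to the true intercept 3 and slope 4
optim(c(0, 0), ssr)$par
```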

---

# Let's Play

- Take a look at this OLS simulator and play around with it: [https://econometricsbysimulation.shinyapps.io/OLS-App/](https://econometricsbysimulation.shinyapps.io/OLS-App/)
- Click "Show Residuals" to turn that on
- Try different data generating processes and standard deviations
- What settings make the residuals small or large? Any guesses why?
- What happens if you take the intercept out? What does that make our line do?
- How close does the line come to the data generating process?
- The intercept and slope are in the second table below the graph, under "Estimate"

---

# Concept Checks

- Why might we want to minimize squared residuals rather than just residuals?
- What's the difference between a residual and an error?
- If I have the below OLS-fitted line from a dataset of children:

$$ Height (Inches) = 18 + 2*Age $$

And we have two kids: Darryl, who is 10 years old and 40 inches tall, and Bijetri, who is 9 years old and 37 inches tall. What are each of their (a) predicted values and (b) residuals, and what is the sum of their squared residuals?

---

# Ordinary Least Squares in R

- Ordinary Least Squares is built into R using the `lm` function
- Let's run a regression on the `Orange` data set of tree age and circumference

```r
data(Orange)
lm(circumference ~ age, data = Orange)
```

```
## 
## Call:
## lm(formula = circumference ~ age, data = Orange)
## 
## Coefficients:
## (Intercept)          age  
##     17.3997       0.1068
```

---

# Ordinary Least Squares in R

- In this class, we'll be using `feols()` from the **fixest** package instead of `lm()`
- This doesn't make too much difference now, and the code looks the same so far, but this will help us easily connect to other stuff

```r
library(fixest)
feols(circumference ~ age, data = Orange)
```

```
## OLS estimation, Dep. Var.: circumference
## Observations: 35 
## Standard-errors: IID 
##             Estimate Std. Error  t value   Pr(>|t|)    
## (Intercept) 17.39965   8.622660  2.01790 5.1793e-02 .  
## age          0.10677   0.008277 12.90023 1.9306e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 23.0   Adj. R2: 0.829502
```

---

# Ordinary Least Squares in R

- What's going on here?
- `circumference ~ age` is a *formula* object. It says to take the `circumference` variable and treat it as our dependent variable `\((Y)\)`, and have it vary (`~`) according to `age` (the independent variable, `\(X\)`)
- `data = Orange` tells it which data set to look for those variables in
- The output shows the `(Intercept)` `\((\hat{\beta}_0)\)` as well as the slope `\((\hat{\beta}_1)\)` on `age` (why doesn't it just say `slope`? Because later we'll have more than one slope!)

---

# Ordinary Least Squares in R

- There's lots more information we can get from our regression, but that will wait for later
- For now, let's just make a nice graph of it using the **ggplot2** library (which you already got installed when you installed the **tidyverse**)

```r
library(tidyverse) # This loads ggplot2 as well

ggplot(Orange, aes(x = age, y = circumference)) + 
  geom_point() +             # Draw points
  geom_smooth(method = 'lm') # Add OLS line
```

---

# Ordinary Least Squares in R

- The result:

![](Week_02_Slides_1_What_is_Regression_files/figure-html/unnamed-chunk-16-1.png)<!-- -->

---

# Practice

- Let's work through the "Ordinary Least Squares Part 1" module in the econometrics **swirl**
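
---

# Ordinary Least Squares in R

- A final optional sketch (not from the original slides): using the fitted line to predict, and checking that the OLS line really does give the smallest sum of squared residuals

```r
# Fit the same regression as above
data(Orange)
fit <- lm(circumference ~ age, data = Orange)

# 1. Plug new values of X (age) into the line to get predictions of Y
predict(fit, newdata = data.frame(age = c(500, 1000)))

# 2. Any other intercept/slope pair produces a larger sum of squared residuals
ssr <- function(b0, b1) sum((Orange$circumference - b0 - b1 * Orange$age)^2)
ssr(coef(fit)[1], coef(fit)[2]) # SSR at the OLS estimates
ssr(20, 0.1)                    # a different (made-up) line does worse
```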