9 Simple and Multiple Linear Regressions
In statistical modeling, we seek to describe relationships between variables using mathematical models. A model specifies a dependent variable (the outcome of interest) and one or more independent variables (predictors or features that help explain variability in the outcome). For example, we might model how a company’s sales (dependent variable) relate to its advertising spend in various media (independent variables). The goal is to summarize patterns in data in a way that allows us to infer or predict outcomes. However, building a good model isn’t just about plugging data into formulas – it requires careful thought about the data-generating process, underlying assumptions, and potential biases in the data.
A famous illustration of careful model thinking comes from World War II. The U.S. military examined returning aircraft to decide where to add armor. They observed that the surviving planes had many bullet holes in certain sections (e.g. wings and fuselage) and relatively few in others (e.g. engines). The initial plan was to reinforce the areas with the most holes (the fuselage and wings). Statistician Abraham Wald, part of the classified Statistical Research Group at Columbia University, realized the data were misleading due to survivorship bias. The planes riddled with bullets in critical areas (like engines) often did not return at all – those bullet holes were missing from the analysis. Wald famously advised that armor should go where the bullet holes aren’t, i.e. reinforce the engines. In other words, the military needed to protect the spots that, in the surviving planes, showed little damage, because hits to those spots likely caused the plane to be lost. His insight was that the missing bullet holes were on the missing planes. This story underscores how a modeler must question data and consider what’s not observed in addition to what is.
Figure 1: Illustration of survivorship bias using WWII bomber data. Red dots indicate bullet holes on returning aircraft. The military initially considered reinforcing the areas with heavy damage (wings and fuselage), but Wald pointed out that the planes hit in other areas (engines) never made it back, implying those areas were more vulnerable.
A similar cautionary tale appears in finance. Investment analysts once noted that mutual funds in a certain category had extremely high 10-year returns (averaging around 178% total growth from 1995–2004, or roughly 10.8% per year) among the funds that still existed at the end of the period. On the surface, this implied stellar performance. However, statisticians pointed out that this calculation only included funds that survived the entire decade. Many poorly performing funds had been closed or merged (i.e. dropped from the dataset), biasing the results upward. When those “dead” funds were included, the true average return was much lower – on the order of 134% total (~8.9% per year) for that same period. In other words, the impressive 178% figure was an overestimate caused by ignoring the failures. Likewise, a comprehensive analysis of thousands of funds found that focusing only on surviving funds can overstate performance by roughly 2% per year – which over a decade would inflate reported returns by around 20%. Again, the lesson is that data must be examined in context: Which data are we using to build the model, and which data might be missing? Such awareness is crucial for valid inference.
These examples highlight the modeler’s mindset: before rushing to fit equations, take time to “sharpen the axe.” This means understanding your data context, considering biases or confounding factors, and clarifying the question you’re trying to answer. Always ask whether an observed correlation might be masquerading as causation, and whether there are lurking variables or subgroups that could distort the results. For instance, ice cream sales might correlate with drowning incidents, but a modeler recognizes that a third factor (hot weather) drives both; increasing ice cream sales do not cause more drownings. In regression modeling, it’s vital to consider such confounding variables or missing pieces of data that could bias conclusions. Only after this critical thinking should we proceed to build a statistical model.
9.1 What is a Model?
A statistical model is a formal way to describe the relationship between variables. In a regression context, we typically denote the outcome/dependent variable as \(Y\) and the predictor/independent variable(s) as \(X\). For a simple linear regression with one predictor, the model can be written as:
\[ Y = \beta_0 + \beta_1 X + \epsilon, \]
where \(\beta_0\) is the intercept and \(\beta_1\) is the slope coefficient for \(X\). The term \(\epsilon\) represents the error (or residual), capturing randomness or influences on \(Y\) not included in the model. The intercept \(\beta_0\) is the expected value of \(Y\) when \(X = 0\). The slope \(\beta_1\) indicates how much \(Y\) is expected to increase or decrease, on average, for a one-unit increase in \(X\). In other words, the slope is the rate of change of the outcome with respect to the predictor. It’s important to note that unless the data come from a randomized experiment, a regression coefficient does not by itself imply causation – it only quantifies association. A large \(\beta_1\) suggests that \(X\) is associated with changes in \(Y\), but we must be cautious in interpreting this as a causal effect without further evidence.
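To make the notation concrete, here is a minimal sketch in R that simulates data from this model using made-up values for \(\beta_0\) and \(\beta_1\) and then recovers them with lm(); all numbers and variable names are hypothetical, chosen purely for illustration.

# Minimal sketch: simulate data from Y = beta0 + beta1*X + eps, then recover the coefficients
set.seed(42)                           # for reproducibility
n     <- 100
beta0 <- 2.0                           # hypothetical true intercept
beta1 <- 0.5                           # hypothetical true slope
x     <- runif(n, min = 0, max = 10)   # predictor values
eps   <- rnorm(n, mean = 0, sd = 1)    # error term, centered at zero
y     <- beta0 + beta1 * x + eps       # outcome generated by the model

fit <- lm(y ~ x)                       # ordinary least squares fit
coef(fit)                              # estimates should land close to 2.0 and 0.5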
In a multiple linear regression (with more than one predictor), the model extends to include additional \(X\) terms. For example, with \(k\) predictors \(X_1, X_2, \dots, X_k\), the model is:
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \epsilon. \]
Here we have one slope coefficient \(\beta_j\) for each independent variable \(X_j\). Each coefficient \(\beta_j\) represents the relationship between that predictor and the outcome holding all the other variables constant. This phrase “holding others constant” is key: in multiple regression we can isolate the effect of one factor while controlling for others. For instance, if we regress a person’s weight (\(Y\)) on both height and age (\(X_1\) and \(X_2\)), the coefficient for height reflects the average change in weight for a one-unit increase in height at a given age. Multiple regression thus allows us to account for control variables (additional factors that might affect \(Y\)).
However, multiple regression also brings new challenges. One common issue is multicollinearity – when two or more predictors are highly correlated with each other. In such cases, it becomes difficult to disentangle their individual effects, because changes in one predictor are associated with changes in another. Multicollinearity inflates the uncertainty (standard errors) of the coefficients, and coefficients for correlated variables can become unstable or difficult to interpret. As modelers, we must be mindful of multicollinearity and other complexities when building models with multiple variables.
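Multicollinearity is easy to see with simulated data. The sketch below (an illustration with invented numbers, not a real dataset) fits the same two-predictor model in R twice – once with independent predictors and once with two nearly identical predictors – and compares the resulting standard errors.

# Demonstration: correlated predictors inflate coefficient standard errors
set.seed(1)
n        <- 200
x1       <- rnorm(n)
x2_indep <- rnorm(n)                    # unrelated to x1
x2_corr  <- x1 + rnorm(n, sd = 0.05)    # nearly a copy of x1
y_indep  <- 1 + 2 * x1 + 2 * x2_indep + rnorm(n)
y_corr   <- 1 + 2 * x1 + 2 * x2_corr  + rnorm(n)

summary(lm(y_indep ~ x1 + x2_indep))$coefficients[, "Std. Error"]   # small, stable SEs
summary(lm(y_corr ~ x1 + x2_corr))$coefficients[, "Std. Error"]     # SEs many times larger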
Example: Advertising Budget and Sales
To make these ideas concrete, consider a business question: Is there a relationship between advertising budget and sales, and if so, how strong is it? Suppose a company advertises on TV, radio, and newspapers and keeps track of its sales each period along with how much was spent in each advertising channel. Some key questions a regression model could help answer include:
- Is there a relationship between advertising expenditure and sales?
- How strong is that relationship? For example, does a big increase in ad spend correspond to a big increase in sales, or only a small one?
- Which media contribute most to sales? Do all channels (TV, radio, newspaper) have significant impacts, or only some of them?
- How accurately can we estimate the effect of each advertising medium on sales? (In other words, how confident are we in the contribution of TV vs. radio, etc.)
- How accurately can we predict future sales given a certain advertising plan?
By collecting data on past advertising spends in each medium and the resulting sales, we can fit a multiple regression model to address these questions. For instance, we might use Sales as \(Y\) and the budgets for TV, Radio, and Newspaper as three predictors \(X_1, X_2, X_3\). The fitted model would yield an equation like:
\[ \text{Sales} = \beta_0 + \beta_1(\text{TV Budget}) + \beta_2(\text{Radio Budget}) + \beta_3(\text{Newspaper Budget}) + \epsilon. \]
The output of this model would tell us the estimated coefficient for each medium’s budget. We could learn which coefficients are statistically significant and how large they are. For example, we might find that \(\hat\beta_1\) (TV) and \(\hat\beta_2\) (radio) are positive and significantly different from zero, while \(\hat\beta_3\) (newspaper) is near zero and not significant. This would suggest that TV and radio advertising have a measurable impact on sales, whereas newspaper ads, in this dataset, do not. In that scenario, the company might consider reallocating some of its newspaper ad budget into TV or radio where it seems to have more effect.
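Fitting such a model takes one line in R. The sketch below assumes a data frame named ads with columns sales, tv, radio, and newspaper – hypothetical names standing in for whatever the actual dataset uses.

# Sketch of the advertising regression (data frame and column names are hypothetical)
ad_fit <- lm(sales ~ tv + radio + newspaper, data = ads)
summary(ad_fit)    # coefficient estimates, standard errors, t-tests, R-squared
confint(ad_fit)    # 95% confidence intervals for each medium's coefficient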
The model will also provide an \(R^2\) value (explained below) which indicates how much of the variance in sales is explained by the combined advertising budgets. Suppose the regression yielded an \(R^2\) of 0.90 (90%). We would interpret that as the ad budgets collectively explaining 90% of the variation in sales – an indication of a very strong relationship. In contrast, an \(R^2\) of, say, 0.30 would indicate that 70% of the sales variation is due to factors outside the model (economy, product quality, etc.). The advertising example illustrates how regression models can be used both for explanation (understanding which inputs matter and how) and for prediction (forecasting sales given a new set of ad budgets).
9.2 Estimating Regression Coefficients (Least Squares Method)
Once we have data and a proposed linear relationship, we need to estimate the model coefficients \(\beta_0, \beta_1\) (and \(\beta_2, \beta_3, \dots\) if multiple regression). The most common estimation technique is Ordinary Least Squares (OLS). OLS finds the line (in multiple regression, the hyperplane) that minimizes the sum of squared residuals. A residual is the difference between an observed value and the value predicted by our model. For the \(i\)-th observation, we write the residual as \(e_i = y_i - \hat{y}_i\), where \(y_i\) is the actual observed outcome and \(\hat{y}_i\) is the prediction given by our regression line.
The OLS method chooses the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) (and others \(\hat{\beta}_2, \dots\) for multiple regression) to minimize the Residual Sum of Squares (RSS):
\[ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. \]
This is the sum of the squares of all residuals. Squaring the residuals serves two purposes: (1) it makes all errors positive and emphasizes larger errors (for example, a residual of 4 contributes 16 to RSS, four times the contribution of a residual of 2), and (2) it leads to convenient calculus for finding the minimum. The result of this minimization is the least-squares regression line – the line that best fits the data in the sense of having the smallest total squared error.
Mathematically, for simple linear regression, the OLS estimates have simple closed forms: the slope is \(\hat{\beta}_1 = \text{Cov}(X, Y)/\text{Var}(X)\), the sample covariance of \(X\) and \(Y\) divided by the sample variance of \(X\), and the intercept is \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\), which forces the fitted line through the point of means \((\bar{x}, \bar{y})\). But conceptually, you can imagine shifting and rotating a candidate line until the sum of squared vertical distances from the data points is as low as possible. That is the optimal OLS fit.
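Continuing the simulated example from Section 9.1, these closed-form estimates can be computed directly and compared with the output of lm():

# Closed-form OLS estimates for simple regression, checked against lm()
b1 <- cov(x, y) / var(x)         # slope: sample covariance over sample variance of x
b0 <- mean(y) - b1 * mean(x)     # intercept: line passes through (x-bar, y-bar)
c(intercept = b0, slope = b1)
coef(lm(y ~ x))                  # identical estimates from lm()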
Interpretation of coefficients: Once we have the estimated regression line, say \(\hat{y} = b_0 + b_1 x\), we interpret \(b_0\) as the predicted value of \(Y\) when \(X = 0\), and \(b_1\) as the average change in \(Y\) associated with a one-unit increase in \(X\). To illustrate, suppose we have data on the percentage of high school graduates in each U.S. state (our \(X\) variable) and the percentage of the population living in poverty in each state (our \(Y\) variable). Fitting a simple linear regression to data (for example, from 2012) might yield a model:
\[ \widehat{\text{Poverty\%}} = 64.78 \;-\; 0.62 \times (\text{HS Graduation\%}). \]
In this equation, \(b_0 = 64.78\) and \(b_1 = -0.62\). The intercept \(64.78\) suggests that if the high school graduation rate were 0% (a hypothetical, extreme scenario), the model predicts a poverty rate of about 64.78%. The slope \(-0.62\) tells us that for each 1 percentage point increase in a state’s high school graduation rate, the poverty rate is expected to decrease by 0.62 percentage points on average (because the coefficient is negative). In practical terms, higher educational attainment in a state is associated with lower poverty rates – a relationship that makes intuitive sense.
Let’s use the model to predict a value and examine a residual: Georgia had a high school graduation rate of 85.1%. Plugging \(X = 85.1\) into our model gives
\[ \hat{y} = 64.78 - 0.62(85.1) \approx 12.0. \]
So the model would predict about 12.0% of Georgia’s population to be in poverty. If the actual observed poverty rate in Georgia was, say, 15%, then the residual for Georgia would be \(e = 15 - 12.0 = 3.0\). A positive residual means the model underpredicted (Georgia’s poverty was 3 points higher than the model expected given its graduation rate). If another state had a negative residual, it means the model overpredicted for that state (actual poverty was lower than predicted). By examining residuals for each observation, we can see where the model is performing well or poorly – perhaps indicating outliers or regions where the linear relationship doesn’t hold as strongly.
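As a quick check, the same arithmetic can be reproduced in R using the coefficients quoted above (the 15% observed poverty rate is the illustrative figure used in the text, not a verified data value):

# Prediction and residual for Georgia from the fitted equation
b0 <- 64.78
b1 <- -0.62
y_hat_ga    <- b0 + b1 * 85.1    # predicted poverty rate, about 12.0
residual_ga <- 15 - y_hat_ga     # observed minus predicted, about +3.0
c(prediction = y_hat_ga, residual = residual_ga)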
The goal in OLS regression is to make residuals as small as possible on average (in a squared-error sense). OLS gives equal weight to positive and negative deviations (since they’re squared), and heavily penalizes large deviations. There are other fitting methods (for example, one could minimize the sum of absolute residuals, which leads to a median-based fit), but OLS is by far the most common due to its desirable statistical properties (under the standard assumptions, OLS has the smallest variance among all linear unbiased estimators – the Gauss-Markov theorem) and its relative simplicity.
9.3 Assessing the Model’s Fit and Accuracy
Fitting a line is just the start – we then need to assess how well the model explains the data and how reliable our estimates are. Key metrics and tools for this include the standard errors of the coefficients, the Residual Standard Error, the \(R^2\) statistic, and various hypothesis tests (t-tests for individual coefficients and F-tests for overall model significance). Let’s break these down:
Standard Errors and Confidence Intervals for Coefficients
Each estimated coefficient \(b_j\) (including the intercept \(b_0\)) comes with a standard error (SE). The standard error reflects how much the estimate would vary by chance if we were to repeat the data collection and re-fit the model many times. In other words, it’s the (estimated) standard deviation of the sampling distribution of the coefficient. A smaller standard error means we have a more precise estimate of that coefficient.
With the standard error, we can construct a confidence interval for the true (unknown) coefficient. For example, a 95% confidence interval for a slope \(b_1\) is:
\[b_1 \pm 1.96 \times SE(b_1),\]
assuming the errors are roughly normally distributed. (The multiplier 1.96 comes from the normal distribution and is appropriate for large samples; for smaller samples, software uses the slightly larger critical value from the t-distribution with \(n - k - 1\) degrees of freedom.) This interval means that we’re 95% confident (under the model assumptions) that the true slope \(\beta_1\) lies within those bounds. If the interval does not include 0, it suggests \(\beta_1\) is significantly different from zero at the 5% level (because 0 is outside the plausible range of values).
Standard errors also allow for hypothesis testing of coefficients. The typical null hypothesis is \(H_0: \beta_j = 0\) (meaning predictor \(j\) has no linear effect on \(Y\)). We compute a t-statistic:
\[t = \frac{b_j - 0}{SE(b_j)},\]
which tells us how many standard errors away from zero our estimate is. If \(|t|\) is large, that corresponds to a small p-value, leading us to reject \(H_0\) (evidence that \(\beta_j \neq 0\)). In regression output, you will often see each coefficient accompanied by its standard error, t-value, and p-value.
For instance, in our poverty vs. education model, the slope was \(b_1 \approx -0.62\). Suppose the standard error for this slope was about \(0.155\). Then \(t = -0.62/0.155 \approx -4.0\). Such a large-magnitude t (around 4 in absolute value) would have a very small p-value (much smaller than 0.01). This indicates strong evidence against \(H_0: \beta_1 = 0\). In practical terms, we’d say the data provide strong evidence that graduation rate is associated with poverty rate. Typically, a p-value < 0.05 is considered statistically significant, though of course the threshold can be adjusted depending on context.
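In R, all of these quantities can be read off a fitted model object or computed by hand. The sketch below uses the placeholder model name fit and predictor name x from the earlier simulated example; substitute your own model and variable names.

# Coefficient table, confidence intervals, and a hand-computed t-test
summary(fit)$coefficients                    # columns: Estimate, Std. Error, t value, Pr(>|t|)
confint(fit, level = 0.95)                   # confidence intervals for each coefficient

est   <- coef(summary(fit))["x", "Estimate"]
se    <- coef(summary(fit))["x", "Std. Error"]
t_val <- est / se                            # how many SEs the estimate sits from zero
2 * pt(-abs(t_val), df = df.residual(fit))   # two-sided p-value for H0: beta = 0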
Residual Standard Error (RSE)
The Residual Standard Error (RSE) gives a measure of the typical size of the residuals – essentially, it’s the average error made by the model in predicting \(Y\), measured in the units of \(Y\). Formally, the RSE is defined as:
\[RSE = \sqrt{\frac{RSS}{n - k - 1}},\]
where \(RSS\) is the residual sum of squares (as before), \(n\) is the sample size (number of observations), and \(k\) is the number of predictors in the model. The denominator \((n - k - 1)\) is the degrees of freedom (we lose one degree of freedom for each estimated coefficient, including the intercept). For a simple linear regression (\(k=1\) predictor), this is \((n-2)\). For multiple regression with \(k\) predictors, it’s \((n - k - 1)\).
The RSE can be viewed as an estimate of the standard deviation of the model’s error term \(\epsilon\). Roughly speaking, it tells us how far off, on average, the model’s predictions are from the actual values. A smaller RSE indicates that data points are tightly clustered around the regression line (so predictions are, on average, very close to true values), whereas a larger RSE indicates more scatter (less precision in predictions).
For example, suppose in our poverty vs. graduation model the residual standard error turned out to be about 3.0 (in percentage points of poverty). This would mean that the typical prediction error is around 3 percentage points. If one state’s predicted poverty rate is 12%, a rough 95% prediction interval for an individual state might be about 12% ± 2*(RSE) = 12% ± 6%, i.e. from roughly 6% to 18%. We’d expect most states with similar graduation rates to have poverty rates in that range. If that band (6%–18%) is too wide for practical purposes, then our model might be considered not very precise. On the other hand, an RSE of 1% would indicate very tight predictions (12% ± 2% for a 95% interval, which is much narrower).
When evaluating a model, we often compare the RSE to the mean or range of \(Y\). For instance, if poverty rates range from 5% to 25% across states, an RSE of 3 is moderate. But if we were predicting something with a very narrow range, an RSE of 3 might be large. Thus, context matters for judging whether an RSE is “small” or “large.”
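For a fitted lm object in R (again using the placeholder name fit), the RSE can be computed from the formula above or read off directly:

# Residual standard error by hand and via sigma()
rss <- sum(resid(fit)^2)         # residual sum of squares
n   <- length(resid(fit))        # number of observations
k   <- length(coef(fit)) - 1     # number of predictors (excluding the intercept)
sqrt(rss / (n - k - 1))          # RSE from the formula
sigma(fit)                       # the same value reported by R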
\(R^2\) – Coefficient of Determination
While the RSE is an absolute measure of error (in the units of \(Y\)), the \(R^2\) statistic provides a relative measure of how well the model fits, by describing the proportion of variance in \(Y\) that is explained by the model. \(R^2\) (pronounced “R-squared”) ranges from 0 to 1 (or 0% to 100%). It is defined as:
\[ R^2 = 1 - \frac{RSS}{TSS}, \]
where \(TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2\) is the Total Sum of Squares – the total variation in \(Y\) around its mean \(\bar{y}\). \(RSS\) is, as before, the sum of squared residuals from the model. If our model explains none of the variability in \(Y\), then \(RSS \approx TSS\) and \(R^2\) will be near 0. If the model perfectly predicts every \(Y\) (no residual error), then \(RSS = 0\) and \(R^2 = 1\). In most real-world cases, \(R^2\) lies somewhere in between these extremes.
An \(R^2\) of 0.38, for instance, means 38% of the variation in the dependent variable is “explained” by the independent variable(s) in the model. The remaining 62% of variation is due to other factors not in the model or inherent randomness. In our high school graduation vs. poverty example, the \(R^2\) was about 0.38 (38%). We interpret this as: about 38% of the variability in state poverty rates is accounted for by differences in high school graduation rates. The fact that this isn’t higher tells us that factors beyond education (such as economic conditions, cost of living, employment opportunities, etc.) play a substantial role in poverty rates.
It’s important to clarify what \(R^2\) does and does not mean. \(R^2\) is not “the percent of predictions the model gets right” (a common misconception). It’s about variance explained, not classification accuracy or probability of being correct. An \(R^2\) of 0.38 doesn’t mean the model is “38% correct” – it means that 38% of the variance in \(Y\) is captured by the linear relationship with \(X\). For linear relationships, \(R^2\) is actually the square of the Pearson correlation coefficient between \(X\) and \(Y\) (in simple regression). In multiple regression, it generalizes to the proportion of variance explained by all the included predictors together.
One must also be cautious: adding more predictors to a model will generally increase \(R^2\) (or at least not decrease it), even if those predictors are only weakly related to \(Y\). This happens because adding a variable can only explain more variance (or leave it unchanged, in the worst case). Therefore, when comparing models with different numbers of predictors, it’s often better to look at Adjusted \(R^2\), which penalizes model complexity (the formula for adjusted \(R^2\) includes a penalty for additional predictors). Information criteria like AIC or BIC are also used for model comparison in such cases. But for a given model, \(R^2\) is a convenient summary of fit.
To give a sense of scale: in the advertising budget example, if we found \(R^2 \approx 0.90\), that would indicate a very strong relationship (90% of the variance in sales explained by the ad budgets). In contrast, our education vs. poverty model’s \(R^2\) of 0.38 is a moderate relationship – there is some predictive power, but a lot of variation remains unexplained.
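The same decomposition is easy to verify in R; the sketch below assumes the simulated fit and outcome vector y from earlier.

# R-squared from its definition, compared with the values in the model summary
rss <- sum(resid(fit)^2)            # residual sum of squares
tss <- sum((y - mean(y))^2)         # total sum of squares
1 - rss / tss                       # proportion of variance explained
summary(fit)$r.squared              # same value from summary()
summary(fit)$adj.r.squared          # adjusted R-squared, penalizing extra predictors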
Example of Model Summary
After fitting a regression model, statistical software typically provides a summary output. For our simple regression of poverty on graduation rate, the output might look like this (in a simplified format):
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 64.78 12.01 5.39 1.2e-06
GraduationRate -0.620 0.155 -4.00 0.0002
This table tells us: \(\hat{\beta}_0 = 64.78\) (with SE ≈ 12.01) and \(\hat{\beta}_1 = -0.620\) (with SE ≈ 0.155). The “t value” column is just the estimate divided by the standard error (for the slope, \(-0.620/0.155 \approx -4.00\)). The Pr(>|t|) column is the p-value. For GraduationRate, the p-value \(0.0002\) indicates strong statistical significance (since 0.0002 ≪ 0.05). This means we have very strong evidence that the true slope is not zero – graduation rate is significantly associated with poverty. The intercept’s p-value (1.2e-06, which is \(1.2\times10^{-6}\)) is also statistically significant in this case, though the substantive importance of the intercept is less meaningful (an intercept of 64.78% poverty at 0% graduation is extrapolating far beyond the data since no state has anywhere near 0% graduation).
Elsewhere in the model summary we’d see goodness-of-fit measures. For example:
Residual standard error: 3.02 on 49 degrees of freedom
Multiple R-squared: 0.380, Adjusted R-squared: 0.369
F-statistic: 16.0 on 1 and 49 DF, p-value: 0.0002
This tells us the RSE is about 3.02 (with 49 degrees of freedom, since there were 51 observations – 50 states plus DC – and we estimated 2 parameters, intercept and slope). The \(R^2\) is 0.380 (38.0%), and the adjusted \(R^2\) is 0.369 (36.9%). For simple regression, the difference between \(R^2\) and adjusted \(R^2\) is minor; with more predictors, adjusted \(R^2\) would penalize the inclusion of unnecessary variables. The F-statistic (16.0 on 1 and 49 DF) is another way of testing the overall model significance: it tests the null hypothesis that all slope coefficients are zero (here it’s just one slope, so it’s equivalent to the t-test for that slope). The F-test p-value 0.0002 matches the slope’s p-value since it’s a one-predictor model. In multiple regression, the F-test checks whether the model with all predictors is significantly better than a model with no predictors (just an intercept).
From such a summary, we would report that the model explains about 38% of the variance in poverty rates (\(R^2 = 0.38\)) and that there is a statistically significant negative relationship between graduation rates and poverty (p ≈ 0.0002). The residual standard error of ~3.02 percentage points gives a sense of prediction error magnitude. All of this helps us assess the model’s accuracy and usefulness.
9.4 Checking Model Assumptions (Validity)
Fitting and summarizing a linear model is not the end of the story. We must also verify that the data and model satisfy the assumptions under which our inferential statistics (standard errors, p-values, confidence intervals) are valid. The main assumptions for OLS linear regression are:
- Linearity: The relationship between the independent variables and the dependent variable is approximately linear. (The model is a linear approximation of the true relationship.)
- Nearly normal residuals: The residuals (errors) of the model are approximately normally distributed and centered at zero.
- Constant variability (Homoscedasticity): The residuals have constant variance at all levels of the fitted values (no systematic change in spread as the prediction changes).
- Independence: Each observation is independent of the others (for example, no correlated errors for clustered data or time series without adjustment).
We will focus on the first three (which can be partially checked with diagnostic plots), though the fourth is equally important – it often depends on study design. If these assumptions are violated, the model’s estimates or inferences may be unreliable or require adjustment.
Let’s address each assumption and how to check it:
1. Linearity
We assume that the true relationship between \(X\) and \(Y\) is linear (or that our chosen linear model form is a good approximation). To validate this, always plot the data first. A simple scatterplot of \(Y\) versus \(X\) (for each predictor) is extremely useful. The points should roughly cluster around a straight line if a linear model is appropriate. If the scatterplot shows a clear curve or non-linear pattern, a straight line will likely be a poor fit.
Another diagnostic is the residuals vs. fitted values plot. Here, we plot each residual \(e_i\) on the y-axis against the corresponding fitted value \(\hat{y}_i\) (or equivalently, against \(X_i\) in simple regression). In a properly linear relationship, the residuals should scatter randomly around 0, with no systematic pattern. If we see a distinct shape – for example, residuals positive at low \(X\), negative at mid \(X\), and positive again at high \(X\) (forming a U-shape) – that indicates a violation of linearity. The model is systematically overpredicting in some regions and underpredicting in others, which suggests the true relationship is non-linear (perhaps quadratic, exponential, etc.).
Example: In our education vs. poverty case, the scatterplot of poverty% vs. HS graduation% showed a roughly linear downward trend. The residuals vs. fitted plot did not display any obvious curve, supporting the linearity assumption. If instead the data had a pronounced curve (imagine poverty rates dropping quickly as graduation rises from 60% to 85%, then leveling off for higher graduation rates), a straight line would either overestimate poverty at low education levels or underestimate it at high education levels. In that scenario, a linear model would not be adequate – we might consider a nonlinear transformation of either variable or include a quadratic term (e.g. \((\text{Graduation\%})^2\)) to capture the curvature.
If the linearity assumption fails, the solution is to modify the model rather than hope the inference holds. Sometimes a simple logarithmic or polynomial transformation can straighten the relationship. Other times, a completely different modeling approach (like a generalized additive model or another nonlinear model) may be needed.
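In R, the basic linearity check is a residuals-versus-fitted plot for the fitted model (placeholder name fit); the commented line shows one way a quadratic term could be added if curvature appears.

# Residuals vs. fitted values: points should scatter randomly around zero
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted")
abline(h = 0, lty = 2)              # dashed reference line at zero

# If a clear curve appears, one remedy is a quadratic term, e.g.:
# fit2 <- lm(y ~ x + I(x^2))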
2. Nearly Normal Residuals
Regression inference (t-tests, F-tests, etc.) typically assumes that the residuals are drawn from a normal distribution with mean 0. This matters most for smaller sample sizes – by the Central Limit Theorem, with large \(n\) the sampling distributions of estimates tend to normality even if residuals are slightly non-normal. We don’t need perfect normality, but we do want to check that residuals are approximately symmetric and not wildly skewed or outlier-ridden.
To assess this, we can look at a histogram or density plot of the residuals and see if it looks roughly bell-shaped and centered at zero. A Normal Q-Q plot (quantile-quantile plot) is another tool: if residuals are normal, their Q-Q plot will be approximately a straight 45° line. Deviations (especially S-shaped curves in the Q-Q plot) indicate departures from normality.
What could violate normality? Perhaps the outcome \(Y\) has some outliers – extreme values that the model doesn’t fit well – resulting in very large residuals on one side. Or the error distribution could be skewed (maybe errors are often positive but occasionally very negative, or vice versa). As an example, consider an income regression: a few individuals with much higher income than others could create right-skewed residuals if the model underestimates those high incomes.
In our poverty model example, with 51 observations (50 states plus DC), we might plot the residuals. Suppose we see a histogram that is roughly symmetric and centered near 0, with most residuals between -5 and +5 and no extreme outliers beyond that. That would suggest the residuals are approximately normal. If instead we saw a long tail on the right (maybe one state had a residual of +10, far separated from the rest), we’d worry about non-normality and investigate that state (perhaps Washington D.C. or another outlier in poverty data). Minor deviations from normality are usually not fatal, but severe ones (especially with small sample sizes) might lead us to be cautious with p-values and consider using statistical methods that don’t assume normality (or transforming the response variable).
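A histogram and a normal Q-Q plot of the residuals take only a few lines in R (again using the placeholder name fit):

# Normality checks for the residuals
hist(resid(fit), breaks = 15, main = "Histogram of residuals")   # roughly bell-shaped?
qqnorm(resid(fit))                  # points should fall close to a straight line
qqline(resid(fit))                  # reference line for a normal distribution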
3. Constant Variability (Homoscedasticity)
Homoscedasticity means “equal scatter.” In regression, this assumption is that the variance of the errors is the same regardless of the level of \(X\) (or of the fitted value \(\hat{Y}\)). When this holds, the residuals vs. fitted plot should show a roughly constant vertical spread of points across the range of fitted values. If the residuals grow more spread out as the fitted values increase (or conversely, start wide and funnel down), that’s called heteroscedasticity – the error variance is not constant.
For example, imagine we model household income as a function of age. At younger ages, incomes might be relatively low and not too variable (most 22-year-olds have similar low starting salaries). By mid-career (say age 40-50), some people are making modest incomes but others are making very high incomes, so the variability in income is much larger. If we fit a linear model, the residuals for younger ages might be small (low scatter) while for middle age the residuals are huge (high scatter). The residual plot would show narrow scatter on the left (young age, low \(\hat{Y}\)) and very wide scatter on the right (middle age, high \(\hat{Y}\)), i.e. a fanning-out pattern. This violates the constant variance assumption.
Why do we care? Heteroscedasticity does not bias the coefficient estimates (\(b_j\) are still unbiased), but it does affect our uncertainty estimates. When error variance isn’t constant, the standard errors for coefficients (and hence t-tests and p-values) that assume homoscedasticity may be inaccurate – typically, OLS standard errors can be underestimated in the presence of heteroscedasticity, leading to overly optimistic p-values. In other words, if the spread of residuals changes, we might be giving some points too much influence and others too little when computing standard errors.
How to check: again, the residuals vs. fitted plot is key. If you see the points forming a clear pattern of increasing or decreasing spread, that’s a red flag. Sometimes one can also plot residuals vs. each predictor to see if variance changes with certain predictors.
If we detect heteroscedasticity, possible remedies include transforming the response (e.g., using log(\(Y\)) often stabilizes variance when variability increases with the mean) or using weighted least squares (giving less weight to points with larger variance). Alternatively, one can compute robust standard errors that adjust for heteroscedasticity (these won’t fix the fit, but will correct the standard error estimates to be valid even when variance isn’t constant).
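A sketch of these remedies in R, assuming the add-on packages lmtest and sandwich are installed (the log-transform line also assumes the outcome is strictly positive):

# Responses to heteroscedasticity
log_fit <- lm(log(y) ~ x)                        # log transform often stabilizes variance

library(lmtest)                                  # provides bptest() and coeftest()
library(sandwich)                                # provides vcovHC()
bptest(fit)                                      # Breusch-Pagan test for non-constant variance
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))  # heteroscedasticity-robust standard errors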
In the context of our example, suppose we check the residuals vs. fitted for the poverty model. If states with low predicted poverty (say 5–10%) show residuals mostly within ±2 points, but states with high predicted poverty (20%+) show residuals ranging ±5 points, that’s some increase in spread – though maybe not severe. If the plot looked like a clear “funnel” opening up, we’d say there is heteroscedasticity. In our actual diagnostic we didn’t see a strong fanning pattern, so we concluded the variance looked roughly constant across the range of fitted values.
Figure 2: Example of a residuals vs. fitted values plot showing heteroscedasticity. The vertical spread of residuals increases as the fitted values grow, forming a “fanning out” shape toward the right. In such cases, the assumption of constant variance is violated, as larger predicted values come with larger errors (wider scatter).
To summarize assumptions in practical terms: When our linear model is appropriate, the residuals should look like unstructured noise. They should be centered around 0 with no systematic patterns (linearity check), they should form a consistent cloud with roughly the same thickness across the plot (homoscedasticity check), and their histogram should look roughly bell-shaped (normality check).
If we see patterns, here’s a recap of what they might indicate:
- Non-linearity: Systematic patterns in residuals (e.g. curvature). The model is missing a nonlinear trend.
- Heteroscedasticity: Residuals’ spread changes with fitted value (e.g. funnel shape: narrow on one side, wide on the other). The variance of errors isn’t constant.
- Non-normality: The residual distribution is skewed or has outliers. This could be due to a few anomalous points or inherent skewness in data.
It’s also important to watch for outliers or influential points specifically. An outlier is an observation with an unusual \(Y\) value given its \(X\) (a large residual). An influential point is one that has a big impact on the model fit; often this is a case with an extreme \(X\) value (high leverage) and also somewhat off the trend. Such a point can pull the regression line towards itself. You might have a case whose residual isn’t huge (because the line moved toward it), but if you removed that one point, the line would shift considerably – that’s an influential point. Diagnostic measures like Cook’s distance exist to quantify influence, but a simple approach is to fit the model with and without suspicious points to see how much results change.
If outliers or influential observations are present, you should investigate them. Sometimes they are data errors or special cases that merit exclusion; other times they are legitimately part of the population and tell us that the relationship might be different in that regime. You might decide to include additional predictors to explain that point, or use a robust regression method that downweights outliers.
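Cook's distance is available directly in R. The sketch below flags observations above a common rough cutoff and, as a sensitivity check, refits the model without the single most influential point (placeholder model name fit, as before).

# Influence diagnostics with Cook's distance
d <- cooks.distance(fit)
plot(d, type = "h", ylab = "Cook's distance")    # tall spikes mark influential points
which(d > 4 / length(d))                         # a common rough cutoff for follow-up

worst <- which.max(d)                            # index of the most influential observation
rbind(full = coef(fit),
      without_worst = coef(update(fit, subset = -worst)))   # how much do estimates move?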
Example Diagnostic Scenarios: Suppose we fit a regression and then examine the diagnostics:
- In the residual vs. fitted plot, we notice a clear curve – residuals are positive at the low end of \(\hat{Y}\), negative in the middle, positive at the high end. This suggests our linear model is inadequate (violating linearity). The remedy could be to add a quadratic term or some nonlinear transformation of \(X\) to capture the curvature.
- Alternatively, imagine the residual vs. fitted plot shows that residuals are very tightly clustered around 0 for small fitted values, but the scatter grows for larger fitted values (like Figure 2 above). This indicates heteroscedasticity – perhaps the variance of \(Y\) increases with its mean. We might address this by using a log scale for \(Y\) (which often stabilizes variance) or using weighted regression.
- Finally, if a histogram of residuals shows a long tail on one side or a couple of extreme outliers, we might be concerned about normality (and those outliers). We could check if removing an outlier dramatically changes results (if so, report both results or find a reason to exclude it), or consider a transformation that makes the distribution more symmetric.
In short, checking these assumptions helps ensure that our model is not only fitting the data but also suitable for drawing reliable conclusions. If assumptions are seriously violated, the standard errors and p-values we calculated may no longer be valid, and we’d need to take corrective steps or be more cautious in interpretation.
9.5 Example: Model Diagnostics for Education vs. Poverty
Let’s apply the above diagnostics to our simple regression of state poverty% on high school graduation%. We’ll walk through what we found:
Linearity: A scatterplot of poverty rate vs. graduation rate showed a roughly linear negative trend. In the residuals vs. fitted plot, we did not see obvious curvature. The residuals appeared randomly scattered around 0, rather than following a pattern. This suggests the linear model form was reasonable for this relationship. (Had we seen, say, a U-shaped pattern in residuals, we would conclude the relationship isn’t purely linear and consider adding a squared term or otherwise transforming the model.)
Normal Residuals: We plotted a histogram (and we could examine a Normal Q-Q plot) of the residuals. The histogram was roughly unimodal and symmetric about 0 – it looked somewhat bell-shaped, with most residuals around 0 and tapering out to ±5 or so. There were a couple of moderately large residuals (for example, Washington D.C. had a residual around +5.4 points, meaning the model underpredicted DC’s poverty rate by 5.4; and perhaps a state like New Hampshire had a residual of -4, meaning the model overpredicted its poverty). However, there were no extreme outliers beyond that, and the overall shape wasn’t severely skewed. The Normal Q-Q plot showed points roughly along the line with only minor deviations in the tails. In sum, the residuals were approximately normal – certainly close enough for our inference to be reliable. (If we had seen a very skewed or bimodal residual distribution, we’d worry more. Here it was fine.)
Constant Variability: The residuals vs. fitted plot did not exhibit a clear fanning or funneling; the spread of residuals looked roughly consistent across the range of fitted values (from low predicted poverty to high predicted poverty). For states with low predicted poverty (~8-12%), residuals spanned maybe -3 to +3; for states with higher predicted poverty (~18-22%), residuals spanned maybe -4 to +5. There might have been a slight increase in spread at the higher end, but nothing dramatic. We concluded that the homoscedasticity assumption was not obviously violated. (If, hypothetically, the variance had doubled for high-prediction states, we’d take note and potentially use a transformation or robust SEs. In this case, it appeared reasonably constant.)
Independence: Each state’s data point can be considered independent of others (one state’s graduation and poverty rate doesn’t directly affect another’s in this dataset). There is no time series or hierarchical structure here that would violate independence, so we are good on this front as well. (If we were analyzing something like repeated measurements on the same individuals, or neighboring regions influencing each other, we would need to account for that structure. Independence is an assumption to think about at the data collection stage – e.g., ensure random sampling, etc.)
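All four standard diagnostic plots can be produced at once in R for any fitted lm object (placeholder name fit):

# One-stop diagnostics for a fitted linear model
par(mfrow = c(2, 2))   # 2 x 2 grid of plots
plot(fit)              # residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
par(mfrow = c(1, 1))   # restore the default layout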
Since all diagnostic checks came out acceptable, we gain confidence that the linear regression model is appropriate for this data. Therefore, our earlier results – the coefficient estimates, confidence intervals, and p-values – can be trusted and interpreted in the usual way. If any assumption had been strongly violated, we would have to be more cautious. For example, if we had found a very large outlier that was overly influential, we might report the analysis with and without that point to see how much it affects the conclusions. Or if we found strong heteroscedasticity, we might quote robust standard errors (which are valid even when variance isn’t constant) to see if the significance of the graduation rate predictor still holds.
The general lesson is: diagnostics are a vital part of regression analysis. They help avoid being fooled by a seemingly significant result that is actually an artifact of a violated assumption or a quirky dataset. By checking plots and conditions, we either validate our model or gain clues on how to improve it.
9.6 Multiple Regression Considerations
Thus far, we’ve focused on simple linear regression (one predictor). Multiple linear regression – with several predictors – follows the same principles, but with a few additional considerations.
All the assumptions we discussed still apply, but now we are in a multi-dimensional space. Linearity means that the outcome is a linear combination of the predictors. We should check whether relationships are roughly linear in each predictor (possibly using residual plots or partial regression plots for each \(X_j\)). If not, we might need to include transformations or interaction terms.
Residuals should still be approximately normal and homoscedastic. We can look at the overall residuals vs. fitted plot for the model. We can also plot residuals against each predictor to see if variance changes with any particular \(X\). Sometimes heteroscedasticity might be linked to one of the predictors (e.g., error variance grows with income level, regardless of other factors).
One new issue in multiple regression is multicollinearity among predictors, as mentioned earlier. If two predictors \(X_1\) and \(X_2\) are highly correlated, the regression may have trouble distinguishing their individual effects. The coefficients might become very sensitive to small changes in data, and their standard errors inflate (making it hard to declare either effect statistically significant even if jointly they matter). A common diagnostic for multicollinearity is the Variance Inflation Factor (VIF). VIF measures how much the variance of an estimated coefficient is increased due to multicollinearity. As a rule of thumb, a VIF above 5 or 10 is considered high, indicating that the predictor is highly collinear with others. In such cases, one might consider removing or combining correlated predictors, or interpreting the coefficients with caution. In our advertising example, if TV and radio budgets were themselves very correlated (say companies that spend a lot on TV also spend a lot on radio), we might find each of their individual p-values to be high (insignificant) even if together they explain sales well. If the VIF for TV and radio budgets were large, we might decide that it’s hard to separate their effects – perhaps we’d combine them into a single “broadcast advertising” variable or gather more data to break the correlation.
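VIFs can be computed with the vif() function from the add-on car package (assumed installed); the sketch below reuses the hypothetical advertising data frame and column names from the earlier example.

# Variance inflation factors for the advertising model
library(car)                                              # provides vif()
ad_fit <- lm(sales ~ tv + radio + newspaper, data = ads)
vif(ad_fit)                                               # values above roughly 5-10 flag strong collinearity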
Interpretation of coefficients in multiple regression must always be in the context of “holding other variables constant.” This is a strength of multiple regression – we can attempt to isolate the effect of one factor net of others – but it can also be a source of confusion. For example, a positive coefficient on \(X_1\) (say, years of education) in a multiple regression predicting income (with other factors like experience, age, etc. included) suggests that, for people with the same age and experience, more education is associated with higher income. This is closer to a causal interpretation (though still not guaranteed without a randomized design) than a simple correlation between education and income would be, because we’ve controlled for some confounders. However, if important confounding variables are omitted, the coefficients can still be biased. One must think hard about which variables to include. Including too few (omitting key ones) leaves bias; including too many (especially irrelevant ones) can add noise and multicollinearity issues.
In practice, multiple regression is used to answer questions like the one posed by researcher Nitin Pangarkar in a study: “Does the degree of internationalization affect the performance of small- and medium-sized enterprises (SMEs)?”. Here, performance (e.g. profit growth) would be \(Y\), and degree of internationalization (DOI) (e.g. percent of revenue from foreign markets) is a key \(X\) of interest. But an SME’s performance could also depend on other factors – firm size, industry sector, domestic economic conditions, etc. So Pangarkar’s analysis included such controls in a multiple regression. The question becomes: after accounting for firm size, industry, and other relevant variables, is DOI still associated with performance? According to the study, the answer was yes – Pangarkar found that a higher degree of internationalization positively impacts SME performance on average. The multiple regression framework allowed isolating the effect of internationalization while holding other factors constant. Of course, one must still be cautious about causality (maybe the causation could partly run the other way, or maybe only the more capable firms expand internationally – a self-selection issue). But the regression provides a quantitative estimate of the association: e.g., “all else equal, a 10% increase in foreign revenue share is associated with X% increase in annual profit growth.”
When building multiple regressions, we also check assumptions via diagnostics. We might make residual plots for each predictor or use added-variable plots to see if each relationship is linear. We check for outliers in a multivariate sense – sometimes an observation isn’t an outlier on \(Y\) or any single \(X\), but is unusual in the combination of \(X\) values (high leverage point). Influential point analysis (like Cook’s distance) helps here. Essentially, the modeling process often involves iterating: include variables, check diagnostics, possibly transform variables or remove outliers, perhaps go back and include an omitted variable if diagnostics suggest it, and so on.
So, multiple regression adds nuance and power by handling many variables at once, but it requires the same careful validation as simple regression, plus vigilance about multicollinearity and the interpretation of each coefficient in context. The reward is a richer understanding: for example, we might conclude “TV advertising has a strong effect on sales (holding other media constant), radio has a smaller effect, and newspaper has none,” or “International expansion significantly improves SME performance when controlling for size and industry.” These insights are valuable for decision-making.
9.7 Conclusion
Linear regression is a foundational tool in data analysis for quantifying relationships between variables. A simple linear regression provides an easy-to-understand equation of a line (\(Y = \beta_0 + \beta_1 X\)) that characterizes how \(Y\) changes on average with \(X\). We learned how to estimate this line using the least squares method and interpret the resulting coefficients: the intercept as a baseline and the slope as the rate of change. We also saw how to measure the model’s effectiveness with metrics like the Residual Standard Error (to quantify the typical prediction error) and \(R^2\) (to quantify the proportion of variance explained by the model).
Crucially, we discussed the importance of validating a fitted model by checking assumptions. Diagnostics like residual plots and normality checks help ensure that our inferences (confidence intervals and hypothesis tests) are valid. When the assumptions (linearity, normality, homoscedasticity, independence) hold reasonably well, we can trust the p-values and predictions from the model. When they don’t, we might need to refine the model or use caution in interpretation. For example, a nonlinear pattern suggests we improve the model form; heteroscedasticity might lead us to use a transformation or robust standard errors; an outlier might prompt further investigation or a separate analysis.
Regression models allow us not only to explain relationships (e.g. which factors are associated with higher sales or lower poverty) but also to predict outcomes (e.g. forecasting sales for a given advertising plan, or predicting poverty rates for a given education level). However, the quality of a model’s results is only as good as the data and assumptions that underlie it. Always be mindful of potential biases (like survivorship bias or omitted variable bias) and the difference between correlation and causation. As we saw with the WWII planes and mutual funds, failing to account for what data aren’t there or what factors are lurking can lead to faulty conclusions.
Multiple linear regression extends these ideas to more complex scenarios with several predictors. It gives us a more powerful lens to control for confounding factors and test specific effects in a ceteris paribus way (“all else being equal”). But with that power comes the need for diligence about issues like multicollinearity and model interpretability. Through an iterative process of model fitting, checking, and refinement, we aim to arrive at a regression model that provides both explanatory insights and reliable predictive power.
In the spirit of Abraham Lincoln’s quote, “Give me six hours to chop down a tree and I will spend the first four sharpening the axe,” a good analyst will spend substantial time understanding the data and verifying the model assumptions before trusting any regression results. This preparatory work – examining data, considering what might be missing or misleading, choosing the right form of the model – is the “sharpening” that ensures when we do make conclusions or predictions (the actual “chopping”), our analysis is as solid as possible.
Finally, to reinforce the importance of statistical thinking, consider what has been nicknamed “The Most Dangerous Equation.” This refers to the formula for the standard error of the mean (developed by De Moivre in the 18th century), which shows that the variability of an average is inversely proportional to the sample size (specifically, \(SE_{\bar{Y}} = \sigma/\sqrt{n}\)). Ignorance of this fact – that small samples yield much more variable outcomes – has misled people in many fields for centuries. For example, among all schools in a state, the very best and very worst average test scores often come from the smallest schools; their performance appears extreme due to high variability in small samples, not necessarily because those schools are truly the best or worst. Similarly, investment funds with only a short track record or small number of assets might show spectacular returns simply by chance variation. Recognizing this, we become cautious about dramatic findings from small samples. In statistics, “small is not beautiful” when it comes to reliability – a reminder that we must account for sample size in our interpretations. This principle ties back to regression as well: when comparing groups or making predictions, consider the strength of evidence (how much data) behind those numbers. By keeping such insights in mind, we build and interpret regression models (and indeed all statistical analyses) more wisely, tempering our conclusions with an understanding of variability and uncertainty.
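A short simulation makes De Moivre's point concrete. The sketch below draws school averages from the same score distribution (mean 500, standard deviation 100 – invented numbers) for small and large schools; only the sample size differs, yet the small schools produce far more variable, and therefore more extreme, averages.

# Small samples produce more extreme averages, purely by chance
set.seed(7)
small_means <- replicate(1000, mean(rnorm(25,  mean = 500, sd = 100)))   # schools of 25 students
large_means <- replicate(1000, mean(rnorm(400, mean = 500, sd = 100)))   # schools of 400 students
sd(small_means)      # about 100 / sqrt(25)  = 20
sd(large_means)      # about 100 / sqrt(400) = 5
range(small_means)   # noticeably wider than the range of the large-school averages
range(large_means)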