4  Regression Models in Machine Learning with R

Regression models are a cornerstone of machine learning and statistics, used for both predictive modeling and causal inference. In predictive modeling, the goal is to accurately predict an outcome variable given a set of features, whereas in causal inference the aim is to estimate the effect of certain variables (e.g. treatments or policies) on an outcome, often under theoretical assumptions about the data-generating process. These two goals can require different modeling approaches. In many scientific disciplines, regression has been traditionally used for explanatory modeling (seeking causal understanding), and it is sometimes incorrectly assumed that a model with high explanatory power will automatically have high predictive power. However, explanatory (causal) and predictive models serve different purposes, and conflating them is problematic for scientific progress. This chapter will clarify these distinctions and demonstrate how regression techniques can be applied for both purposes.

We provide a comprehensive overview of common regression approaches in machine learning, focusing on implementation in R and applications in social science contexts. We will cover:

  • Linear regression (ordinary least squares)
  • Regularized linear regression: ridge, lasso, and elastic net
  • Support vector regression (SVR)
  • Tree-based methods: decision trees, random forests, and gradient boosting (including XGBoost)
  • Generalized additive models (GAMs)

For each approach, we discuss the theoretical foundations, including key assumptions, strengths, and limitations, and appropriate use cases. We also address important concepts that cut across these methods: the bias-variance tradeoff, risks of overfitting, the role of regularization, strategies for feature selection, model cross-validation and tuning, and considerations of model interpretability. Throughout the chapter, code examples in R are provided to demonstrate how to implement, tune, and evaluate each model using packages such as glmnet, caret, mlr3, ranger, e1071, and xgboost. We use either real datasets (e.g. the Boston housing data or social science datasets) or simulated data to illustrate model behavior in a social science context. All code is presented in R Markdown format with executable code chunks, allowing the reader to reproduce the analysis.

By the end of this chapter, a reader should understand not only how to fit various regression models in R, but also when and why to use each approach. We emphasize an academic yet accessible style, blending mathematical intuition with practical examples. All claims are supported by literature in the field, with inline citations in APA style and a full reference list at the end.

4.1 The Bias-Variance Tradeoff and Model Complexity

Effective use of regression models requires understanding the fundamental bias-variance tradeoff. In any predictive modeling task, the expected prediction error can be decomposed (approximately) into bias, variance, and irreducible error components. Bias is error due to erroneous or overly simplistic assumptions in the model (i.e. underfitting), while variance is error due to sensitivity to small fluctuations in the training data (i.e. overfitting). Generally, as model complexity increases, bias decreases and variance increases. Simpler models (e.g. a linear regression with few predictors) tend to have higher bias but lower variance (they might miss important relationships but are stable across datasets), whereas more complex models (e.g. a deep decision tree or a high-degree polynomial) can fit nuances in the training data (lower bias) but may capture noise as if it were signal, leading to high variance on new data.

This trade-off implies the existence of an optimal model complexity that minimizes the total prediction error. As complexity (e.g. flexibility or number of parameters) increases, the sum of squared bias initially decreases and then increases once variance grows too large, resulting in a U-shaped test error curve. In practice, we seek models that balance bias and variance to achieve low generalization error on new data.

Figure 1. Illustration of the bias–variance tradeoff for a regression model. As model complexity increases, bias (black curve) decreases and variance (green curve) increases. The test mean squared error (purple curve) is high for very simple models (high bias) and very complex models (high variance), with an optimal complexity in between.

Managing the bias-variance tradeoff is crucial in regression modeling. Techniques such as cross-validation are commonly used to estimate out-of-sample error for different model choices and select an appropriate complexity. Cross-validation involves partitioning the data (e.g. into k folds) and ensuring that models are evaluated on data not used for training, providing an approximately unbiased estimate of out-of-sample prediction error. We will use cross-validation extensively for model tuning (e.g. choosing regularization parameters) and model selection. In R, packages like caret and mlr3 facilitate these procedures by automating data splitting, model training, and performance evaluation in a grid search or iterative framework.
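
To make the U-shaped error curve concrete, the following minimal sketch uses simulated data in which the true relationship is a sine curve and polynomial degree stands in for model complexity; the cv.glm() function from the boot package supplies the k-fold cross-validation estimate of test error (all object names here are illustrative):

library(boot)   # for cv.glm()

set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.4)   # nonlinear truth plus noise
sim <- data.frame(x = x, y = y)

# Estimate out-of-sample MSE for polynomial degrees 1-10 with 10-fold CV
cv_mse <- rep(NA, 10)
for (d in 1:10) {
  fit <- glm(y ~ poly(x, d), data = sim)          # gaussian glm = linear regression
  cv_mse[d] <- cv.glm(sim, fit, K = 10)$delta[1]  # CV estimate of test MSE
}
plot(1:10, cv_mse, type = "b",
     xlab = "Polynomial degree (model complexity)", ylab = "10-fold CV MSE")
which.min(cv_mse)   # complexity with the lowest estimated test error

Low degrees underfit (high bias) while very high degrees begin to chase noise (high variance), so the CV error curve typically bottoms out at an intermediate degree.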

Another key concept is overfitting versus underfitting. Overfitting occurs when a model is so flexible that it captures random noise in the training data as if it were true structure, yielding excellent training fit but poor predictive performance on new data. Underfitting occurs when a model is too simple to capture important patterns, resulting in both training and test errors being high. The methods discussed in this chapter include built-in mechanisms to control overfitting: for example, regularization penalties in ridge/lasso reduce variance at the cost of a bit more bias, pruning in decision trees restricts overly complex trees, and ensemble averaging in random forests reduces variance by aggregating many overfitted trees. Understanding these concepts will inform how we tune and compare regression models.

We now turn to individual regression modeling approaches, starting with the classical linear model and then moving to more advanced techniques.

4.2 Linear Regression

Linear regression is the oldest and most widely used regression technique. It assumes that the relationship between the independent variables (features) and the dependent variable (outcome) is linear. In a simple linear regression with one predictor, this means \(y = \beta_0 + \beta_1 x + \varepsilon\); in a multiple linear regression with features \(x_1, x_2, \dots, x_p\), the model is:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon, \]

where \(\beta_j\) are coefficients to be estimated from data and \(\varepsilon\) is an error term. Typically, the coefficients are estimated by Ordinary Least Squares (OLS), which finds the \(\hat{\beta}\) that minimize the residual sum of squares \(RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2\). Under the Gauss–Markov assumptions (linearity, uncorrelated errors, homoscedasticity), OLS is the Best Linear Unbiased Estimator (BLUE); adding the assumption of normally distributed errors yields exact inference via t-tests and F-tests on coefficients. Linear regression models are easy to interpret—the coefficients \(\beta_j\) represent the expected change in \(y\) for a one-unit change in \(x_j\) (holding other variables constant)—making them popular in social sciences for explanatory analysis.

Assumptions and Limitations: Linear regression relies on the assumption of a linear relationship between each predictor and the outcome. If the true relationship is non-linear, a simple linear model will exhibit bias. One can partially address non-linearity by including transformations of predictors (quadratic terms, logarithms, interactions, etc.), but this requires manual specification. The model also assumes that errors have constant variance (homoscedasticity) and are uncorrelated; violation of these can be diagnosed by residual plots. Multicollinearity (high correlation between predictors) can inflate the variance of coefficient estimates, making inference unreliable. Furthermore, outliers can have a large influence on OLS estimates. Despite these limitations, linear models often serve as a useful baseline in analysis.

Strengths: Linear regression is interpretable, analytically tractable, and quick to train even on large datasets. If the linearity assumption holds reasonably well, it provides an accurate and parsimonious description of data relationships. It also forms the foundation for more complex models (many advanced methods build on linear model ideas).

Implementation in R: R has excellent support for linear modeling. The basic function is lm(). For example, using the Boston housing dataset (where the outcome medv is median home value and predictors include socioeconomic and environmental variables):

# Load data (Boston housing from MASS package)
install.packages("MASS")    # if not already installed
library(MASS)
data("Boston")              # load the Boston dataset
str(Boston)                 # view structure of the data

Assume we want to predict medv (housing value) from the average number of rooms rm and the percent of lower-status population lstat (two important predictors in this dataset):

# Fit a multiple linear regression model
lm_fit <- lm(medv ~ rm + lstat, data = Boston)
summary(lm_fit)

This would output the estimated coefficients, their standard errors, t-statistics, and p-values. We might see output like:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.94      3.73      0.789    0.430  
rm            4.93      0.42     11.741   <2e-16 ***
lstat        -0.77      0.05    -15.280   <2e-16 ***

Interpreting these: holding lstat constant, each additional room is associated with an increase of about 4.93 (in $1000s) in median home price, and holding rm constant, each 1% increase in lower-status population is associated with a decrease of about $0.77k in home price. Both predictors are highly significant (p < .001). The \(R^2\) and residual standard error in the summary indicate how well the model fits; for example, a multiple \(R^2\) of 0.64 would mean the two predictors explain 64% of the variance in home prices.

Predictive Performance: In terms of prediction, a linear model may suffice if relationships are roughly linear and additive. However, if the true relationships are more complex, a linear regression will exhibit bias (underfitting). For instance, on the full Boston data with all 13 predictors, a linear model might achieve an \(R^2\) around 0.73–0.74 on held-out test data (meaning ~74% of variance explained). More flexible models like random forests can achieve higher \(R^2\) (around 0.85–0.90 on the same data) by capturing nonlinearities and interactions, at the expense of interpretability (more on this later). In a social science context, if our goal is causal inference (e.g. the effect of education on income while controlling for other factors), linear regression remains a primary tool – but one must be cautious about unmeasured confounders and use domain knowledge to justify the model specification.
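
A quick sketch of such a held-out evaluation for a linear model with all predictors follows (the exact numbers depend on the random split; idx, lm_full, and the other names are illustrative):

# Illustrative train/test split to gauge out-of-sample accuracy of the full linear model
set.seed(1)
idx <- sample(1:nrow(Boston), 400)
lm_full <- lm(medv ~ ., data = Boston[idx, ])
lm_pred <- predict(lm_full, newdata = Boston[-idx, ])
test_y  <- Boston$medv[-idx]
sqrt(mean((lm_pred - test_y)^2))                                  # test RMSE
1 - sum((lm_pred - test_y)^2) / sum((test_y - mean(test_y))^2)    # test R-squared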

Before moving to advanced models, it’s good practice to check linear model diagnostics (e.g. residual plots for non-linearity or heteroscedasticity, Q-Q plots for normality, etc.) and possibly apply remedial measures (transformations or weighted least squares) if assumptions are violated. In R, the standard diagnostic plots come directly from the fitted model object:
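
# Residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage
par(mfrow = c(2, 2))
plot(lm_fit)
par(mfrow = c(1, 1))

Assuming we have a solid baseline, we can then explore ways to improve predictive accuracy and deal with high-dimensional settings using regularization techniques.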

4.3 Regularization: Ridge, Lasso, and Elastic Net

One limitation of OLS linear regression is that it can overfit when the number of predictors \(p\) is large relative to the number of observations \(n\), or when predictors are highly correlated. Regularization techniques address this by adding a penalty term to the loss function, discouraging overly complex models (large coefficients) and thus reducing variance at the cost of some bias. Two widely used regularization methods are ridge regression and lasso regression, and a hybrid of the two called elastic net. These methods still assume a linear functional form, but shrink or select coefficients to improve predictive performance and interpretability.

Ridge Regression (L2 regularization)

Ridge regression adds an \(L_2\) penalty (the squared magnitude of coefficients) to the OLS loss. The ridge estimates \(\hat{\beta}^{\text{ridge}}\) minimize:

\[ \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^p \beta_j^2, \]

where \(\lambda \ge 0\) is the tuning parameter that controls the strength of the penalty. When \(\lambda = 0\), this is just OLS; as \(\lambda\) increases, the penalty term \(\lambda\sum \beta_j^2\) grows, forcing coefficients toward zero. In the extreme \(\lambda \to \infty\), the coefficients (except intercept) all approach zero. The effect is that ridge regression shrinks coefficients, especially those of less predictive variables, thereby reducing model complexity. Ridge is particularly helpful in situations with multicollinearity or \(p > n\), where OLS coefficients would be unstable or non-unique. In fact, ridge regression can yield a unique solution even if \(p > n\), whereas OLS cannot. Another benefit is computational: ridge avoids searching through subsets of predictors (an infeasible \(2^p\) problem) by solving a single convex optimization.

Bias-Variance Perspective: The ridge penalty introduces some bias (shrinkage of coefficients) but can dramatically lower variance, often reducing test error as long as the increase in bias is smaller than the decrease in variance. In situations where the true relationship is close to linear, OLS is unbiased but may have high variance (especially if many predictors are included); ridge will introduce a bit of bias but achieve a lower overall mean squared error by taming variance. Empirically, one often finds an optimal \(\lambda\) (via cross-validation) that gives the lowest validation error, balancing this tradeoff.

Choosing \(\lambda\): The regularization parameter \(\lambda\) is usually selected via cross-validation. We try a grid of \(\lambda\) values and choose the value that minimizes the average validation set error (e.g. mean squared error). In R, the glmnet package makes this easy via the function cv.glmnet(), which performs \(k\)-fold cross-validation and identifies the optimal \(\lambda\). By default, glmnet standardizes predictors (zero mean, unit variance) because the penalty is not invariant to scaling; this is important since ridge treats a change of 1 unit in any predictor equally in the penalty, and unscaled variables could unfairly dominate.

Implementation in R (Ridge): Using the Hitters data (a classic dataset of baseball players’ statistics and salaries, often used to demonstrate regularization), we can perform ridge regression as follows:

# Install and load glmnet for penalized regression
install.packages("glmnet")
library(glmnet)

# Prepare data: Hitters dataset from ISLR package
install.packages("ISLR")
library(ISLR)
data("Hitters")
Hitters <- na.omit(Hitters)  # remove rows with missing Salary

# Define matrix of predictors (x) and response (y)
x <- model.matrix(Salary ~ ., data = Hitters)[, -1]  # model.matrix creates dummies; drop intercept column
y <- Hitters$Salary

# Perform ridge regression with cross-validation to choose lambda
set.seed(1)
cv_fit <- cv.glmnet(x, y, alpha = 0)  # alpha=0 for ridge
best_lambda <- cv_fit$lambda.min
best_lambda

The output best_lambda might be, for example, 326. This is the value of \(\lambda\) that minimized the 10-fold CV error. We can then examine the coefficients at this \(\lambda\):

ridge_mod <- glmnet(x, y, alpha = 0, lambda = best_lambda)
coef(ridge_mod)

We would find that many coefficients are shrunk towards zero relative to an OLS fit. For instance, a particular player’s performance metric that was not very predictive will have a coefficient close to 0 under ridge. Notably, ridge regression does not set coefficients exactly to zero (except at \(\lambda=\infty\)) – all predictors remain in the model, just with smaller weights. This is a key difference from lasso, as we discuss next.

Use Cases and Strengths: Ridge regression is particularly useful when we have many correlated features and we care about prediction accuracy more than model sparsity. For example, in a survey with many related socioeconomic indicators predicting an outcome, ridge will include them all but shrink their effects, effectively sharing information among correlated variables. Ridge often achieves better predictive accuracy than OLS on new data by trading a little bias for a lot of variance reduction. It’s also a go-to method when \(p > n\) (more features than observations), where OLS fails. However, if we believe only a subset of features are truly relevant, ridge’s inability to perform variable selection can be a drawback for interpretability.

Lasso Regression (L1 regularization)

The Lasso (Least Absolute Shrinkage and Selection Operator) introduced by Tibshirani (1996) adds an \(L_1\) penalty to the OLS loss. The lasso coefficients \(\hat{\beta}^{\text{lasso}}\) minimize:

\[ \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^p |\beta_j|\,. \]

This \(L_1\) penalty has the effect of shrinking some coefficients exactly to zero for sufficiently large \(\lambda\). In other words, lasso performs automatic feature selection: it will include only a subset of the predictors in the final model (those with the strongest signals), setting the rest to zero. This sparsity makes lasso very appealing when we suspect many predictors are irrelevant, or when interpretability of the model (identifying key drivers) is important.

Comparison to Ridge: Both ridge and lasso shrink coefficients toward zero to avoid overfitting, but the nature of the \(L_1\) penalty (which has a sharp corner at zero in the constraint region geometry) means lasso tends to zero-out weaker coefficients entirely. Ridge, with its \(L_2\) penalty (which gives a circular constraint region), instead keeps all coefficients small but generally nonzero. As a consequence, lasso can be seen as a variable selection method in addition to a regularization method. However, lasso has downsides: in cases of extreme multicollinearity or when \(p > n\), lasso may arbitrarily pick one variable of a correlated set to keep and ignore others, leading to instability. Additionally, if \(p > n\), lasso can select at most \(n\) variables before it saturates (once it has \(n\) parameters it can fit the data perfectly, then additional variables won’t enter). Elastic net, discussed next, was developed to address some of these issues by combining ridge and lasso penalties.

Implementation in R (Lasso): The procedure is analogous to ridge but with alpha=1 in glmnet (alpha is the mixing parameter between ridge (0) and lasso (1)). Using the same Hitters data:

set.seed(1)
cv_fit_lasso <- cv.glmnet(x, y, alpha = 1)
best_lambda_lasso <- cv_fit_lasso$lambda.min
coef(cv_fit_lasso, s = "lambda.min")

Now, many coefficients will be exactly zero. For instance, in the Hitters data, lasso might select a handful of the most important player statistics (like hits, home runs, etc.) and eliminate others. The exact list of selected features can be seen from the coefficient output: any predictor with a coefficient of 0 is effectively dropped. This makes lasso useful as a feature selection tool in high-dimensional data. It is not uncommon in social science to have far more predictors (e.g. hundreds of survey questions or demographic variables) than observations; lasso can help focus on the most relevant ones, albeit with caution if inference is the goal (the lasso’s regularization bias means standard errors of coefficients are not straightforward).

Tuning and Use Cases: Like ridge, lasso requires choosing \(\lambda\) via cross-validation. The selected \(\lambda\) balances model complexity and error. We often examine not just lambda.min (which minimizes error) but also a slightly larger value like lambda.1se (the largest \(\lambda\) within one standard error of the minimum CV error) for a more parsimonious model. Lasso tends to perform well in scenarios where there are a few strong predictors and many weak ones, effectively finding the needle in the haystack. It has been used successfully in genetics (selecting important genes) and economics (selecting relevant instruments or controls). One must be careful, however, in using lasso for causal inference: by selecting variables, it introduces bias to coefficient estimates. In some advanced causal modeling strategies, a “post-lasso OLS” is used (run lasso to select variables, then refit OLS on those) to mitigate bias, but that is beyond our scope here.
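
In the running example, both solutions can be inspected directly from the cross-validation object:

# Compare the parsimonious lambda.1se fit with the error-minimizing lambda.min
cv_fit_lasso$lambda.1se
coef(cv_fit_lasso, s = "lambda.1se")   # usually fewer nonzero coefficients than at lambda.min
plot(cv_fit_lasso)                     # CV error curve; vertical lines mark lambda.min and lambda.1se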

Why Lasso Yields Sparsity: Geometrically, the \(L_1\) penalty’s constraint region (an \(L_1\) ball) has corners, and the optimal solution often lies on a corner, meaning some coefficients are zero. In contrast, ridge’s \(L_2\) penalty (an \(L_2\) ball) produces solutions that typically lie on the smooth boundary (circle) without hitting axes, so ridge does not zero out coefficients. Thus, lasso achieves a form of built-in feature selection that ridge does not.

Elastic Net

Elastic Net regression, proposed by Zou and Hastie (2005), combines the \(L_1\) and \(L_2\) penalties. Its penalty is \(\lambda_1 \sum_j |\beta_j| + \lambda_2 \sum_j \beta_j^2\). Often one uses a convex combination: \(\alpha \sum |\beta_j| + (1-\alpha)\sum \beta_j^2\), where \(\alpha=1\) is pure lasso and \(\alpha=0\) pure ridge. The elastic net penalty yields a model that can select variables like lasso while also handling correlated predictors more gracefully like ridge. In practice, elastic net with \(\alpha\) somewhere between 0 and 1 might include groups of correlated variables together (not arbitrarily dropping one of a group), a desirable property if predictors are highly correlated (e.g., in sociological indices where several variables move together). Elastic net also avoids an issue where lasso might perform poorly if \(p \gg n\) and the true model is highly sparse; the ridge component helps stabilize it.

Implementation in R (Elastic Net): In glmnet, any \(0 < \alpha < 1\) gives an elastic net. We typically cross-validate to choose both \(\alpha\) and \(\lambda\). The caret package can facilitate a two-dimensional tuning (because glmnet itself tunes \(\lambda\) for a given \(\alpha\)). For example:

install.packages("caret")
library(caret)

# Define training control with cross-validation
train_control <- trainControl(method = "cv", number = 10)
# Train an elastic net model with caret, tuning over alpha and lambda
set.seed(1)
enet_model <- train(Salary ~ ., data = Hitters,
                    method = "glmnet",
                    trControl = train_control,
                    tuneLength = 10)  # tuneLength will try various alpha and lambda
enet_model$bestTune

Suppose the bestTune results in alpha = 0.5 and some lambda value; this would indicate a balance where roughly equal parts L1 and L2 penalty gave the lowest error. We could then examine enet_model$finalModel or use predict(enet_model, newdata=...) to make predictions.
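
An alternative that stays within glmnet is to cross-validate \(\lambda\) for each \(\alpha\) on a small grid and keep the pair with the lowest CV error. This minimal sketch reuses the x and y matrices defined earlier (for a strictly fair comparison one would also fix the fold assignment via cv.glmnet's foldid argument):

# Cross-validate lambda within each alpha value and pick the best pair
set.seed(1)
alphas  <- seq(0, 1, by = 0.25)
cv_fits <- lapply(alphas, function(a) cv.glmnet(x, y, alpha = a))
cv_errs <- sapply(cv_fits, function(fit) min(fit$cvm))   # best CV error for each alpha
alphas[which.min(cv_errs)]                               # selected alpha
cv_fits[[which.min(cv_errs)]]$lambda.min                 # corresponding lambda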

When to use Elastic Net: Whenever we encounter a scenario with many predictors, especially with groups of correlated ones, elastic net is a robust default. It effectively handles multicollinearity by distributing coefficient weight across correlated variables. It also has an additional advantage: if there are more variables than observations, lasso can select at most \(n\) variables, whereas elastic net (with \(\alpha<1\)) doesn’t have such a strict limit because of the ridge component. In social science data with many related indicators (education, income, occupation, etc.), an elastic net might include all variables in a group (with moderate shrinkage) rather than pick only one and ignore others, which may align better with theory (e.g., multiple factors all contribute some effect). Zou and Hastie’s original paper showed that elastic net often outperforms pure lasso in predictive accuracy when variables are correlated.

Example Result: To illustrate the benefit of regularization, consider a comparison of test set performance (measured by Root Mean Squared Error, RMSE) for different models:

  • OLS: RMSE = 5.3
  • Ridge (CV optimal \(\lambda\)): RMSE = 4.7
  • Lasso (CV optimal \(\lambda\)): RMSE = 4.6 (with 10 of 20 predictors selected)

In this hypothetical example, regularization improved predictive accuracy by mitigating overfitting. In general, if we observe that an OLS model has much lower training error than test error, ridge or lasso are promising remedies.

4.4 Support Vector Regression (SVR)

Support Vector Machines (SVMs) are a class of powerful machine learning models more commonly used for classification, but they can be adapted for regression problems via Support Vector Regression (SVR). An SVR does not assume a simple linear form \(y = \beta^T x\); instead, it seeks a function (linear or nonlinear via kernels) that deviates from the true targets by at most an \(\epsilon\) margin for all training data, while at the same time being as flat as possible. This is achieved by using an \(\epsilon\)-insensitive loss function: errors within \(\pm \epsilon\) of the true value incur no loss (they’re “in the tube”), and beyond that margin a linear penalty kicks in. The model tries to balance model complexity (flatness of the function) with tolerance to small errors, controlled by a regularization parameter.

Theory of SVR

In a linear SVR, we solve an optimization problem: minimize \(\frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^n L_\epsilon(y_i, \beta^T x_i + b)\), where \(L_\epsilon\) is the \(\epsilon\)-insensitive loss (0 if \(|y - \hat{y}| \le \epsilon\), and grows linearly for deviations beyond \(\epsilon\)). The term \(\frac{1}{2}\|\beta\|^2\) is a regularization term (keeping the model flat/small weights), and \(C\) is a tuning parameter (often denoted \(1/\lambda\) in other formulations) that determines the trade-off between maximizing the margin (model simplicity) and minimizing training errors beyond \(\epsilon\). A larger \(C\) places more penalty on errors outside the tube (thus the model will fit training data more closely, potentially at the expense of flatness), while a smaller \(C\) emphasizes flatness (more regularization, potentially underfitting). SVR can also be nonlinear: by using the kernel trick, we can implicitly map features into a high-dimensional space and do linear SVR there. Common kernels include polynomial and radial basis function (RBF), which allow fitting very flexible nonlinear relationships.

One important concept in SVR (and SVMs generally) is that the solution for \(\beta\) can be expressed in terms of a subset of training points called support vectors. These are the points that lie outside the \(\epsilon\)-tube or on its boundary – they are the observations that “support” or define the regression function. Points well inside the tube (with error less than \(\epsilon\)) have no influence on the model (their Lagrange multipliers are zero in the dual formulation). This makes SVR robust to small fluctuations or outliers: a few extreme points might lie outside the tube and get penalized, but those within the \(\epsilon\) range have no effect on the fitted function.

Strengths and Weaknesses: SVR is a non-parametric method (especially with nonlinear kernels) – it does not assume a specific parametric form for the function beyond the general smoothness implied by the kernel. It often performs well in moderate dimensions and with data that has a clear margin of tolerance for error (the \(\epsilon\)). It can capture complex nonlinear patterns that linear regression would miss, without explicitly trying many transformations. SVRs, like SVMs, tend to have good generalization due to the max-margin principle and the regularization parameter \(C\). However, they come with computational costs: training complexity is on the order of \(O(n^3)\) for naive implementations, which can be slow for very large \(n\). Memory usage can also be high since all pairwise kernel evaluations may be needed. In addition, SVR has multiple hyperparameters (\(C\), \(\epsilon\), and kernel parameters like the RBF width) that need tuning via grid search and cross-validation, which can be time-consuming. Interpretability is limited: while one can inspect which data points are support vectors or examine dual coefficients, SVR does not yield simple coefficients or variable importance measures like a linear model or tree.

Implementation in R (SVR)

The e1071 package in R provides an svm() function that can perform both classification and regression (it infers the mode of the response). We can use it for SVR. For example, consider a toy problem: we have a single feature \(x\) and outcome \(y = \sin(x) + \text{noise}\). A linear model will underfit (since \(\sin\) is nonlinear), but an SVR with an RBF kernel can fit the sine curve.

install.packages("e1071")
library(e1071)

# Simulate some nonlinear data
set.seed(123)
n <- 100
x <- seq(0, 2*pi, length.out = n)
y <- sin(x) + rnorm(n, sd = 0.2)
data <- data.frame(x, y)

# Fit a linear model vs SVR
lin_mod <- lm(y ~ x, data = data)
svr_mod <- svm(y ~ x, data = data, kernel = "radial", cost = 1, epsilon = 0.1)

# Predictions
data$pred_lin <- predict(lin_mod, data)
data$pred_svr <- predict(svr_mod, data)
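
# Visual check (base graphics): the linear fit is a straight line, while the
# RBF-kernel SVR tracks the sine pattern
plot(x, y, pch = 16, col = "grey60")
lines(x, data$pred_lin, col = "blue", lwd = 2)
lines(x, data$pred_svr, col = "red", lwd = 2)
legend("topright", legend = c("linear fit", "SVR (RBF) fit"),
       col = c("blue", "red"), lwd = 2, bty = "n")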

Plotting the true \(y\) against \(x\) along with both sets of predictions (as in the code above) shows that the linear model is a straight line (a poor fit), whereas the SVR with an RBF kernel closely follows the sine wave pattern (with slight deviations where noise occurs). The cost parameter here is analogous to \(C\), and epsilon is the \(\epsilon\) in the loss function. We can tune these with e1071’s tune.svm() (caret is an alternative, but its svmRadial method tunes only C and sigma, not epsilon):

# Tune SVR hyperparameters (cost, epsilon, and the RBF parameter gamma)
# with cross-validation via e1071's tune.svm()
set.seed(1)
svr_tune <- tune.svm(y ~ x, data = data, kernel = "radial",
                     cost = c(0.1, 1, 10),
                     epsilon = c(0.01, 0.1, 1),
                     gamma = c(0.1, 0.5, 1))
svr_tune$best.parameters

This process yields the best combination of cost, epsilon, and the RBF kernel parameter gamma according to cross-validation (10-fold by default); the corresponding fitted model is available as svr_tune$best.model. For higher-dimensional data, say using SVR on the Boston housing dataset, one might do:

set.seed(1)
train_index <- sample(1:nrow(Boston), 400)
train_data <- Boston[train_index, ]
test_data <- Boston[-train_index, ]
svr_model <- svm(medv ~ ., data = train_data, kernel = "radial", cost = 10, epsilon = 1)
preds <- predict(svr_model, test_data)
mean((preds - test_data$medv)^2)  # test MSE

Comparing an SVR to earlier models: one might find SVR does fairly well, though on tabular data like Boston, tree-based methods often perform better due to their ability to capture interactions automatically. SVR’s advantage shows more in cases with moderate \(p\) (feature count) and complex nonlinear relationships that are hard to hand-engineer features for.

Interpretability: SVR is not very interpretable. One cannot easily decompose the prediction into contributions of each feature (as in linear regression). However, techniques like partial dependence plots or SHAP (Shapley additive explanations) can be applied to SVR to understand the effect of a single feature on the predictions on average. These tools treat the SVR as a black-box function \(f(x)\) and probe its behavior by varying one feature at a time or computing contribution values. If interpretability is paramount, one might lean towards GAMs or decision trees for the final model, or use SVR only for prediction and use simpler models to explain the insights.

Epsilon-insensitive loss and support vectors: To summarize SVR’s core idea, imagine a tube of width \(2\epsilon\) around the fitted regression function. The SVR finds a function (e.g., a line or curve) such that as many observations as possible fall within this tube. Errors within the tube incur no loss, and only errors outside the tube are penalized linearly. The data points outside the tube (or on the boundary) are the support vectors — these points influence the SVR solution, whereas points inside the tube do not affect the model. The width of the tube is set by \(\epsilon\), while \(C\) controls how heavily errors outside the tube are penalized relative to keeping the function flat: a high \(C\) makes the model chase those errors (allowing fewer of them, potentially overfitting), while a low \(C\) tolerates more of them in favor of a flatter, more regularized fit (potentially underfitting).

4.5 Tree-Based Methods

Tree-based methods partition the feature space into a set of rectangular regions and fit a simple model (like a constant) in each region. They offer a very different approach to regression than the global linear or kernel methods. Decision trees are non-parametric and can capture complex interaction structures in the data, often at the expense of fine-grained smoothness or extrapolation. We discuss decision trees first, then ensembles of trees: random forests (using bagging) and gradient boosting.

Decision Trees for Regression

A decision tree for regression (also known as a regression tree) is a flowchart-like structure where each internal node represents a binary decision on one of the input features, and each leaf node gives a predicted outcome (typically the mean of the target for observations in that leaf). The CART algorithm (Classification and Regression Trees; Breiman et al., 1984) builds the tree by recursively splitting the data based on feature values that reduce the sum of squared errors (SSE) the most. At each step, among all features and all possible split points, it chooses the split that maximally decreases the SSE of the two resulting partitions (for regression; for classification, a measure like Gini impurity or information gain is used). This process continues until some stopping criterion is met: for example, a minimum number of observations in a leaf, a minimum reduction in SSE required for a split, or a maximum tree depth. If grown fully, a tree can perfectly fit the training data (each leaf could be a single observation, yielding SSE = 0), which is overfitting; hence, pruning or early stopping is usually applied to limit complexity.

Properties: A basic decision tree provides a set of if-else rules that are highly interpretable: one can follow the path for a given observation to see why a prediction was made (e.g., IF education > 12.5 years and urbanicity = yes and income < $30k, THEN predicted happiness = 6.8). Trees can naturally handle qualitative (categorical) predictors by splitting on categories, and handle interactions between variables (each split can be on a different feature, so a leaf implicitly represents an interaction of conditions). They require little preprocessing — there’s no need to standardize variables, they can handle missing values (via surrogate splits), and they are invariant to monotonic transformations of individual predictors. Trees are also robust to outliers in predictors (an extremely large value will just end up in some partition and not affect other splits).

Strengths and Weaknesses: As noted, interpretability is a strength (especially for small trees). Trees are also nonlinear and capture heterogeneity: different regions of the feature space can have different relationships or different mean outcomes. However, a single tree is often not very accurate. Trees tend to be high-variance learners; a small change in the data can substantially change the splits (especially when two splits yield nearly equal SSE reduction, the choice between them may flip with a slight data perturbation). They also can easily overfit if grown too deep, capturing noise. Empirical comparisons often find that an optimally pruned tree does not match the predictive performance of well-tuned regression models like lasso or of ensemble methods like random forests. This is why trees are often used in ensembles (discussed next) to improve performance. Another limitation is that the greedy splitting may not find a globally optimal tree – it makes locally optimal splits at each step.

Nonetheless, decision trees remain popular for their transparency and ease of use. In some cases, a tree can reveal interesting structure in the data (e.g., threshold effects or interactions) that a linear model would miss.

Implementation in R (Decision Tree): The classic packages are rpart (which implements CART) and the older tree package. We will use rpart here, along with rpart.plot for visualization:

install.packages(c("rpart", "rpart.plot"))
library(rpart)
library(rpart.plot)

# Using Boston housing data for a regression tree example
set.seed(1)
train_idx <- sample(1:nrow(Boston), 300)
train_data <- Boston[train_idx, ]
test_data <- Boston[-train_idx, ]

# Train a regression tree to predict medv
tree_model <- rpart(medv ~ ., data = train_data, method = "anova",
                    control = rpart.control(cp = 0.01))  # cp = complexity parameter for pruning
printcp(tree_model)  # display the cost-complexity pruning table

The output of printcp shows various possible tree sizes and their cross-validated errors. Suppose the tree with complexity parameter 0.01 has 5 splits. We can visualize the tree:

rpart.plot(tree_model, type = 2, extra = 101)

Figure 2. Example regression tree predicting medv (housing price) from the Boston data. Each internal node splits on a predictor (with the splitting rule shown), and each leaf node contains the predicted value (mean outcome) for that region. For instance, one branch of this tree might correspond to homes in neighborhoods with low lstat (percent lower-status population) and high rm (number of rooms), yielding a higher predicted medv, whereas another branch corresponds to areas with high lstat and thus lower medv. The tree is easily interpretable but may not capture smooth linear trends as well as some other methods.

From Figure 2, we can interpret the model: the first split might be on lstat (e.g., if lstat < 9.5% vs. higher), indicating that neighborhood socio-economic status is a primary driver of home value. One branch leads to high-value neighborhoods, which are then split by another variable (say rm, the number of rooms), and so forth. Each leaf gives a predicted medv for that region of the feature space. This tree, while simpler than a full unpruned tree, highlights interactions (e.g., the effect of rm is considered separately in high-lstat vs low-lstat neighborhoods) and non-linearity (piecewise constant predictions).

To evaluate performance:

pred_tree <- predict(tree_model, newdata = test_data)
tree_mse <- mean((pred_tree - test_data$medv)^2)
tree_mse

We can compare this MSE to that of a linear model or other models on the same training/test split. Often, a single tree’s performance will be worse than a tuned linear or regularized model if the true relationships are mostly linear. However, if there are threshold effects or strong interactions, the tree might do better.

Pruning: The cp (complexity parameter) in rpart controls how the tree is pruned. A smaller cp allows more splits (a more complex tree); a larger cp prunes more aggressively. Typically, one would use cross-validation (the cptable from printcp) to find the optimal cp that minimizes the cross-validated error. In the code above, we set cp = 0.01 initially. We could adjust this to the value that minimizes the xerror in tree_model$cptable. For example:

best_cp <- tree_model$cptable[which.min(tree_model$cptable[,"xerror"]), "CP"]
pruned_tree <- prune(tree_model, cp = best_cp)

A pruned tree (with fewer splits) is generally preferable to the fully grown tree to avoid overfitting.

Decision trees in social science: Decision trees can be useful as an exploratory tool to uncover segmentations in the data and potential interactions. For example, an education researcher might use a tree to discover that different predictors of test scores matter in urban vs. rural schools (the tree might first split on an indicator for urban/rural, then split on teacher-student ratio or income level within each group). However, for final inferential analysis, one might still prefer a parametric model or other approaches that provide confidence intervals and are easier to connect to theory. In recent years, trees have also been adapted for causal inference (e.g., causal trees and causal forests for heterogeneous treatment effect estimation), but those are specialized extensions beyond our scope.

Random Forests (Ensembles of Trees via Bagging)

A Random Forest is an ensemble of decision trees built using the technique of bagging (bootstrap aggregating) combined with random feature selection, introduced by Breiman (2001). The idea is to combine the predictions of many trees to reduce variance and improve accuracy, at the cost of interpretability of the overall model.

How it works: We grow \(B\) trees on bootstrap samples of the training data (each tree sees a random sample of the data with replacement). This by itself is bagging; if we average the predictions of many such trees, the average tends to have lower variance than a single tree (because the individual tree errors cancel out to some extent). Random forests add an extra twist: at each candidate split in the tree-growing process, instead of considering all \(p\) predictors, the algorithm selects a random subset of \(m_{\text{try}}\) predictors and only considers splits among those. Typically \(m_{\text{try}} = \sqrt{p}\) for classification and \(m_{\text{try}} \approx p/3\) for regression by default (these can be tuned). This random feature selection further decorrelates the trees: if there is a very strong predictor, bagged trees would all tend to use it in the top split, making their predictions highly correlated. By forcing different trees to consider different predictors, the ensemble’s overall variance is reduced more than by bagging alone.

After training, we have an ensemble \(\{\hat{f}_1(x), \hat{f}_2(x), \dots, \hat{f}_B(x)\}\) of \(B\) tree predictors. The random forest prediction is simply the average of the individual tree predictions: \(\displaystyle \hat{f}_{RF}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(x)\). For regression, this is the mean; for classification, it could be majority vote or the average class probability.

Key advantages:

  • Improved accuracy: Random forests often achieve excellent predictive performance out-of-the-box, typically outperforming a single tree by a wide margin. In many cases, RFs are competitive with the best tuned models on a given task.
  • Robustness: They are more robust to overfitting than single trees; using an adequate number of trees \(B\) (e.g. 500 or more) ensures the ensemble won’t overfit simply by adding more trees (in fact, as \(B \to \infty\), the generalization error converges to a limit). We still need to tune parameters like \(m_{\text{try}}\) and sometimes the tree depth (via minimum node size or maximum depth), but random forests are less sensitive to these choices than boosting methods.
  • Out-of-bag (OOB) error: A useful byproduct of the bootstrap procedure is that for each tree, about one-third of the training instances are left out (not in that bootstrap sample). These are “out-of-bag” for that tree. We can predict those OOB instances with the tree (or the forest) and get an OOB error estimate by averaging over all trees. This OOB error is an internal cross-validation estimate of model performance – often very close to the true test error – so one may not need a separate validation set.
  • Variable importance: Random forests can provide measures of feature importance. One approach is permutation importance: randomly permute values of a given feature in the OOB data and see how much the prediction error increases (a large increase suggests the feature was important). Another is to accumulate the total decrease in node impurity (SSE for regression) attributable to each feature across the forest. These measures give a high-level view of which predictors are driving the model’s predictions.

Limitations: The main downside is the loss of interpretability compared to a single tree. It’s difficult to interpret the ensemble of hundreds of trees in a simple way – one cannot easily summarize the relationship as a set of rules or coefficients. Partial dependence plots or feature importance metrics help, but they provide only aggregate insights. Random forests are also more computationally intensive than a single tree (though they are embarrassingly parallel since trees can be grown independently, and modern implementations like ranger in C++ are quite fast). For extremely large datasets (especially with many features), training can be slow or memory-intensive, but random forests scale reasonably well for moderate-sized data.

Implementation in R (Random Forest): The original implementation is in the randomForest package, and a more recent efficient implementation is ranger. We will use ranger for speed:

install.packages("ranger")
library(ranger)

# Fit a random forest on Boston housing data
rf_model <- ranger(medv ~ ., data = train_data, 
                   num.trees = 500, mtry = 6, importance = "permutation")
print(rf_model)

This will output something like: 500 trees, OOB prediction error (MSE) = 10.5. The OOB MSE can be compared to the test MSE we computed for a single tree. Typically, the random forest’s OOB error is much lower (indicating a better model). For instance, if our single tree had MSE around 20 on the test set, the random forest might have MSE around 10–12, roughly halving the error. In terms of \(R^2\), the single tree might be around 0.5, whereas the random forest might be 0.85 or higher on the same data – a dramatic improvement in predictive power.

In the code above, we chose mtry = 6 (for 13 predictors, \(p/3 \approx 4.3\), we tried a slightly larger value to see if it helps). We could tune mtry by trying a few values and checking OOB error, or use caret to tune it. The argument importance = "permutation" tells ranger to compute permutation importance for features.

We can examine which variables were important:

rf_model$variable.importance

This might show, for example, that lstat and rm are the top two important features (aligning with what we saw in the linear model and tree), followed by others like nox (nitric oxide concentration) or dis (distance to employment centers), etc. Variable importance gives a sense of which predictors contribute most to reducing prediction error in the forest.

To use the random forest model for prediction on the test set:

rf_preds <- predict(rf_model, data = test_data)$predictions
rf_mse <- mean((rf_preds - test_data$medv)^2)
rf_mse

The test MSE should be close to the OOB estimate. In our hypothetical example, if OOB MSE was ~10.5 (in \(1000^2\) units), the test MSE might be similar, indicating good generalization. The improvement over the single tree is clear – random forests often achieve significantly lower error by averaging many overfit trees, thereby canceling out the noise each tree picks up.

Using caret or mlr3: One can also use caret to train a random forest (method = "rf" uses the randomForest package by default) and tune mtry. The mlr3 framework likewise has learners for random forests and can tune hyperparameters with grid search, random search, or other strategies: in mlr3 you would create a ranger learner with lrn("regr.ranger") (from mlr3learners), define a search space over mtry, and run a tuner from mlr3tuning (e.g. grid search) with cross-validated resampling. A minimal caret version is sketched below.
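
In the caret sketch, mtry is the only tuning parameter exposed for method = "rf"; ntree is passed through to randomForest, and the grid values are illustrative:

# Tune mtry for a random forest with 5-fold cross-validation via caret
set.seed(1)
rf_tune <- train(medv ~ ., data = train_data,
                 method = "rf",
                 tuneGrid = expand.grid(mtry = c(3, 4, 6, 8)),
                 trControl = trainControl(method = "cv", number = 5),
                 ntree = 300)
rf_tune$bestTune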

Interpretability considerations: While the forest as a whole is complex, tools like vip (Variable Importance Plot) and pdp (Partial Dependence Plot) can help interpret it. For instance, a partial dependence plot for lstat (holding other variables at average values) might show that predicted house prices sharply decrease as lstat increases from low values, then level off at high lstat – capturing a nonlinear effect that a linear model would approximate with a single coefficient. Similarly, one could examine partial dependence of rm to see its effect. These plots provide a way to explain the model’s behavior feature by feature, which is useful for presenting results to stakeholders.
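
As a minimal sketch using only the fitted ranger model and base R (the vip and pdp packages provide more polished versions of both plots):

# Permutation importances computed by ranger, largest first
imp <- sort(rf_model$variable.importance, decreasing = TRUE)
barplot(imp, las = 2, main = "Permutation importance (random forest)")

# Manual partial dependence of predicted medv on lstat: set lstat to each grid
# value for every training row, predict, and average the predictions
lstat_grid <- seq(min(train_data$lstat), max(train_data$lstat), length.out = 50)
pd_vals <- sapply(lstat_grid, function(v) {
  tmp <- train_data
  tmp$lstat <- v
  mean(predict(rf_model, data = tmp)$predictions)
})
plot(lstat_grid, pd_vals, type = "l",
     xlab = "lstat", ylab = "Average predicted medv (partial dependence)")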

From a causal perspective, random forests are generally used for prediction rather than inference. However, there are methods like causal trees (Athey & Imbens, 2016) and causal forests (Wager & Athey, 2018) that modify the splitting criteria to estimate treatment effects in experimental or observational data, using the random forest idea to detect heterogeneity in effects. Those are beyond our scope, but it’s worth noting that tree ensembles have roles in modern causal analysis (e.g., for propensity score estimation or meta-learning frameworks like double ML).

Gradient Boosting Machines (GBM) and XGBoost

Gradient boosting is another powerful ensemble technique. Unlike the parallel trees in bagging (random forests), boosting builds trees sequentially, each one trying to correct the errors of the previous ensemble. Pioneered by Friedman (2001), gradient boosting views the task of fitting a model as an optimization problem: we start with a simple model (like predicting the mean of \(y\) for all observations), then iteratively add new “weak” models (such as small trees) that point in the direction of the gradient of the loss function (hence gradient boosting). In each iteration, a new tree is fit to the residuals (the remaining errors) of the current model. For least squares regression, this is equivalent to fitting a tree to the difference \(y - \hat{y}\) (which is the negative gradient of MSE loss). This new tree is then added to the existing model with a scaling factor called the learning rate \(\nu\) (a small number, e.g. 0.1), which controls how fast we correct the errors. By sequentially “boosting” the model in the direction of steepest descent of the loss, the method can fit very complex functions additively as a sum of many simple trees.

Key characteristics of boosting:

  • Many shallow trees: Typically, each individual tree in a boosting model is limited in depth (e.g., 3–8 terminal nodes) to ensure the trees remain weak learners. The strength comes from the accumulation of many such trees. The final model might be an ensemble of hundreds or even thousands of small trees.
  • Learning rate: The learning rate \(\nu\) (also called shrinkage) slows down the learning by scaling the contribution of each new tree. A smaller \(\nu\) (like 0.01) usually requires more trees but can yield better generalization, whereas a larger \(\nu\) (0.1 or 0.2) learns faster but might overfit if too many trees are added. The learning rate is a crucial regularization parameter in boosting.
  • Regularization: In addition to the learning rate, modern boosting implementations include other regularization: subsampling (Stochastic Gradient Boosting, where each tree is fit on a random subset of data, adding randomness similar to bagging), column subsampling (like in random forest, using a subset of features for each tree), and direct penalties on tree complexity (e.g., XGBoost can penalize the L2 norm of leaf weights). These techniques help prevent overfitting even after many boosting rounds.

XGBoost (eXtreme Gradient Boosting, Chen & Guestrin 2016) is a popular high-performance implementation of gradient boosting that includes efficient handling of sparse data, built-in regularization, and parallelization. XGBoost has seen wide adoption and success in data science competitions and real-world applications. For example, in 2015 Kaggle competitions, 17 out of 29 winning solutions used XGBoost (8 of them using it as the sole method), establishing it as a de facto choice for many tabular data problems. Gradient boosted trees often outperform random forests in predictive accuracy because boosting can adapt bias-variance tradeoff more finely (though this comes at the cost of more hyperparameters to tune).

Implementation in R (Boosting): The xgboost package provides a low-level interface to train models. Alternatively, one can use caret with method = "xgbTree", or the tidymodels interface (parsnip’s boost_tree() with the xgboost engine). Here we illustrate using xgboost directly on the Boston housing data:

install.packages("xgboost")
library(xgboost)

# Prepare data matrices for XGBoost (requires numeric matrices)
train_matrix <- data.matrix(train_data[ , setdiff(names(train_data), "medv") ])
train_label <- train_data$medv
test_matrix  <- data.matrix(test_data[ , setdiff(names(test_data), "medv") ])
test_label  <- test_data$medv

# Set up parameters for XGBoost
params <- list(objective = "reg:squarederror",  # squared-error regression (replaces deprecated "reg:linear")
               eta = 0.1,                # learning rate
               max_depth = 3,            # max depth of trees
               subsample = 0.8,          # subsample 80% of rows for each tree
               colsample_bytree = 0.8)   # subsample 80% of features for each tree

# Train XGBoost model with 100 trees (rounds)
set.seed(1)
xgb_model <- xgboost(data = train_matrix, label = train_label,
                     params = params, nrounds = 100, verbose = FALSE)
# Predict on test data
preds <- predict(xgb_model, test_matrix)
xgb_test_mse <- mean((preds - test_label)^2)
xgb_test_mse

We would compare xgb_test_mse to our previous models’ performance. Typically, a well-tuned XGBoost model could achieve even lower error than a random forest on many datasets. For instance, on Boston housing data, one might push the \(R^2\) to ~0.90 with careful boosting (whereas our linear model was ~0.74 and random forest ~0.85, hypothetically). The advantage of boosting is its ability to model complex relationships by iteratively focusing on the remaining errors. The disadvantage is that it can overfit if we add too many trees or use too high a learning rate without sufficient regularization. Therefore, it’s common to use cross-validation to find the optimal number of boosting rounds (trees) – for example, using xgb.cv() with an early stopping criterion.
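
A sketch of that workflow, reusing the params list and training matrices from above (the 5 folds and the 20-round patience are illustrative choices):

# Choose the number of boosting rounds by cross-validation with early stopping
set.seed(1)
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
cv_res <- xgb.cv(params = params, data = dtrain, nrounds = 1000, nfold = 5,
                 early_stopping_rounds = 20, verbose = FALSE)
cv_res$best_iteration   # suggested nrounds for the final model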

Tuning hyperparameters: Boosting has several hyperparameters: the number of trees (nrounds), the learning rate (eta), tree depth (max_depth), and others like min_child_weight (minimum sum of instance weights in a leaf), gamma (minimum loss reduction required to make a split), subsample, and colsample_bytree. It’s important to tune these. A common strategy is to start with a relatively large eta (like 0.1) and tune max_depth and nrounds, possibly using early stopping to decide when to stop adding trees. Then one might lower eta and increase nrounds for a final model. The caret package can perform a grid search over a subset of these parameters. For example:

grid <- expand.grid(nrounds = c(100, 200, 500),
                    max_depth = c(3, 5, 7),
                    eta = c(0.1, 0.3),
                    gamma = 0,
                    colsample_bytree = 0.8,
                    min_child_weight = 1,
                    subsample = 0.8)
set.seed(1)
xgb_tune <- train(medv ~ ., data = train_data,
                  method = "xgbTree",
                  tuneGrid = grid,
                  trControl = trainControl(method="cv", number=5))
xgb_tune$bestTune

This might tell us, for example, that max_depth = 3, eta = 0.1, nrounds = 200 was best (just an illustration). We would then use those parameters to fit the final model on the full training data.

Interpretability: Like random forests, gradient boosting yields a black-box model. But we can extract variable importance (e.g., using xgb.importance() to see which features had the highest importance across splits). We can also use partial dependence plots to visualize how changing one feature affects predictions on average, or compute SHAP values for more nuanced explanations of individual predictions. These tools can help translate the model’s complexity into understandable insights. For example, we might find that the partial dependence of lstat is strongly nonlinear (steep drop then leveling off), indicating a diminishing returns effect of improving neighborhood status on housing prices – an insight that aligns with what a GAM or domain knowledge might also suggest.
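
For example, pulling the importance table from the fitted booster:

# Gain-based feature importance from the trained XGBoost model
imp_mat <- xgb.importance(model = xgb_model)
head(imp_mat)
xgb.plot.importance(imp_mat, top_n = 10)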

Use in social science context: Boosted trees are typically used for prediction tasks (e.g., predicting economic indices, risk scores, etc.) where accuracy is the primary goal. If causal interpretation is needed, one must be careful; a boosted model might fit patterns that are predictive but not causal. That said, one can use boosting in supportive roles – for instance, to impute missing data, to model propensity scores in observational studies, or as a way to explore nonlinear relationships which are later examined with simpler models for causality. In purely predictive competitions or applications (like predicting which individuals will respond to an intervention), gradient boosting often shines. The key is to guard against overfitting with appropriate tuning and validation.

Caution: Because boosting is so powerful, it is possible to memorize noise if the model is made too complex. Always monitor training vs. validation performance: as you add more trees, the training error will keep dropping, but the validation error will typically decrease to a point and then start increasing when overfitting sets in. Using techniques like early stopping (stop adding trees when validation error hasn’t improved for a certain number of rounds) is common. Additionally, ensure that the data is randomly shuffled (for cross-validation) and consider using a separate test set to double-check performance.

In summary, gradient boosting machines (GBMs) are among the most accurate and versatile modeling techniques for regression and classification. With libraries like XGBoost, LightGBM, and CatBoost, practitioners have access to highly optimized implementations. While they require careful tuning and are less interpretable, their predictive performance often justifies the effort in competitive settings.

4.6 Generalized Additive Models (GAMs)

Generalized Additive Models represent an important middle ground between the interpretability of linear models and the flexibility of non-parametric models. A GAM (Hastie & Tibshirani, 1990) models the outcome as an additive combination of unspecified smooth functions of predictors:

\[ g\!\big(E[Y]\big) = \beta_0 + f_1(X_1) + f_2(X_2) + \cdots + f_m(X_m)\,, \]

where \(g\) is a link function (identity for regular regression, or logit, etc. for generalized outcomes) and each \(f_j\) is a smooth function fitted from the data (often using splines or other basis expansions, subject to a smoothness penalty). In a regression context with identity link, this is: \(y = \beta_0 + f_1(x_1) + \cdots + f_m(x_m) + \varepsilon\). The key idea is that we relax the linearity assumption by allowing each predictor to have a nonlinear effect, but we keep the additivity (no automatic interactions between different \(f_j\) terms, unless we explicitly include an interaction function). This way, the model remains relatively interpretable—one can plot \(f_j(x_j)\) to see how predictor \(X_j\) influences the response \(Y\), holding other variables constant.

Theory and Assumptions: By being additive, GAMs assume no strong interactions between predictors (unless we add interaction terms or use tensor product smooths for multiple variables together). If strong interactions exist and are not modeled, a GAM can miss those patterns (leading to bias). However, often additivity is a reasonable assumption or can be a starting point. Each smooth function \(f_j\) is typically represented using basis functions (e.g., spline basis) and fitted with a roughness penalty to avoid overfitting. In R’s mgcv package (by Simon Wood), GAMs use penalized regression splines by default, and the amount of smoothing (the effective degrees of freedom for each \(f_j\)) is chosen by the model fitting process (usually by minimizing a criterion like generalized cross-validation or using restricted maximum likelihood). This means the user often does not need to manually choose the smoothing parameter; the algorithm will find an optimal balance between fit and smoothness for each term.

Strengths: GAMs are interpretable — one can examine each \(f_j\) plot to understand the relationship between \(X_j\) and \(Y\). They can model nonlinear relationships that linear models would miss, thus reducing bias when nonlinearities are present, while still using a form of regularization (spline smoothing penalties) to avoid excessive variance. They strike a good bias-variance balance in many situations. Moreover, GAMs naturally handle mixtures of continuous and categorical predictors (categorical predictors can be included as linear terms or factors, equivalent to step functions for those variables). They also extend generalized linear models (GLMs), so one can use them for binary outcomes (logistic GAM), count outcomes (Poisson GAM), etc., by choosing an appropriate link function \(g\).

Limitations: The additivity means GAMs will not automatically capture interactions between predictors (unless you explicitly add interaction terms like \(f_{ij}(X_i, X_j)\)). If important interactions exist (e.g., maybe the effect of income on happiness differs by gender), a pure additive model will miss that unless you include a specific term for it. Also, while more interpretable than, say, a random forest, GAMs are less succinct than a linear model — instead of a single coefficient, each feature’s effect is a curve that must be visualized. There is an art to choosing the type of smooth (e.g., thin plate spline vs. cubic spline vs. loess) and the basis dimension (how flexible it can be). If the smoothing parameter selection chooses too much flexibility, the model could overfit wiggly patterns; if too restrictive, it might underfit. Diagnostic tools in mgcv (like checking the estimated degrees of freedom and the significance of smooth terms, residual plots, concurvity checks for multicollinearity in smooths) are important to validate the model.

Implementation in R (GAM with mgcv): The mgcv package’s main function is gam(). Suppose we suspect that in the Boston housing data the relationships between medv and some predictors are nonlinear (as is likely for lstat and rm). We can fit a GAM:

# mgcv ships with R as a recommended package, so installation is usually unnecessary
# install.packages("mgcv")
library(mgcv)

# Fit on the training split so the later test-set evaluation is not contaminated;
# smoothness for each s() term is selected automatically (GCV by default; method = "REML" is a common alternative)
gam_model <- gam(medv ~ s(lstat) + s(rm) + rad + tax, data = train_data)
summary(gam_model)

In this formula, s(lstat) means we include a smooth function of lstat, and s(rm) a smooth function of rm. We left rad (index of accessibility to highways) and tax (property tax rate) as linear terms for illustration – perhaps because rad is an index with only a few values (we might treat it as categorical or linear) and tax might have a roughly linear effect. The summary of gam_model will show something like:

Approximate significance of smooth terms:
            EDF  Ref.df      F   p-value
s(lstat)    3.7     4.5   42.6   < 2e-16 ***
s(rm)       1.0     1.0   18.3   3.2e-05 ***

(The parametric terms rad and tax appear in a separate block with ordinary coefficients, standard errors, and p-values.)

Here, EDF stands for effective degrees of freedom for each smooth. For s(lstat), EDF ~ 3.7 suggests it used a spline with about 4 degrees of freedom (so the function is nonlinear). The p-value < 2e-16 indicates the smooth is highly significant. For s(rm), EDF ~ 1.0, meaning the model found that a linear function in rm was sufficient (or it heavily penalized any wiggliness, effectively making it linear). The linear terms rad and tax would have ordinary regression coefficients in the summary.

We can visualize the fitted smooth functions:

par(mfrow = c(1, 2))                      # two panels, one per smooth term
plot(gam_model, shade = TRUE, se = TRUE)  # shaded confidence bands around each fitted curve

This would produce plots of \(f(\text{lstat})\) and \(f(\text{rm})\). For example, \(f(\text{lstat})\) might show a sharply declining curve for low lstat values that levels off as lstat becomes high (indicating that reducing the percentage of lower-status population from 20% to 10% has a big effect on home prices, but below 5% there is little additional gain — a diminishing returns phenomenon). The \(f(\text{rm})\) plot might appear roughly as a straight line (since EDF ~ 1), suggesting a linear relationship between rm (number of rooms) and medv in this model, or possibly a gentle curve if EDF was slightly above 1. These plots are easily communicable: they show the shape of the relationship, with confidence bands (if se=TRUE) indicating uncertainty.

To evaluate the GAM’s predictive performance, we could do:

gam_preds <- predict(gam_model, newdata = test_data)
gam_mse <- mean((gam_preds - test_data$medv)^2)   # test-set MSE for the GAM
gam_mse

We might find that the GAM improves slightly on the linear model if those nonlinear effects matter, i.e., a lower test MSE than the linear model (whose \(R^2\) was roughly 0.74 in our earlier, hypothetical comparison). However, a well-tuned random forest or boosted model might still outperform the GAM in pure prediction, because those methods can capture interactions and more complex response surfaces. The GAM's advantage is that it provides an interpretable summary of the relationships.

Use cases in social science: GAMs have been used in many fields. For example, in epidemiology, one might model how air pollution levels affect health outcomes with a smooth function (since the relationship might not be strictly linear). In economics, one might allow for a nonlinear effect of age on income (wage profiles often rise then fall) or of education on income (diminishing returns to schooling). GAMs allow the data to suggest the form of these curves rather than prespecifying a polynomial of a certain degree. Importantly, because GAMs still have a notion of degrees of freedom, one can perform approximate hypothesis tests on whether a nonlinear term is needed (e.g., compare a model with \(s(x)\) to one with just \(x\) linearly, or check if EDF is significantly greater than 1). This can guide theory: if the data suggests a nonlinear effect, researchers might theorize reasons for it.
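
As a hedged sketch, such a comparison for the lstat term could look like this in mgcv (the F-test for nested GAMs is approximate and best read as a rough guide rather than an exact test):

# Is a smooth of lstat needed, or does a linear term suffice?
gam_lin <- gam(medv ~ lstat + s(rm) + rad + tax, data = train_data)
gam_smo <- gam(medv ~ s(lstat) + s(rm) + rad + tax, data = train_data)
anova(gam_lin, gam_smo, test = "F")   # approximate test of the extra flexibility
AIC(gam_lin, gam_smo)                 # information-criterion comparison as a cross-check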

Regularization in GAMs: Spline smoothness selection is itself a form of regularization (for example, a penalty on the integrated squared second derivative of each \(f_j\)). The penalty prevents overfitting by discouraging wiggly curves unless the data strongly warrant them. Note that the standard penalty leaves the linear part of each smooth unpenalized, so an irrelevant predictor is typically shrunk toward a straight (often nearly flat) line with an EDF of about 1 rather than removed outright; with mgcv's select = TRUE option, an additional penalty allows a smooth's EDF to be driven essentially to zero, effectively dropping the term. In that sense a GAM can zero out a predictor's effect, analogous to how the lasso zeroes out coefficients, although the default behavior is closer to ridge (continuous shrinkage rather than a hard threshold). There are also extensions such as sparse GAMs that combine lasso-type selection with smooth estimation, but in practice the significance of the smooth terms is often a sufficient guide.
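
In mgcv this kind of term-wise shrinkage is available directly via select = TRUE, which adds a second penalty so that an entire smooth can be shrunk to (effectively) zero. A minimal sketch, with an illustrative set of Boston predictors:

# The extra penalty also acts on each smooth's otherwise-unpenalized linear part,
# so irrelevant terms can end up with effective degrees of freedom near zero
gam_sel <- gam(medv ~ s(lstat) + s(rm) + s(tax) + s(dis),
               data = train_data, method = "REML", select = TRUE)
summary(gam_sel)   # terms with edf close to zero contribute essentially nothing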

Finally, as with any model, checking diagnostics is important. For GAMs, one can plot residuals vs fitted values to see if patterns remain, use gam.check() to see if basis dimensions were sufficient (it will indicate if more flexibility might improve the fit), and look at concurvity (GAM’s version of multicollinearity) if there are concerns that two smooths might be jointly identifying the same pattern.
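
In code, these diagnostics are one-liners (their interpretation, of course, depends on the fitted model):

gam.check(gam_model)    # residual plots plus a check that each basis dimension (k) was large enough
concurvity(gam_model)   # values near 1 flag smooths that overlap in what they explain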

In summary, GAMs provide a powerful and relatively interpretable way to discover and model nonlinear relationships. They are a valuable tool in the analyst’s toolkit when linear models are too restrictive and black-box machine learning models are too opaque or not needed for the task.

4.7 Model Tuning, Evaluation, and Comparison

Throughout this chapter, we emphasized the importance of tuning hyperparameters (like \(\lambda\) in ridge/lasso, \(C\) and \(\epsilon\) in SVR, the depth and number of trees in ensembles, degrees of freedom in GAMs, etc.) using techniques like cross-validation. In practice, one should perform model selection by comparing models on a hold-out validation set or via cross-validated performance, rather than relying solely on training error or theoretical appeal. For example, one might compare a linear model, a lasso, a random forest, and an XGBoost model on the same dataset to see which yields the lowest predictive error (using 10-fold CV or a separate test set). Often, the more flexible models (random forests, XGBoost) win on pure prediction metrics. However, they might be harder to interpret. Thus, the choice may depend on context: if we need to explain the relationship or test hypotheses (as in many social science settings where causal understanding is the goal), a simpler or at least an interpretable model (like linear regression or GAM) might be preferable so we can clearly communicate the effects. If we primarily need accurate predictions (say, for predicting an outcome to allocate resources), a complex ensemble might be the best choice.
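
As an illustration, a hedged sketch of such a comparison with caret, reusing the train_data split from earlier examples (method = "glmnet" fits an elastic net, which includes the lasso as a special case; setting the same seed before each call keeps the cross-validation folds identical so the resampling results are comparable):

library(caret)
ctrl <- trainControl(method = "cv", number = 10)

set.seed(1); fit_lm  <- train(medv ~ ., data = train_data, method = "lm",      trControl = ctrl)
set.seed(1); fit_net <- train(medv ~ ., data = train_data, method = "glmnet",  trControl = ctrl)
set.seed(1); fit_rf  <- train(medv ~ ., data = train_data, method = "ranger",  trControl = ctrl)
set.seed(1); fit_xgb <- train(medv ~ ., data = train_data, method = "xgbTree", trControl = ctrl)

# Side-by-side summary of cross-validated RMSE and R^2 for the four models
comparison <- resamples(list(linear = fit_lm, elastic_net = fit_net,
                             random_forest = fit_rf, xgboost = fit_xgb))
summary(comparison)
dotplot(comparison, metric = "RMSE")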

Key topics in evaluation include:

  • Cross-Validation: Use \(k\)-fold CV or repeated CV to estimate out-of-sample performance for different models and parameter settings, typically with metrics such as mean squared error (MSE) or root MSE for regression. In R, packages like caret automate this process: train() resamples automatically (bootstrapping by default, or k-fold CV when requested via trainControl()) and selects the best parameter setting. The mlr3 ecosystem provides a more programmable framework for the same workflow with greater customization. The goal is to estimate how the model would perform on new data by never training and testing on the same data points.
  • Learning Curves and Bias-Variance Diagnostics: We can also diagnose a model by plotting training error and validation error as model complexity (or the amount of training data) varies. For ridge regression, as \(\lambda\) increases the training error rises steadily because the fit is more heavily constrained, while the validation error typically traces a U-shape: it first falls as variance is reduced and then rises as bias dominates. For a random forest or boosting, we can monitor how the OOB/validation error changes as we add more trees. If the validation error starts rising after a point, further complexity is overfitting. Conversely, if the validation error is still decreasing when we stop (and training error is near zero), we could likely afford a more complex model or would benefit from more data. These diagnostics help determine whether a model is high-bias (underfitting) or high-variance (overfitting).
  • Feature Selection: Aside from automated methods like lasso, one can perform manual or stepwise feature selection. However, stepwise selection can be risky (it may select different features in different folds, and the statistical inference after selection is complicated). In a predictive context, it is often better to use regularization or tree importance or domain expertise to select features. For instance, one might use a random forest to rank variables by importance and drop those below a certain threshold to simplify a model. Or use lasso to screen variables, then use the subset in a final linear model (this is sometimes done to balance prediction and interpretability). Always be cautious: removing too many features might remove signal, but keeping all features might lead to overfitting or unnecessary complexity.
  • Model Interpretability: If a black-box model (like an ensemble) is chosen for its accuracy, consider applying post-hoc interpretation tools to extract insights. Partial Dependence Plots (PDPs) show the marginal effect of one feature on the predicted outcome, averaging over all other features. For example, a PDP might reveal that a model's predictions of income flatten out after 16 years of education, suggesting diminishing returns to education. Individual Conditional Expectation (ICE) plots go further by showing lines for individual instances' predictions as a function of a feature (to detect heterogeneity in the effect). SHAP values (from cooperative game theory) provide an additive attribution of each feature's contribution to an individual prediction; they have the advantage of a solid theoretical foundation and consistency. In R, packages like shapley or fastshap can compute SHAP values for any model. These tools help explain why the model made certain predictions, which is increasingly important for trust and accountability (e.g., in finance or healthcare applications of ML). A brief sketch of these tools appears after this list.
  • Causal Inference vs. Prediction: It’s important to reiterate the difference between explaining and predicting (Shmueli, 2010). A model that is excellent at prediction may use complex patterns that are not causal; using it to infer “what if” scenarios can be misleading. Conversely, a simple interpretable model might be preferred to answer a causal question, even if a black-box would predict better. Sometimes, researchers use machine learning within a causal analysis pipeline: for example, using lasso to select control variables for an observational study, or using random forests to estimate propensity scores for matching. These hybrid approaches leverage the strength of ML in dealing with many variables or nonlinearities, but then combine with traditional inferential techniques. It is an active research area (e.g., the field of “double ML” or targeted learning) to ensure unbiased estimation of causal effects while using flexible ML models. In practice, be clear about the goal: for causal interpretation, lean towards models that are transparent and grounded in theory; for prediction, feel free to use more complex models but validate them well and use interpretation tools to verify they make sense.
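
A hedged sketch of these model-agnostic tools, applied here to the random forest fit (fit_rf) from the comparison sketch above; fastshap approximates SHAP values by Monte Carlo sampling and needs a prediction wrapper:

library(pdp)
library(fastshap)

# Partial dependence of the model's predictions on lstat (averaged over other features)
pd <- partial(fit_rf, pred.var = "lstat", train = train_data)
plotPartial(pd)

# ICE curves: one line per observation instead of the average
ice <- partial(fit_rf, pred.var = "lstat", train = train_data, ice = TRUE)
plotPartial(ice)

# Approximate SHAP values for each observation and feature
pred_fun  <- function(object, newdata) predict(object, newdata = newdata)
X         <- subset(train_data, select = -medv)   # feature columns only
shap_vals <- fastshap::explain(fit_rf, X = X, pred_wrapper = pred_fun, nsim = 50)
head(shap_vals)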

4.8 Conclusion

In this chapter, we surveyed a spectrum of regression models in R, from the venerable linear regression to modern machine learning techniques like random forests and gradient boosting. Each method has its strengths: linear models and GAMs offer interpretability and are useful for inference and communicating insights, whereas support vector machines and tree-based ensembles offer high flexibility and often superior predictive performance (at the cost of opacity). Regularization methods like ridge and lasso provide a middle ground, improving linear models’ performance in high-dimensional settings and performing feature selection for interpretability. Throughout, we emphasized critical concepts such as the bias-variance tradeoff and the importance of proper model tuning and validation.

For a practitioner, the choice of model should be guided by the problem context. If the task is to predict an outcome as accurately as possible (e.g. predicting economic indicators or individual risk scores), machine learning models like ensembles or SVR might be the best choice – and indeed we saw that these can dramatically reduce prediction error compared to a simple linear baseline. If the task is to explain relationships or test hypotheses (e.g. does education affect income, and by how much?), one might start with a linear model or a GAM for a more transparent view of the data-generating process, perhaps supplemented by regularization to handle many covariates, and use the insights from more flexible models as a supplement rather than a replacement. Often, a combination of approaches yields the richest understanding: for instance, using a random forest to identify important variables and potential interactions, then carefully modeling those with a parametric or semiparametric model for interpretability.

Finally, we stress the importance of sound evaluation. All models can be made to look good on the training data; the real test is performance on new data. Techniques like cross-validation and the use of independent test sets (when available) are indispensable to prevent being misled by an overfit model. When communicating results, especially in a social science context, be clear about whether a model’s results are correlational or have a causal interpretation, and what the limitations of the modeling approach are. For example, a random forest might tell us that variable X is a strong predictor of Y, but we should not immediately interpret that as “X causes Y” without further analysis.

The R code examples provided in this chapter offer a template for how to implement, tune, and interpret each type of model. The R ecosystem for machine learning is rich: beyond the packages we used, there are the tidymodels collection (which unifies modeling syntax), deep learning packages (keras, torch), Bayesian modeling tools (brms, rstanarm for Bayesian GLMs), and specialized packages for time-series, spatial data, etc. The principles discussed here will help in understanding and applying those advanced tools as well.

In conclusion, regression models remain a fundamental toolkit for data science. Whether one’s goal is predicting the future or understanding the past (to paraphrase Shmueli, 2010), choosing the appropriate regression technique and using it wisely is key to success. We encourage the reader to practice with the provided code, explore variations (try more predictors, other datasets, tune parameters), and consult the references for deeper theoretical grounding and extensions.

References

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. DOI: 10.1023/A:1010933404324.

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 785–794. DOI: 10.1145/2939672.2939785.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232. DOI: 10.1214/aos/1013203451.

Hastie, T. J., & Tibshirani, R. J. (1990). Generalized Additive Models. London: Chapman and Hall.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R (2nd ed.). New York: Springer.

Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), 1–26. DOI: 10.18637/jss.v028.i05.

Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.

Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York: Springer.

Wood, S. N. (2017). Generalized Additive Models: An Introduction with R (2nd ed.). Boca Raton, FL: CRC Press.

Wright, M. N., & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), 1–17. DOI: 10.18637/jss.v077.i01.

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.