8 Neural Networks in Social Science Research
In recent years, deep neural networks have achieved remarkable success across domains such as computer vision, speech recognition, and natural language processing – tackling problems that had resisted the best attempts of the AI community for many years. This rapid progress has been driven by increases in computing power and the availability of large datasets. The uptake of neural network methods in the social sciences, however, has been relatively slow. Social science researchers have begun to adopt machine learning techniques, often using them to construct proxy variables or to improve prediction in empirical studies. For example, in a review of top political science journals from 2018–2020, Knox et al. (2022) identified 48 papers that employed statistical learning or other computational methods, and in about 68% of those cases the machine learning was used to first estimate a proxy of a latent concept (which would then be used in subsequent analysis). Tellingly – and illustrating that recent breakthroughs in deep learning and the growing use of computational methods in social science have largely occurred in parallel rather than in tandem – only one of those 48 studies used a neural network (the rest used methods like random forests, SVMs, etc.).
Several factors may explain social scientists’ hesitance to embrace neural networks. Traditional statistical learning approaches (e.g., penalized regression, decision trees) have well-understood statistical properties, making it easier to quantify uncertainty and correct biases, whereas neural networks have been viewed as more of a “black box.” Additionally, until recently, neural networks demanded more computational resources and training data than many social science applications could readily supply. Social science datasets are often modest in size (hundreds or thousands of observations) compared to the massive datasets used to train state-of-the-art deep learning models. Concerns about overfitting, interpretability, and the complexity of model tuning further contribute to caution in applying neural nets to social data. In short, machine learning has largely been used in social science for prediction tasks (where black-box accuracy can be useful) rather than for modeling theoretical relationships.
Despite these challenges, there is a growing recognition that neural networks can complement and extend the toolkit of social scientists. Neural networks are universal function approximators, capable of learning complex non-linear relationships that might be missed by traditional linear or additive models. They excel at predictive tasks, especially with high-dimensional or unstructured data (such as text, images, or networks) where feature engineering is difficult. Indeed, when predictive accuracy is the primary goal (rather than estimating interpretable causal effects), the benefits of more flexible, non-linear models can outweigh the loss of interpretability. Early work in political science demonstrated that a neural network model could uncover structural patterns in international conflict data and substantially improve out-of-sample forecast accuracy over prior approaches. As data sources proliferate and computational barriers recede, neural networks are becoming increasingly viable for social science research problems.
This chapter provides an in-depth exploration of neural network models in the context of social science applications, written as an executable R Markdown document. We cover both predictive modeling (e.g., forecasting or classification tasks) and causal inference (estimating the effects of interventions or treatments) use cases for neural networks. We begin with the theoretical foundations of different neural architectures – including feedforward fully-connected networks, convolutional neural networks, and recurrent (sequence) networks – and discuss how these can be utilized in social science research. We then demonstrate implementation in R, using packages such as keras, tensorflow, and torch. Throughout, we address important practical topics: data preprocessing, choice of activation and loss functions, the backpropagation algorithm and training optimization (SGD, Adam, etc.), regularization strategies to prevent overfitting (dropout, L2 weight decay), and techniques for model evaluation. We also discuss strategies for interpretability and explainability, which are crucial for the adoption of neural nets in domains where understanding the basis of a prediction is as important as the prediction’s accuracy. Additionally, we compare neural networks with more traditional statistical models in terms of accuracy, flexibility, and transparency, highlighting scenarios where neural networks add value – and where they may not. Finally, we consider practical challenges such as small sample sizes and class imbalance that are often encountered in social science data, and how one can adapt or mitigate these issues (for example, through transfer learning, data augmentation, or specialized architectures).
The goal is to provide researchers and graduate students with a rigorous yet accessible guide to applying neural network models in the social sciences, blending theoretical insight with hands-on example code. All code chunks in this chapter are written in R and can be executed to reproduce the results (assuming the required packages and data are available). By the end of the chapter, readers should understand both how to implement neural network analyses in R and, importantly, when and why such methods can be beneficial in social science research.
8.1 Predictive Modeling vs. Causal Inference with Neural Networks
Social science research encompasses two broad analytical goals: predictive modeling and causal inference. Prediction focuses on accurately forecasting or classifying outcomes – for example, predicting election results, identifying individuals at risk of recidivism, or classifying topics in open-ended survey responses. Causal inference, on the other hand, is concerned with estimating the effect of some treatment or intervention on an outcome – for instance, what is the impact of a job training program on subsequent earnings, or how does exposure to misinformation affect political attitudes. These goals involve fundamentally different criteria for success. Prediction is about minimizing error on new data, whereas causal inference is about isolating a credible estimate of a causal effect (often requiring control of confounding and consideration of counterfactuals).
Machine learning methods, including deep neural networks, have primarily been developed as predictive tools. Their objective is to learn patterns that generalize well to new data. In fields like computer vision or language processing, complex neural architectures have achieved stunning predictive performance by fitting flexible functions to large datasets. Social scientists have begun harnessing this predictive power for tasks such as constructing proxies for latent theoretical constructs (e.g., using a text classifier to measure the sentiment or ideology in a speech) and for improving the accuracy of forecasts in policy and economics contexts. When the goal is prediction alone, the “black box” nature of neural networks is less of a concern – a highly accurate black box can be extremely useful for tasks like predictive policing or early warning systems for social unrest, even if its inner workings are not fully transparent.
Causal inference poses additional challenges for the use of neural networks. In causal analysis, we are typically interested not just in predicting \(Y\) from \(X\), but in understanding how \(Y\) would change under an intervention (e.g., setting \(X = \text{treatment}\) versus \(X = \text{control}\)). This requires disentangling correlation from causation and ensuring that the model properly accounts for confounding factors and biases. A model that is excellent at prediction may still yield biased estimates of causal effects if, for example, it leverages spurious correlations that do not reflect true causal relationships. As Pierce (2023) notes, “many of the most striking examples of recent machine learning progress entail neural networks learning complex correlations from a large data distribution for predictive purposes, whereas a lot of social science research is more interested in studying how those observed distributions would change under a causal intervention”. In other words, social scientists often seek to model the data-generating process and counterfactual outcomes, rather than just the observed joint distribution. This distinction means that directly applying a flexible non-linear model like a neural network to observational data can lead to overfitting to noise or confounding, threatening the validity of causal conclusions.
Despite these differences, there is a growing interface between deep learning and causal inference. Recent research in computational social science and econometrics has begun to adapt neural network architectures to estimate causal effects under the potential outcomes framework (see Koch et al., 2024 for a review). For example, neural networks have been used to learn balanced representations of covariates that make treated and control groups more comparable, improving estimation of treatment effects in observational studies. Specifically, Johansson, Shalit, and Sontag (2016) demonstrated how a neural network can transform covariates into a representation space where the distributions of treated vs. control units are closer, facilitating more accurate counterfactual prediction. Subsequent work (Shalit, Johansson, & Sontag, 2017) introduced the TARNet architecture – a simple two-headed feedforward network that learns potential outcomes for treatment and control – and extensions like DragonNet that add a propensity prediction head to enforce causal identification. Other researchers have integrated deep learning into propensity score estimation, instrumental variable analysis, and heterogeneous treatment effect estimation. For instance, methods in the “meta-learner” framework (T-learners, S-learners, X-learners) can incorporate neural networks as the base learners to capture complex response surfaces, and custom architectures have been proposed for causal forests or GAN-based IV estimation. While these methods are on the cutting edge, they highlight that neural networks can be used for causal inference – but usually with additional structure or assumptions to ensure identification of causal effects. Importantly, standard practices from the causal inference literature (such as holdout validation to avoid overfitting bias, and sensitivity analyses for unobserved confounding) remain crucial when using neural networks for causal questions.
In practice, a useful division of labor is emerging: one can leverage neural networks to improve predictive tasks within a larger causal analysis. For example, one might use a neural network to predict a proxy or to impute missing data, and then use those predictions in a more traditional causal model. Or one can use deep learning to estimate nuisance functions (like propensity scores or baseline outcome functions) within frameworks like double machine learning. Double/debiased ML methods allow flexible fitting of high-dimensional nuisance parameters (using ML algorithms) while retaining theoretical guarantees (Neyman orthogonality and cross-fitting) for the causal parameter of interest. The key is recognizing that for causal inference, accuracy in fitting the observed data is not the only goal – we also need interpretability and a model structure that connects to the counterfactual question at hand.
Throughout this chapter, we will highlight when we are focusing on pure prediction and when causal interpretation is (or is not) valid. The next sections delve into different neural network architectures and their mathematical foundations, laying the groundwork for applied examples in R that follow.
8.2 Feedforward Neural Networks (Multi-Layer Perceptrons)
Theory and Mathematical Foundations
Feedforward neural networks, also known as multi-layer perceptrons (MLPs) or dense neural networks, are the quintessential deep learning architecture. These models consist of layers of interconnected artificial neurons where information flows in one direction from input to output (hence “feedforward”). Feedforward networks are general function approximators: given enough hidden units, an MLP with at least one hidden layer can approximate any continuous function arbitrarily well under mild conditions (this is the Universal Approximation Theorem). This universal approximation property underlies the power of neural networks to model complex relationships.
Neurons and Layers: The basic unit in a feedforward network is the neuron (or node), which performs a weighted linear combination of its inputs and then applies a non-linear activation function. In mathematical form, for neuron \(j\) in a given layer, the computation is:
\[ z_j = b_j + \sum_{i} w_{ij} x_i, \]
\[ a_j = f(z_j), \]
where \(x_i\) are the inputs to the neuron (these could be raw features or outputs from neurons in a previous layer), \(w_{ij}\) are the weights, \(b_j\) is a bias term, \(z_j\) is the linear combination (sometimes called the logit or pre-activation), and \(f(\cdot)\) is the activation function producing the neuron’s output \(a_j\). Common choices for \(f\) include the sigmoid \(\sigma(z) = 1/(1+e^{-z})\), the hyperbolic tangent \(\tanh(z)\), and the now-ubiquitous ReLU (Rectified Linear Unit) \(f(z) = \max(0, z)\), among others. Non-linear activations are critical; without them, multiple layers would collapse into an equivalent single linear model.
A feedforward network is organized into layers: an input layer that takes the features (covariates) as inputs, one or more hidden layers of neurons that transform the inputs into intermediate representations, and an output layer that produces the final prediction. Each neuron in a layer typically connects to every neuron in the next layer (a fully-connected network). For example, a classic MLP architecture for a binary classification problem might have an input layer with \(d\) inputs, one hidden layer with \(h\) neurons, and an output layer with a single neuron producing a probability (via a sigmoid activation). This architecture would contain \((d \times h)\) weights between the input and hidden layer, plus \(h\) bias terms in the hidden layer, and \(h\) weights plus one bias in the output layer.
Forward Propagation: During a forward pass, data \(X\) is fed into the input layer and transformed layer by layer through these weighted sums and activations to produce an output \(\hat{y}\). This output could be a scalar (for regression or binary classification) or a vector of class probabilities (for multi-class classification, often obtained via a softmax activation in the output layer). The capacity of the network (its ability to fit complex functions) can be increased by adding more neurons or more layers, at the cost of greater computational demand and risk of overfitting.
Illustrative Example – XOR Problem: A simple example of why hidden layers are useful is the classic XOR problem. Consider two binary inputs \(x_1, x_2\) and an output \(y\) that should be 1 if exactly one of \(x_1, x_2\) is 1 (exclusive or), and 0 otherwise. A linear model (logistic regression) cannot capture this non-linear pattern – it will effectively draw a single separating line which cannot separate the XOR cases. However, a two-layer neural network with a few hidden neurons can learn this pattern by creating intermediate features (hidden layer outputs) that act like logical subfunctions. This demonstrates that even for small problems, an MLP can capture interactions that a linear model would miss.
Mathematically, an MLP with one hidden layer can be written as:
\[ \mathbf{h} = f(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}), \]
\[ \hat{\mathbf{y}} = g(\mathbf{W}^{(2)} \mathbf{h} + \mathbf{b}^{(2)}), \]
where \(\mathbf{x}\) is the input vector (length \(d\)), \(\mathbf{h}\) is the hidden layer activation vector (length \(h\)), and \(\hat{\mathbf{y}}\) is the output (for simplicity here, assume a vector of length \(o\) for possibly multiple outputs). \(\mathbf{W}^{(1)}\) is an \(h \times d\) weight matrix for the first layer, \(\mathbf{b}^{(1)}\) is a bias vector of length \(h\), \(\mathbf{W}^{(2)}\) is an \(o \times h\) weight matrix for the second layer, and \(\mathbf{b}^{(2)}\) is bias of length \(o\). The functions \(f(\cdot)\) and \(g(\cdot)\) are activation functions (they could be the same or differ; often \(f\) is a non-linear like ReLU or tanh, while \(g\) might be a sigmoid or softmax appropriate to the task). This formulation can be extended to additional layers in the obvious way.
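To make the matrix notation concrete, the following base-R sketch (with arbitrary dimensions and random weights, purely for illustration) carries out a single forward pass through a one-hidden-layer MLP with ReLU hidden units and a sigmoid output:
# Minimal forward pass of a one-hidden-layer MLP (illustrative values only)
set.seed(1)
d <- 3; h <- 4                         # input and hidden layer sizes
x  <- rnorm(d)                         # one input vector
W1 <- matrix(rnorm(h * d), nrow = h)   # h x d first-layer weights
b1 <- rnorm(h)                         # hidden-layer biases
W2 <- matrix(rnorm(h), nrow = 1)       # 1 x h output-layer weights
b2 <- rnorm(1)                         # output bias
relu    <- function(z) pmax(z, 0)
sigmoid <- function(z) 1 / (1 + exp(-z))
hid  <- relu(W1 %*% x + b1)            # hidden activations: f(W^(1) x + b^(1))
yhat <- sigmoid(W2 %*% hid + b2)       # output: g(W^(2) h + b^(2))
yhat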
Relationship to Traditional Models: It is helpful to note that many traditional statistical models are special cases of neural networks. For instance, a standard linear regression or logistic regression can be seen as a one-layer neural network (no hidden layer) with an identity or sigmoid output activation, respectively. In that sense, neural networks generalize these models by adding depth (hidden layers) and allowing more complex feature transformations. This also means that at small scales, an MLP can mimic a linear model. The advantage comes when there are complex non-linear interactions: an MLP can learn those from data, whereas a linear model would require manually adding interaction terms or nonlinear transformations of inputs. As an example, one study in political science posited that the effects of certain variables on conflict might only manifest in particular combinations – a neural network was able to capture such conditional relationships automatically, whereas a traditional logistic regression required the researcher to explicitly specify interaction terms (Beck, King, & Zeng, 2000).
Implementation Example: A Predictive MLP in R
To illustrate how a feedforward neural network can be applied to a social science problem, consider a hypothetical predictive task: we have a dataset of individuals with two features \(X_1\) and \(X_2\) (which could represent, say, two test scores or socio-economic indicators), and we want to predict a binary outcome \(Y\) (e.g., whether the person will graduate from college). Suppose the true relationship is that \(Y=1\) only if both \(X_1\) and \(X_2\) are above certain thresholds – in other words, the effect of \(X_1\) on the outcome is conditioned on \(X_2\) being high, a form of interaction. A linear model without an interaction term would struggle in this scenario, effectively averaging the effect of each \(X\) across all levels of the other. We will simulate such a scenario and then train both a logistic regression and a neural network to highlight the difference.
First, we simulate a dataset in R:
set.seed(123)
N <- 1000
X1 <- rnorm(N)
X2 <- rnorm(N)
# Define Y such that it's 1 if X1 * X2 > 0 (both positive or both negative)
Y <- ifelse(X1 * X2 > 0, 1, 0)
data <- data.frame(X1, X2, Y = factor(Y))
head(data)
In this simulated data, \(Y=1\) occurs when \(X_1\) and \(X_2\) have the same sign (both above or both below 0), which is a non-linear relationship. A logistic regression that uses \(X_1\) and \(X_2\) as additive terms (no interaction) will effectively find no predictive power (since \(P(Y=1)\approx0.5\) regardless of one variable alone). A neural network with a hidden layer can learn to multiply or otherwise combine \(X_1\) and \(X_2\) to capture this interaction.
We split the data into training and test sets and fit both models:
# Split into training and test sets
train_idx <- sample(1:N, size = 0.7 * N)
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]

# Fit a logistic regression without interaction
glm_model <- glm(Y ~ X1 + X2, data = train_data, family = binomial())

# Fit a neural network (MLP) with one hidden layer
library(keras)
# Define a simple sequential model
mlp_model <- keras_model_sequential() %>%
  layer_dense(units = 4, activation = 'relu', input_shape = 2) %>%   # 2 inputs -> 4 hidden units
  layer_dense(units = 1, activation = 'sigmoid')                     # output layer for binary classification

mlp_model %>% compile(
  optimizer = optimizer_adam(),
  loss = 'binary_crossentropy',
  metrics = 'accuracy'
)

history <- mlp_model %>% fit(
  as.matrix(train_data[, c("X1", "X2")]), as.numeric(train_data$Y) - 1,   # convert factor to {0,1}
  epochs = 50, batch_size = 32, verbose = 0,
  validation_split = 0.2
)
Here we used the Keras API for R to define a simple network: 2 input features, one hidden layer with 4 ReLU neurons, and an output sigmoid neuron for the probability of \(Y=1\). We compile the model with the binary cross-entropy loss (appropriate for binary classification) and the Adam optimizer (an efficient variant of stochastic gradient descent discussed later). We train for 50 epochs (iterations over the data) with a mini-batch size of 32.
Now, we evaluate both models on the test set:
# Predictions and accuracy on test set
glm_probs <- predict(glm_model, newdata = test_data, type = "response")
glm_preds <- ifelse(glm_probs > 0.5, 1, 0)

nn_probs <- mlp_model %>% predict(as.matrix(test_data[, c("X1", "X2")]))
nn_preds <- ifelse(nn_probs > 0.5, 1, 0)

glm_acc <- mean(glm_preds == (as.numeric(test_data$Y) - 1))
nn_acc <- mean(nn_preds == (as.numeric(test_data$Y) - 1))
sprintf("Test Accuracy - Logistic Regression: %.3f, Neural Network: %.3f", glm_acc, nn_acc)
If you run the above code, you will likely observe that the logistic regression achieves an accuracy around 50% (no better than chance), while the neural network is far more accurate (often >90% in this example). The neural network has learned the interaction between \(X_1\) and \(X_2\) implicitly, whereas the simple logistic model without an \(X_1 \times X_2\) term could not capture it. (For fairness, if we included the interaction term \(X1 \cdot X2\) in the logistic model, it would then perform well on this synthetic task – but the point is that the neural network discovered that interaction on its own from the data.)
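For completeness, the fairness check mentioned above can be run directly: refit the logistic regression with an explicit interaction term on the same simulated training data and evaluate it on the test set (a quick sketch, reusing the objects created earlier):
# Logistic regression with an explicit X1:X2 interaction term
glm_int   <- glm(Y ~ X1 * X2, data = train_data, family = binomial())
int_probs <- predict(glm_int, newdata = test_data, type = "response")
int_preds <- ifelse(int_probs > 0.5, 1, 0)
mean(int_preds == (as.numeric(test_data$Y) - 1))   # accuracy should now be far above 0.5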
This toy example mirrors real-world scenarios in social science where outcomes may depend on non-linear combinations of predictors. For instance, perhaps economic development and regime type interact in affecting civil conflict onset – high economic development might reduce conflict risk, but only for democracies and not autocracies. A researcher might not know the exact form of such interactions a priori. Neural networks offer a way to automatically model complex interactions and non-linearities, providing potentially better predictive performance than misspecified linear models (as demonstrated by Beck et al., 2000 in the context of conflict prediction). Of course, the trade-off is that the neural network’s model is less interpretable than a simple logistic regression with a few coefficients. Throughout this chapter, we will return to this trade-off between predictive power and interpretability, and discuss methods to peek inside the “black box” of neural networks.
Before moving on, it’s instructive to examine the learned neural network model. We can inspect the model architecture and number of parameters:
mlp_model %>% summary()
This prints a summary of the model:
Model: "sequential"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
dense_1 (Dense) (None, 4) 12
dense_2 (Dense) (None, 1) 5
================================================================================
Total params: 17
Trainable params: 17
Non-trainable params: 0
________________________________________________________________________________
We see that the model has 17 trainable parameters in total: 12 in the first dense layer (which corresponds to \(2 \times 4\) weights plus 4 biases) and 5 in the second layer (\(4 \times 1\) weight matrix plus 1 bias). Despite this small size, the model was expressive enough to fit the XOR-like pattern. In practice, we often use many more hidden units and possibly multiple hidden layers for harder problems. For example, if we had dozens of input features (survey responses, demographic variables, etc.), a single hidden layer with 4 units might be too limited to capture all interactions, whereas a deeper or wider network could perform better. Deciding on the network architecture (number of layers and units) is an important part of model design and typically requires some experimentation or cross-validation.
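As a sketch of such experimentation (the candidate widths and epoch budget here are arbitrary), one can refit the model from the previous example with different hidden-layer sizes and compare validation accuracy; note that the name of the stored validation metric (val_accuracy vs. val_acc) can differ across keras versions:
# Sketch: compare hidden-layer widths by validation accuracy
units_grid <- c(2, 4, 8, 16)
val_acc <- sapply(units_grid, function(h) {
  model <- keras_model_sequential() %>%
    layer_dense(units = h, activation = 'relu', input_shape = 2) %>%
    layer_dense(units = 1, activation = 'sigmoid')
  model %>% compile(optimizer = optimizer_adam(),
                    loss = 'binary_crossentropy', metrics = 'accuracy')
  hist <- model %>% fit(
    as.matrix(train_data[, c("X1", "X2")]), as.numeric(train_data$Y) - 1,
    epochs = 30, batch_size = 32, validation_split = 0.2, verbose = 0
  )
  tail(hist$metrics$val_accuracy, 1)   # validation accuracy at the final epoch
})
data.frame(hidden_units = units_grid, val_accuracy = round(val_acc, 3))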
8.3 Convolutional Neural Networks (CNNs)
Example: Image Classification with a CNN (in R)
As a demonstration, we will build a simple CNN in R using the keras package to classify images. While our example may use a generic image dataset for illustration (the MNIST dataset of handwritten digits, since it is readily available), one can imagine analogous social science uses – for instance, classifying satellite images of neighborhoods by poverty level, or detecting whether an online profile picture is of a real person vs. a bot.
First, we load image data. We’ll use MNIST (handwritten digits) for a quick example:
library(keras)
mnist <- dataset_mnist()
train_x <- mnist$train$x
train_y <- mnist$train$y
test_x <- mnist$test$x
test_y <- mnist$test$y

# Preprocess: reshape and rescale
train_x <- array_reshape(train_x, c(nrow(train_x), 28, 28, 1)) / 255
test_x <- array_reshape(test_x, c(nrow(test_x), 28, 28, 1)) / 255
train_y <- to_categorical(train_y, 10)
test_y <- to_categorical(test_y, 10)
The above prepares the data: we reshape the images to 28x28 with 1 channel (grayscale), and scale pixel values to [0,1]. The labels are one-hot encoded for 10 classes (digits 0-9).
Now, we define a simple CNN model:
cnn_model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 8, kernel_size = c(3,3), activation = 'relu',
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_conv_2d(filters = 16, kernel_size = c(3,3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_flatten() %>%
  layer_dense(units = 10, activation = 'softmax')

cnn_model %>% compile(
  optimizer = optimizer_adam(),
  loss = 'categorical_crossentropy',
  metrics = 'accuracy'
)

cnn_model %>% summary()
Our CNN has two convolutional layers: the first with 8 filters of size 3x3, the second with 16 filters of size 3x3. Each conv layer is followed by 2x2 max pooling to reduce dimensionality. After the conv layers, we flatten the feature maps and have a dense output layer with 10 units (softmax for multi-class probabilities). The model summary would look like:
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
conv2d_1 (Conv2D) (None, 26, 26, 8) 80
max_pooling2d_1 (MaxPooling2D) (None, 13, 13, 8) 0
conv2d_2 (Conv2D) (None, 11, 11, 16) 1168
max_pooling2d_2 (MaxPooling2D) (None, 5, 5, 16) 0
flatten_1 (Flatten) (None, 400) 0
dense_3 (Dense) (None, 10) 4010
================================================================================
Total params: 5258
Trainable params: 5258
Non-trainable params: 0
________________________________________________________________________________
(We see that the conv layers have relatively few parameters: 8 filters × (3×3 weights + 1 bias) = 80 for the first conv; 16 filters × (3×3×8 inputs + 1 bias) = 1,168 for the second conv. The dense layer has more: 400×10 + 10 biases = 4,010, because by the time we flatten we have 5×5×16 = 400 features.)
We can train this model on the MNIST data (for brevity, we use only 1 epoch here):
history <- cnn_model %>% fit(
  train_x, train_y,
  epochs = 1, batch_size = 128,
  validation_split = 0.2
)
Even with 1 epoch on MNIST, the model will likely achieve high accuracy (MNIST is an “easy” dataset for CNNs, often >90% after just one epoch). After training, we evaluate on the test set:
scores <- cnn_model %>% evaluate(test_x, test_y, verbose = 0)
cat("Test accuracy:", scores["accuracy"], "\n")
This should report an accuracy (likely around 0.95 with more training epochs on MNIST). The purpose of this example is to show the structure and code for a CNN.
In a social science context, one would rarely train a CNN from scratch on a small image dataset – instead, as mentioned, one would use transfer learning. In R, the keras package makes it easy to download a pretrained model (e.g., application_resnet50(weights="imagenet") gives a ResNet-50 model pretrained on ImageNet). You can then remove the top layer and add your own output layer for your specific classification, freeze the earlier layers, and fine-tune on your data. The Pew Research project referenced earlier did essentially this: they leveraged a ResNet model pretrained on millions of images, which had already learned to detect general features like edges and textures, and only had to train the final layers on their tens of thousands of labeled images. Their deep learning model achieved around 90% classification accuracy in distinguishing men vs. women in images, illustrating the practicality of CNNs even when a social science team has limited training data – by re-using “knowledge” from large-scale datasets.
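The sketch below outlines that transfer-learning workflow in keras, assuming 224x224 RGB images and a binary label; the data objects train_images and train_labels are placeholders rather than real data, and the head architecture is merely illustrative:
# Sketch: fine-tuning a pretrained ResNet-50 (placeholder data objects)
base <- application_resnet50(weights = "imagenet", include_top = FALSE,
                             input_shape = c(224, 224, 3))
freeze_weights(base)                               # keep the pretrained filters fixed

inputs  <- layer_input(shape = c(224, 224, 3))
outputs <- inputs %>%
  base() %>%
  layer_global_average_pooling_2d() %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')   # e.g., man vs. woman
model <- keras_model(inputs, outputs)

model %>% compile(optimizer = optimizer_adam(),
                  loss = 'binary_crossentropy', metrics = 'accuracy')
# model %>% fit(train_images, train_labels, epochs = 5, validation_split = 0.2)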
Beyond image classification, CNNs have also been used for tasks like text classification via 1D convolutions (which slide over sequences of words or characters). For example, one could build a model to classify tweets as containing hate speech or not. A simplified approach might convert each tweet into a sequence of word embeddings (vectors), then apply a convolutional filter that detects specific phrases or word combinations indicative of hate speech, followed by pooling and a dense layer for classification. Such models have been shown to perform well in natural language processing tasks and can be trained in R using keras or torch (with libraries like text2vec to obtain embeddings). In our context, a full example is beyond scope, but the implementation would mirror the structure we showed (just with 1D conv layers and an embedding layer for text).
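A sketch of such a text CNN in keras is shown below, assuming the tweets have already been tokenized into integer sequences of length 50 (padded_seqs) drawn from a 10,000-word vocabulary with binary labels (labels); all names and sizes are placeholders:
# Sketch: 1D CNN for text classification (placeholder inputs: padded_seqs, labels)
text_cnn <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 64, input_length = 50) %>%  # word embeddings
  layer_conv_1d(filters = 32, kernel_size = 5, activation = 'relu') %>%       # phrase detectors
  layer_global_max_pooling_1d() %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')                              # hate speech vs. not
text_cnn %>% compile(optimizer = optimizer_adam(),
                     loss = 'binary_crossentropy', metrics = 'accuracy')
# text_cnn %>% fit(padded_seqs, labels, epochs = 5, validation_split = 0.2)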
8.4 Recurrent Neural Networks (RNNs) and LSTM
Theory of Sequence Modeling
Many social science data have an inherent sequential or temporal structure: speeches composed of sequences of words, individuals’ life histories composed of sequences of events, longitudinal panel data on voters, or time series of economic indicators. Recurrent Neural Networks (RNNs) are neural architectures designed to handle sequence data by maintaining a form of memory of past inputs. Unlike feedforward nets that assume all inputs are independent, RNNs share parameters across time steps and have connections that form directed cycles (hence “recurrent”), allowing information to persist.
In a basic RNN (often called a “simple RNN” or Elman network), at each time step \(t\) the network takes an input vector \(x_t\) and the previous hidden state \(h_{t-1}\), and produces a new hidden state \(h_t\) as a function:
\[ h_t = f(W \, x_t + U \, h_{t-1} + b), \]
where \(W\) and \(U\) are weight matrices for the input and recurrent connections, and \(f(\cdot)\) is typically a non-linearity like tanh. The hidden state \(h_t\) can be thought of as a summary of all inputs seen up to time \(t\). If we want to produce an output (for example, predicting the next word in a sentence or labeling the sequence), an output \(y_t\) can be computed as
\[ y_t = g(V \, h_t), \]
for some output weight matrix \(V\). The key point is that the same weights \(W, U, V\) are used at every time step, enabling the network to generalize to sequence lengths beyond what it was trained on.
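The base-R sketch below (with arbitrary dimensions and random weights) makes this parameter sharing explicit: the same W, U, and b are applied at every time step while the hidden state carries information forward:
# Minimal simple-RNN forward pass in base R (illustrative weights)
set.seed(1)
p <- 2; k <- 3                          # input size and hidden size
W <- matrix(rnorm(k * p), k, p)         # input weights
U <- matrix(rnorm(k * k), k, k)         # recurrent weights
b <- rnorm(k)
rnn_step <- function(x_t, h_prev) tanh(W %*% x_t + U %*% h_prev + b)
x_seq <- matrix(rnorm(5 * p), nrow = 5) # a sequence of 5 input vectors
h <- rep(0, k)                          # initial hidden state
for (t in 1:nrow(x_seq)) {
  h <- rnn_step(x_seq[t, ], h)          # same weights reused at every step
}
h                                       # final hidden state summarizes the sequence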
However, simple RNNs suffer from vanishing and exploding gradient problems when dealing with long sequences – as we backpropagate the error through many time steps (a process known as Backpropagation Through Time), gradients can shrink or blow up, making it hard to learn long-range dependencies. To address this, more sophisticated recurrent architectures were developed, most notably the Long Short-Term Memory (LSTM) network (Hochreiter & Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) (Cho et al., 2014).
LSTM introduces an internal cell state \(c_t\) and a set of gating mechanisms that regulate information flow: an input gate, a forget gate, and an output gate. These gates (each implemented with a sigmoid activation) determine which information to add to the cell state, what to forget from it, and how much of it to output to the hidden state. In equations, for an LSTM unit one might define:
- Input gate: \(i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)\)
- Forget gate: \(f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)\)
- Output gate: \(o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)\)
- New memory candidate: \(\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)\)
- Updated cell state: \(c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\)
- Hidden state: \(h_t = o_t \odot \tanh(c_t)\)
(where \(\odot\) denotes elementwise multiplication). While the details can be intimidating, the intuitive idea is that the forget gate controls what information from the past to discard, the input gate controls what new information to store in the cell, and the output gate controls what information from the cell to send out to the next step. The cell state \(c_t\) acts as a conveyor of long-term information with linear interactions (just additions and multiplications by gates), which helps preserve gradients. LSTMs are thus capable of maintaining long-term dependencies – their internal design explicitly tackles the vanishing gradient problem by allowing gradients to flow unchanged where needed. In intuitive terms, an LSTM can learn to remember or forget. For example, if analyzing text, an LSTM might learn to “remember” a negation word like “not” and carry that influence until it encounters the word being negated, then “forget” it thereafter.
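The gate equations translate almost line by line into the base-R sketch below; the parameter matrices are bundled in a list p, and all dimensions and values are purely illustrative:
# One LSTM step in base R, mirroring the gate equations (illustrative parameters)
lstm_step <- function(x_t, h_prev, c_prev, p) {
  i <- plogis(p$W_i %*% x_t + p$U_i %*% h_prev + p$b_i)       # input gate
  f <- plogis(p$W_f %*% x_t + p$U_f %*% h_prev + p$b_f)       # forget gate
  o <- plogis(p$W_o %*% x_t + p$U_o %*% h_prev + p$b_o)       # output gate
  c_tilde <- tanh(p$W_c %*% x_t + p$U_c %*% h_prev + p$b_c)   # candidate memory
  c_t <- f * c_prev + i * c_tilde                             # updated cell state
  h_t <- o * tanh(c_t)                                        # hidden state
  list(h = h_t, c = c_t)
}
set.seed(1)
k <- 2; d <- 1                                                # hidden size and input size
mk <- function(r, c) matrix(rnorm(r * c), r, c)
p <- list(W_i = mk(k, d), U_i = mk(k, k), b_i = rnorm(k),
          W_f = mk(k, d), U_f = mk(k, k), b_f = rnorm(k),
          W_o = mk(k, d), U_o = mk(k, k), b_o = rnorm(k),
          W_c = mk(k, d), U_c = mk(k, k), b_c = rnorm(k))
lstm_step(x_t = 0.5, h_prev = rep(0, k), c_prev = rep(0, k), p)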
Diagram of a single LSTM cell, which maintains an internal cell state \(c_t\) and uses input (\(\sigma\)), forget (\(\sigma\)), and output (\(\sigma\)) gates (orange = learned neural layers, yellow = pointwise operations) to regulate information flow.
Sequence-to-Sequence and Other Variants: RNNs and LSTMs can be used not only for one-output-per-time-step tasks (like language modeling where you predict the next word given previous words), but also for sequence-to-sequence tasks. In sequence-to-sequence models (Seq2Seq), one RNN (the encoder) processes an input sequence into a final hidden state, and then another RNN (the decoder) generates an output sequence from that state. This is used in machine translation (encode a sentence in French, decode a sentence in English) and could likewise be used in social science for tasks like open-ended survey response summarization or modeling event sequences (encode the past trajectory of a country’s economic indicators, decode the future trajectory).
RNNs in Social Science: Applications include:
- Text analysis: RNNs (especially LSTMs or GRUs) have been widely used for text classification, sentiment analysis, or more complex tasks like stance detection in political texts. An LSTM can capture the sequence of words, which is important because word order matters for meaning. For instance, “X supports Y” versus “Y supports X” convey different relationships; a bag-of-words model might miss this distinction, but an RNN can learn it. Researchers have applied LSTMs to legislative speech transcripts, news articles, and social media posts to classify topics or sentiment while accounting for syntax and context over sentences.
- Event history analysis: One could use RNNs to model sequences of events (e.g., a sequence of protest events in different cities, or a sequence of legislative actions over time). The RNN can potentially pick up patterns like “after event A, event B tends to occur within 3 days” or detect seasonal trends in event data. For example, a study might model a country’s monthly conflict events as a sequence and use an LSTM to forecast future conflict risk from the history.
- Time-series prediction: In economics, demography, or sociology, we often have time series data (e.g., monthly unemployment rates, yearly population counts, daily counts of COVID cases). RNNs or LSTMs can be trained to forecast these series, possibly capturing non-linear patterns or regime shifts that traditional ARIMA models might not. They can also incorporate multiple input series (multivariate time series) and learn complex joint dynamics.
- Panel data: Panel data (repeated observations of many units over time) can also be approached with RNNs by treating each unit’s data as a sequence. For example, an LSTM could be used to predict an individual’s future health status from their longitudinal medical history, or to predict a country’s future GDP given its yearly economic indicators, capturing unit-specific temporal dependencies. (There is also research on sequence embedding where each unit’s sequence is converted to a fixed-length vector via an RNN encoder, which can then be used as features in downstream analyses.)
It is worth noting that in recent years, Transformer models have overtaken RNNs in many sequence modeling tasks (especially in NLP) due to their efficiency in capturing long-range dependencies via self-attention. Transformers (Vaswani et al., 2017) dispense with recurrence entirely and instead use parallelizable attention mechanisms to achieve superior performance on language tasks. While they are beyond our current scope, they represent an advanced tool that social scientists may explore for text analysis or other sequence tasks (for example, using pretrained language models like BERT or GPT to obtain rich text embeddings for survey responses). That said, RNNs and LSTMs remain useful and are easier to train on smaller datasets, so we focus on them here as foundational tools.
Example: Sequence Prediction with LSTM in R
For a concrete example, we will use an LSTM to model a simple time series. Consider a scenario in social science where we have a monthly indicator (say, an index of social unrest intensity) and we want to predict future values based on past values. We’ll simulate a pattern (for illustration, a sine wave with noise could represent a seasonal oscillation in unrest).
set.seed(123)
T <- 200   # length of series
t <- 1:T
y <- sin(0.1 * t) + rnorm(T, sd = 0.1)   # base sine wave plus noise

# Prepare training sequences for LSTM
timesteps <- 10
X <- array(0, dim = c(T - timesteps, timesteps, 1))
Y <- array(0, dim = c(T - timesteps))
for (i in 1:(T - timesteps)) {
  X[i, , 1] <- y[i:(i + timesteps - 1)]
  Y[i] <- y[i + timesteps]   # next value to predict
}

# Split into train and test (e.g., first 160 for training, last 30 for testing)
train_size <- 160
X_train <- X[1:train_size, , , drop = FALSE]   # keep the 3D shape expected by the LSTM
Y_train <- Y[1:train_size]
X_test <- X[(train_size + 1):(T - timesteps), , , drop = FALSE]
Y_test <- Y[(train_size + 1):(T - timesteps)]
We created overlapping sequences of length 10 (each sequence is the past 10 time points) and the target is the next value. Now we define an LSTM model to predict the next value from the past 10:
lstm_model <- keras_model_sequential() %>%
  layer_lstm(units = 16, input_shape = c(timesteps, 1)) %>%
  layer_dense(units = 1)

lstm_model %>% compile(
  optimizer = 'adam',
  loss = 'mse'
)

lstm_model %>% summary()
The model has an LSTM layer with 16 units, followed by a dense layer. The summary will show something like:
________________________________________________________________
Layer (type) Output Shape Param #
================================================================
lstm_1 (LSTM) (None, 16) 1152
dense_4 (Dense) (None, 1) 17
================================================================
Total params: 1169
Trainable params: 1169
We see 1,169 parameters, consistent with the formula for LSTM parameters: for an LSTM with \(k\) units and input size \(p\), the parameter count is \(4k(k + p + 1)\) (because of the four sets of weights and biases for the input, forget, output, and cell pathways). Here \(k=16\) and \(p=1\), so \(4 \times 16 \times (16 + 1 + 1) = 4 \times 16 \times 18 = 1{,}152\) for the LSTM layer, plus 17 for the dense layer.
Now we train the model:
history <- lstm_model %>% fit(
  X_train, Y_train,
  epochs = 30, batch_size = 16,
  validation_split = 0.1, verbose = 0
)
After training, let’s evaluate its performance on the test set and inspect a few predictions vs actual:
preds <- lstm_model %>% predict(X_test)
# Compare first 5 predictions with actual values
print(round(head(cbind("Predicted" = preds[, 1], "Actual" = Y_test), 5), 3))
If the model has learned the pattern, the predictions should roughly follow the sine wave trend. Even if not perfect (due to noise), the LSTM likely captures the oscillation better than a trivial baseline. In a real social science application, this could correspond to forecasting something like monthly protest counts given the past 10 months of data, implicitly capturing temporal dependencies and seasonality.
Remark on Interpretation: While the LSTM can model such sequences, interpreting what exactly it has learned (which patterns in the sequence trigger an increase or decrease) is not straightforward. There are techniques such as examining the learned cell states or using sequence saliency methods (to see which parts of the input sequence most influenced the prediction), but these are more specialized. For many pure prediction tasks, a black-box forecast might be acceptable. However, if policy decisions depend on understanding why the model predicts a surge in unrest, one might need to combine these models with more interpretable approaches or incorporate domain knowledge to validate the patterns detected.
In R, one might also consider the torch package for sequence models, which provides an R interface to the PyTorch library and can be used to build custom RNNs, LSTMs, or Transformers with more low-level control. The high-level Keras API, as used above, is often sufficient for many applications, but torch can be useful for advanced research requiring custom architectures.
8.5 Training Neural Networks: Key Concepts
Having introduced the main architectures (MLP, CNN, RNN/LSTM), we now turn to how neural networks are trained and optimized, and how to ensure they generalize well. Training a neural network means finding parameters (weights and biases) that minimize a certain loss function on the training data. This is a high-dimensional optimization problem, typically solved by gradient-based methods. We will discuss:
- Data Preprocessing – preparing inputs for effective training.
- Loss Functions – what objective we optimize.
- Backpropagation and Gradient Descent – how we optimize the objective.
- Optimization Algorithms – variants like SGD, Momentum, Adam.
- Regularization Techniques – methods to prevent overfitting (dropout, L2, etc.).
- Monitoring and Tuning – using validation sets, avoiding overfitting, adjusting hyperparameters.
Data Preprocessing
Neural networks can be sensitive to the scale and encoding of input data. It is generally important to standardize or normalize features before feeding them into the network. For example, continuous variables are often standardized to mean 0 and standard deviation 1, or scaled to [0,1]. If features are on very different scales, the network may have trouble learning (the gradients for one feature might dominate). In our earlier examples, we normalized images to [0,1], and one would likewise scale numerical covariates in tabular data. Categorical variables need to be encoded – typically via one-hot encoding if unordered, or possibly via embedding layers if there are many categories and we want the model to learn a dense representation for each.
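For tabular data, a short sketch of this kind of preprocessing might look as follows, assuming a data frame df with numeric columns age and income and an unordered factor region (all placeholder names); in practice the scaling means and standard deviations should be computed on the training split only and then reused for the test data:
# Sketch: standardize numeric features and one-hot encode a factor (placeholder df)
df$age_z    <- scale(df$age)                               # mean 0, sd 1
df$income_z <- scale(df$income)
region_dummies <- model.matrix(~ region - 1, data = df)    # one-hot encoding
X <- cbind(df$age_z, df$income_z, region_dummies)          # numeric matrix ready for keras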
For text data, preprocessing involves tokenization (breaking text into words or subwords), handling of vocabulary (perhaps limiting to the top N most frequent words or using pre-trained word embeddings), and padding/truncating sequences to a fixed length for batch processing. For network or spatial data, one might need to construct adjacency matrices or coordinate grids. In all cases, careful preprocessing is crucial; poor handling can significantly degrade performance or make training unstable.
Loss Functions and Evaluation Metrics
The loss function (also called cost function) quantifies the error of the model’s predictions against the true values. The choice of loss depends on the task:
- For binary classification, the typical loss is binary cross-entropy (also known as log loss). If \(\hat{p}_i\) is the predicted probability for instance \(i\) belonging to class 1 (and \(y_i \in \{0,1\}\) is the true label), the binary cross-entropy loss for that instance is \(-[\,y_i \log \hat{p}_i + (1-y_i) \log(1-\hat{p}_i)\,]\). The model training aims to minimize this, which pushes \(\hat{p}_i\) close to 1 for \(y_i=1\) and close to 0 for \(y_i=0\). (We used this loss in the MLP example via binary_crossentropy.)
- For multi-class classification, we generalize to categorical cross-entropy. If the model outputs a probability distribution \(\hat{\mathbf{p}}_i\) across classes for instance \(i\), and the true label is a one-hot vector \(\mathbf{y}_i\), then the loss is \(- \sum_{c} y_{i,c} \log \hat{p}_{i,c}\). Typically we use a softmax on the output layer so that \(\hat{\mathbf{p}}_i\) sums to 1.
- For regression (predicting a continuous outcome), a common loss is the mean squared error (MSE) or sometimes mean absolute error (MAE). MSE, \((\hat{y}_i - y_i)^2\), has nice mathematical properties (differentiable everywhere) and is related to assuming Gaussian noise on the output; MAE, \(|\hat{y}_i - y_i|\), is more robust to outliers (but the absolute value is less smooth at 0 for optimization).
- There are more specialized losses for specific purposes: e.g., hinge loss for SVM-like training, cosine proximity for certain similarity tasks, or custom losses for imbalanced data (like adding class weights or using focal loss in object detection tasks).
It’s important to distinguish the loss used for training from the evaluation metrics we care about. For instance, we might train a model with cross-entropy loss but evaluate it with accuracy, F1-score, AUC, etc., to judge its performance. The training process directly optimizes the loss, not necessarily the metric (though often lowering the loss improves the metric). Sometimes there is a trade-off; for example, accuracy is not sensitive to class probabilities unless they cross 0.5, whereas cross-entropy heavily penalizes overconfident wrong answers. Thus, one might get higher accuracy but still have room to improve calibration as reflected in cross-entropy.
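This distinction can be seen by computing both quantities by hand. In the sketch below (hypothetical predicted probabilities), the two prediction vectors have identical accuracy, but the overconfident wrong prediction is penalized much more heavily by cross-entropy:
# Same accuracy, different cross-entropy
y      <- c(1, 0, 1, 0)                  # true labels
p_mild <- c(0.7, 0.4, 0.6, 0.55)         # wrong on the last case, but not confident
p_bold <- c(0.7, 0.4, 0.6, 0.99)         # wrong on the last case, very confident
cross_entropy <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))
accuracy      <- function(y, p) mean((p > 0.5) == y)
c(acc_mild = accuracy(y, p_mild), acc_bold = accuracy(y, p_bold))           # both 0.75
c(ce_mild = cross_entropy(y, p_mild), ce_bold = cross_entropy(y, p_bold))   # second is much larger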
Backpropagation and Gradient Descent
Backpropagation is the core algorithm for training neural networks. It is a method for computing the gradient of the loss function with respect to all the network’s parameters efficiently, by propagating the error backward through the network. Conceptually:
- Forward pass: Take an input \(x\) and compute the network’s output \(\hat{y}\) and loss \(L(\hat{y}, y)\) for the true target \(y\).
- Backward pass: Compute gradients of the loss with respect to the output (using the chain rule of calculus), then with respect to the parameters of the last layer, then the previous layer, and so on, moving backwards. Each layer’s gradients are computed based on the gradients from the layer above (its output).
- Update weights: Use these gradients in an optimization step to adjust the parameters in the direction that most reduces the loss.
For a single weight \(w\) in the network, backprop gives us \(\frac{\partial L}{\partial w}\), the direction in which changing \(w\) would increase or decrease the loss. Once we have these gradients, we use an optimization algorithm (like gradient descent) to update the weights in the opposite direction of the gradient (to reduce the loss).
In practice, we use stochastic gradient descent (SGD) or its variants. Rather than computing gradients on the entire dataset (which would be standard gradient descent), we compute on mini-batches of data. For example, with a batch size of 32, we take 32 examples, do forward passes, compute the average loss and gradients, then update weights, and move to the next batch. This stochastic approach introduces noise into the gradient estimates but is much faster and often helps escape shallow local minima. One epoch is one full pass through the training data in mini-batches.
Mathematically, a weight update with vanilla SGD looks like:
\[ w := w - \eta \, \frac{\partial L}{\partial w}, \]
where \(\eta\) is the learning rate, a small positive scalar that controls the step size. The learning rate is a crucial hyperparameter – too large and training may diverge or oscillate, too small and convergence will be very slow or get stuck in a suboptimal point.
Batch size is another important hyperparameter: smaller batches give noisier gradients but can generalize better (and use less memory), while larger batches give more precise gradient estimates but require more memory and can sometimes get stuck in sharp minima. A common heuristic is to use the largest batch size that fits in GPU memory (for efficiency), but in some cases small batches work better. Practitioners often try a few values (32, 64, 128, etc.) to see what yields good validation performance.
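To make these mechanics concrete, the following base-R sketch (simulated data, purely illustrative) runs mini-batch SGD for a simple logistic regression, applying exactly the update rule above once per mini-batch:
# Mini-batch SGD for logistic regression: w := w - eta * dL/dw
set.seed(1)
n <- 500
x <- cbind(1, rnorm(n))                          # intercept plus one feature
y <- rbinom(n, 1, plogis(1 - 2 * x[, 2]))        # true coefficients: 1 and -2
w <- c(0, 0)                                     # initialize weights
eta <- 0.1                                       # learning rate
batch_size <- 32
for (epoch in 1:20) {
  idx <- sample(n)                               # shuffle each epoch
  for (b in seq(1, n, by = batch_size)) {
    rows <- idx[b:min(b + batch_size - 1, n)]
    p    <- plogis(x[rows, ] %*% w)              # forward pass on the mini-batch
    grad <- t(x[rows, ]) %*% (p - y[rows]) / length(rows)   # gradient of cross-entropy loss
    w    <- w - eta * grad                       # SGD update
  }
}
round(w, 2)                                      # should approach the true coefficients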
Advanced Optimization Algorithms
Several enhancements to basic SGD have been developed to improve convergence speed and stability:
- Momentum: This technique helps accelerate SGD in the right directions and dampen oscillations in the wrong directions. It does so by maintaining a velocity vector that is an exponentially decaying average of past gradients. The update becomes: \(v := \alpha v + \eta \nabla_w L\), and then \(w := w - v\), where \(\alpha \in [0,1)\) is the momentum coefficient (e.g., 0.9). Momentum accumulates gradient contributions in persistent directions, effectively allowing the optimizer to build up speed on gentle slopes and not get stuck oscillating on steep but narrow ravines (a common issue when gradients in one dimension are much larger than in another).
- Adaptive Learning Rates: Optimizers like AdaGrad, RMSProp, Adam etc., adjust the learning rate for each parameter individually, based on the history of gradients for that parameter. AdaGrad (2011) accumulates the squared gradients for each parameter and divides the learning rate by the sqrt of this accumulated sum. This means parameters that have large gradients (and thus large accumulated squared gradients) get their effective learning rate reduced over time, which is good for dealing with sparse features but can lead to excessive decay of learning rates. RMSProp (Hinton, circa 2012) modifies AdaGrad by using a moving average of squared gradients (to forget very old gradients), maintaining a per-parameter learning rate that adapts to recent gradient magnitudes. Adam (Adaptive Moment Estimation, Kingma & Ba, 2015) combines the ideas of momentum and RMSProp – it keeps an exponentially decaying average of past gradients (like momentum) and of past squared gradients (like RMSProp). Adam computes parameter updates as \(m_t / (\sqrt{v_t} + \epsilon)\) where \(m_t\) is the first moment (gradient mean) and \(v_t\) the second moment (uncentered variance), with bias correction for initial timesteps. Adam has become one of the most popular optimizers due to its generally good performance and ease of use (it often requires less tuning of the learning rate compared to plain SGD).
In R’s keras or torch, one can specify optimizer_adam() or others to use these algorithms. A typical workflow is to start with Adam (with default settings), as it usually works out of the box, and perhaps later experiment with SGD+momentum for possibly better generalization or to reproduce results from the literature. Some recent research suggests that for very large datasets, plain SGD with momentum might yield slightly better generalization than Adam (which can sometimes overfit), but in the moderate-data regime of many social science problems, Adam’s robustness is a boon.
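For example, switching between these optimizers in keras is just a matter of changing the compile() call; the sketch below uses a generic model object and typical starting values (the learning-rate argument is named learning_rate in recent keras versions, lr in older ones):
# Same model, two optimizer choices (illustrative settings)
model %>% compile(optimizer = optimizer_adam(),       # adaptive; usually works with defaults
                  loss = 'binary_crossentropy', metrics = 'accuracy')
model %>% compile(optimizer = optimizer_sgd(learning_rate = 0.01, momentum = 0.9),
                  loss = 'binary_crossentropy', metrics = 'accuracy')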
Regularization and Overfitting
Neural networks are highly flexible models with a potentially huge number of parameters. This flexibility means they can easily overfit – i.e., memorize the training data – especially when data are limited. Regularization refers to techniques that constrain the model to improve generalization performance on unseen data.
Key regularization techniques in neural networks include:
- Penalty Terms (Weight Decay): The most common is \(L_2\) regularization, which adds a term \(\lambda \sum w^2\) to the loss (summing over all weights), discouraging large weights. This is equivalent to a Gaussian prior on weights and is known as weight decay in the neural network context. In practice, one specifies a weight decay parameter \(\lambda\). A smaller \(\lambda\) means little regularization; a larger \(\lambda\) forces weights towards 0 (simpler models). Weight decay tends to make the network weights smaller in magnitude, which often improves generalization by keeping the model closer to linear in behavior.
- Dropout: Dropout is a popular and very effective regularization trick introduced by Srivastava et al. (2014). The idea is to randomly “drop out” (set to zero) a fraction of the units in a layer during each training iteration. For example, with dropout rate 0.5, each hidden neuron is independently dropped with probability 0.5 at each update. This prevents the network from relying too much on any single feature or from co-adapting neurons too tightly, essentially forcing a form of ensemble of sub-networks. At test time, all units are used but their activations are scaled down by the dropout rate (to account for the missing ones during training). Dropout often significantly reduces overfitting and has become standard, especially in fully connected layers of networks.
- Early Stopping: This is a simple yet powerful regularization approach: monitor performance on a validation set during training and stop training when performance on validation data stops improving (or starts worsening). The idea is that at the point of minimal validation loss, the model is optimally generalized, whereas training longer would just make it fit noise in the training set. The weights at that point are then taken as the final model. Early stopping essentially treats the number of training epochs as a hyperparameter to tune (automatically, during one run). In Keras, one can use callback_early_stopping(patience=...) to implement this (see the combined sketch after this list).
- Batch Normalization: Though primarily introduced to help optimization by normalizing layer inputs, batch normalization (Ioffe & Szegedy, 2015) can also have a regularizing effect. It reduces internal covariate shift by normalizing the activations of each layer for each mini-batch, then scaling and shifting them by learned parameters. This can allow higher learning rates and often reduces the need for dropout in some architectures (e.g., CNNs). Batch norm adds a bit of noise due to mini-batch estimation, acting as a regularizer.
- Data Augmentation: Especially relevant for image and text data, augmenting the training data with label-preserving transformations acts as regularization by injecting more variety. For images, this could be random rotations, flips, crops, color jitters (commonly used in computer vision to expand training sets). For text, one might replace words with synonyms, or slightly perturb sentences (though one must be careful to preserve meaning). In social science contexts, augmentation can sometimes be domain-specific (e.g., adding noise to economic indicators to simulate measurement error or using bootstrapping on small datasets).
- Others: There are other approaches like \(L_1\) regularization (which encourages sparsity in weights, leading some weights to exactly zero), max-norm constraints (bounding the norm of incoming weights for each neuron), early removal of overfit neurons (pruning), or more recent techniques like DropConnect (dropout on weights rather than activations), label smoothing (smoothing the hard 0/1 labels to soft targets to prevent overconfidence), and so on. But the ones above are usually sufficient in practice.
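The sketch below combines several of these techniques – an L2 penalty, dropout, and early stopping – in a single keras model, reusing the simulated train_data from the MLP example; the layer sizes, penalty, and dropout rate are illustrative rather than recommended values:
# Sketch: L2 weight decay, dropout, and early stopping combined
reg_model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = 'relu', input_shape = 2,
              kernel_regularizer = regularizer_l2(0.001)) %>%   # L2 penalty on weights
  layer_dropout(rate = 0.5) %>%                                 # randomly drop half the units during training
  layer_dense(units = 1, activation = 'sigmoid')
reg_model %>% compile(optimizer = optimizer_adam(),
                      loss = 'binary_crossentropy', metrics = 'accuracy')
reg_model %>% fit(
  as.matrix(train_data[, c("X1", "X2")]), as.numeric(train_data$Y) - 1,
  epochs = 100, batch_size = 32, validation_split = 0.2, verbose = 0,
  callbacks = list(callback_early_stopping(patience = 5, restore_best_weights = TRUE))
)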
The balance between underfitting and overfitting is often visualized by plotting the training and validation loss over epochs. Initially, both go down as the model fits general patterns. At some point, the validation loss reaches a minimum and then starts to increase even as training loss keeps decreasing – that’s classic overfitting setting in. Regularization aims to delay or reduce that gap. With strong regularization, the model might underfit (both losses high, or validation never goes down much), so one must tune the amount.
Model Training and Hyperparameter Tuning
Training neural networks is as much an art as a science. One often needs to experiment with:
- Architecture hyperparameters: number of layers, number of units per layer, filter sizes, etc.
- Training hyperparameters: learning rate (most crucial), batch size, number of epochs, choice of optimizer, learning rate schedules (reducing the learning rate after plateauing, etc.).
- Regularization hyperparameters: dropout rate, weight decay coefficient, etc.
- Initialization: Modern libraries handle weight initialization well (e.g., Glorot/Xavier initialization for symmetric activations), but occasionally one might adjust initialization if using certain activations like sigmoid (to avoid saturation at start).
It is essential to use a validation set to tune these (or techniques like cross-validation if data are extremely scarce, though cross-validating deep nets is computationally heavy). In R's keras, one can specify validation_split or supply a validation_data argument to the fit() function to automatically track validation metrics. The keras API also provides callbacks, such as callback_early_stopping() to implement early stopping and callback_reduce_lr_on_plateau() to reduce the learning rate if validation loss stalls, which help automate some of this tuning.
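As a minimal sketch of how these pieces fit together (assuming a compiled model mlp_model and numeric training arrays x_train and y_train, which are placeholder names here):
# Fit with a validation split and tuning callbacks
# (`mlp_model`, `x_train`, `y_train` are placeholders for objects created earlier)
history <- mlp_model %>% fit(
  x_train, y_train,
  epochs = 100,
  batch_size = 32,
  validation_split = 0.2,   # hold out 20% of the training data to track validation metrics
  callbacks = list(
    callback_early_stopping(monitor = "val_loss", patience = 10,
                            restore_best_weights = TRUE),        # stop when val_loss stops improving
    callback_reduce_lr_on_plateau(monitor = "val_loss",
                                  factor = 0.5, patience = 5)    # halve the learning rate on plateaus
  )
)
plot(history)   # inspect training vs. validation curves over epochs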
A common strategy is:
- Start with a relatively simple model and get it to “learn something” (ensure the training loss decreases and it beats a trivial baseline on validation).
- If underfitting (validation and training loss both high), increase capacity (more layers/units) or train longer or adjust learning rate.
- If overfitting (training loss much lower than validation), add regularization (dropout, etc.) or reduce capacity.
- Adjust the learning rate carefully: it often needs to be tuned on a log scale (e.g., 0.1, 0.01, 0.001, …). Sometimes a learning rate that is too low will cause extremely slow convergence, giving the impression of underfitting, whereas a slightly higher rate would converge nicely.
- Use learning rate schedules: often you can start with a relatively high learning rate and reduce it as training progresses, either manually or via a schedule such as exponential decay or step decay when validation performance plateaus (a minimal example follows this list).
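If a custom schedule is preferred over the plateau-based callback, a simple decay rule can be passed to callback_learning_rate_scheduler() (the cutoff epoch and decay factor below are arbitrary assumptions):
# Manual schedule sketch: keep the initial rate for 10 epochs, then shrink it by 10% per epoch
lr_schedule <- function(epoch, lr) {
  if (epoch < 10) lr else lr * 0.9
}
# Pass it to fit() via: callbacks = list(callback_learning_rate_scheduler(lr_schedule))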
Monitoring metrics like accuracy alongside loss can also be informative: sometimes the loss might decrease while accuracy plateaus (indicating the model is getting more confident on the same predictions), or vice versa.
In summary, training a neural network is an iterative process of configure → train → evaluate → adjust. Modern deep learning frameworks significantly lower the barrier to trying different configurations quickly, which is a big reason for the rapid progress in the field. For social scientists, this means one can be empirically guided – try a network, see how it performs, and iterate – much as one might do with choosing specifications in a regression (though the “specifications” space is much larger for a neural net!). The final model chosen should ideally be the one that performs best on held-out data. As a sanity check, one should also compare it with simpler models; if a straightforward logistic regression or random forest is performing just as well, the added complexity of a neural net might not be warranted.
8.6 Interpretability and Explainability
One of the major concerns in applying neural networks to social science problems is the interpretability of the models. Social scientists are typically interested not only in making accurate predictions, but also in understanding the relationships between variables, uncovering latent patterns, and providing explanations that are convincing to stakeholders or policymakers. Traditional statistical models (like linear regression or decision trees) offer transparent relationships – e.g., coefficients or splits that can be directly interpreted. Neural networks, in contrast, are often criticized as “black boxes”: their predictions result from complex, layered computations that do not yield simple, direct explanations.
However, a growing field of eXplainable AI (XAI) has developed tools to interpret and explain neural network predictions. Here we outline approaches to interpretability that can make neural network results more transparent in a social science context:
- Feature Importance and Attribution: Methods like SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) provide ways to estimate the importance of input features for a given prediction. SHAP values are based on cooperative game theory (the concept of Shapley values) and represent the contribution of each feature to the difference between the model’s prediction and a baseline expectation. LIME, on the other hand, fits a simple interpretable model (like a sparse linear model) locally around the prediction to approximate the neural network’s behavior. For example, if a neural net predicts that a certain individual will have a high income, SHAP values could tell us that the individual’s education level and years of experience were strong positive contributors, while the local unemployment rate was a negative contributor, aligning the explanation with domain expectations. LIME could show a small linear model for that individual where, say, education=Master’s contributes +15% chance of high income, experience > 5 years contributes +10%, and high local unemployment contributes -5%, etc., illustrating in simple terms why the neural net made its prediction.
- Saliency and Input Sensitivity: For image or text models, one can compute saliency maps or attention weights that highlight what parts of the input the network focused on. In images, saliency maps (essentially the gradient of the output w.r.t. input pixels) can show which regions of an image influenced the prediction most. For example, a CNN predicting “protest” vs “non-protest” in an image might focus on areas with crowds or protest signs. In text, some sequence models (and certainly Transformers with attention mechanisms) allow extraction of attention weights to see which words were most attended to for a classification. If an LSTM classified a speech as containing hate speech, we might examine which words in the speech contributed heavily to that decision – perhaps identifying specific derogatory terms. This can be important for both understanding and justifying the model’s decisions (and for identifying when the model might be keying off of problematic biases, such as associating certain topics with certain groups unfairly).
- Interpretable Model Surrogates: Another approach is to train an interpretable surrogate model on the predictions of the neural network. For instance, one could use decision trees or rule-based models to approximate the behavior of the neural network in certain regions of the feature space. This is related to LIME but can be done globally: e.g., train a decision tree on the dataset where the “labels” are the neural network’s predictions. The tree might then provide a set of rules that roughly mimic the network. Caution is needed, since an approximation may not faithfully represent the true model in all cases, but it can sometimes reveal broad patterns. For example, a surrogate tree might yield rules like “IF (income > $50k) AND (age < 30) THEN predict high credit score” which could approximate what a neural net has learned, even if the net itself is not a decision tree (a short code sketch of a global surrogate follows this list).
- Network Dissection and Concept Analysis: Research by Bau et al. (2017) on network dissection shows that some neurons in CNNs learn to detect human-interpretable concepts (like “tree” or “door”) even without supervision on those concepts. They developed a method to systematically test each hidden unit in a vision model against a large set of concepts (objects, textures, colors) and found, for example, units that reliably activate for images of doors, or units for certain textures. This kind of analysis can be done to see if particular hidden units correspond to meaningful factors. In social science applications, one could imagine analyzing a network trained on survey data to see if any neuron’s activation correlates strongly with known indices (like an SES index or an ideology score), which might indicate the net internally constructed a similar concept.
- Causal Explainability: Recently, there is interest in going beyond correlational explanations to more causal ones. For example, counterfactual explanations try to answer: “what would need to change in this input for the model’s prediction to change in a desired way?” In a recidivism risk model, a counterfactual explanation might be: “If this individual had one fewer prior offense, the predicted risk score would drop below the threshold for detention.” This gives a more actionable explanation (it points to a change in input that alters output) and connects with ideas of fairness and algorithmic recourse. Some methods formulate this as an optimization problem: find the minimal change to the input features that yields a different outcome from the model.
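In its simplest form, a counterfactual explanation can be framed as searching for the closest input \(x'\) whose prediction differs from that of the original \(x\): minimize a distance \(d(x, x')\) subject to \(f(x') \neq f(x)\). In practice this constraint is usually relaxed into a penalized objective, roughly \(\lambda\,(f(x') - y^{*})^2 + d(x, x')\) for a desired outcome \(y^{*}\), which can be minimized by gradient descent; this is a common formulation in the counterfactual-explanation literature rather than a single canonical method.
Returning to the surrogate idea above, the following is a minimal sketch of a global surrogate in R. It assumes the MLP from earlier (mlp_model) and a data frame train_df holding the original features; these names, and the depth limit, are placeholders to adapt to your own objects:
# Global surrogate sketch: fit a shallow decision tree to mimic the network's predictions
# (`mlp_model` and `train_df` are placeholder names for objects created earlier)
library(rpart)
nn_class <- as.numeric(predict(mlp_model, as.matrix(train_df)) > 0.5)  # network's hard class predictions
surrogate <- rpart(factor(nn_class) ~ ., data = train_df, method = "class",
                   control = rpart.control(maxdepth = 3))              # shallow tree stays readable
print(surrogate)   # human-readable splits that approximate the network
# Fidelity check: how often does the surrogate agree with the network?
mean(predict(surrogate, train_df, type = "class") == factor(nn_class))
The depth limit trades fidelity for readability; the final line reports how often the surrogate reproduces the network's predictions, which should be checked before interpreting the surrogate's rules.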
In the social sciences, the need for transparency is not just academic – it’s often ethical or legal. For instance, if an algorithm is used in criminal justice or in allocation of social services, one must often provide reasons for decisions and ensure there is no hidden bias against protected groups. Neural networks themselves do not inherently avoid bias – in fact, if the training data reflect societal biases, the model can perpetuate them. Techniques like feature importance can help identify if certain sensitive attributes (like race or gender, or proxies thereof) are unduly influencing predictions, which might prompt retraining the model with fairness constraints or interpreting results with caution. In some cases, interpretable models (or post-hoc explanations) can uncover problematic patterns that were not apparent during training. For example, one might discover via LIME or SHAP that a resume-screening network was effectively using an applicant’s address as a proxy for race in decisions (because location strongly correlates with demographics in the data) – a red flag that would need addressing.
It is also possible to impose some interpretability at training time. For example, one could use a smaller network or add penalties that encourage sparse activations or use attention mechanisms that are inherently interpretable (in some cases, attention weights can be interpreted as a measure of importance of each part of the input). Another approach is building hybrid models – e.g., use a neural network to generate features or scores and then feed those into a traditional regression model, thereby capturing non-linearities in the feature generation but retaining an interpretable final model. An example of this might be using a CNN to score images of neighborhoods for “disorder level” and then using that score as a variable in a regression predicting crime rates. The regression is interpretable, and the CNN’s output can be interpreted as a single meaningful index (even if internally the CNN is complex).
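As a minimal sketch of that hybrid design in R (the data frame df and the variables crime_rate, disorder_score, median_income, and population_density are illustrative placeholders, with disorder_score assumed to have been produced beforehand by the CNN):
# Hybrid model sketch: a CNN-derived index enters an ordinary, interpretable regression
# (`df` and all variable names are placeholders; `disorder_score` comes from the CNN step)
hybrid_fit <- lm(crime_rate ~ disorder_score + median_income + population_density, data = df)
summary(hybrid_fit)   # coefficients on the learned index and the controls are directly interpretable
The regression output can then be reported and interpreted in the usual way, while the CNN remains a measurement device whose single output enters the substantive model.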
To illustrate one technique in R: we can use the lime package to explain a neural network's predictions on tabular data. For brevity, here is a conceptual mini-example using the MLP we trained earlier (treating it as a black box):
# Install lime if not already installed
# install.packages("lime")
library(lime)

# Our model expects a numeric matrix input, so we define model_type and
# predict_model methods for LIME
model_type.keras <- function(x, ...) 'classification'
predict_model.keras <- function(x, newdata, type, ...) {
  # newdata will be a data.frame; convert to matrix and get predictions
  preds <- x %>% predict(as.matrix(newdata))
  # Return a data frame of class probabilities (two classes: 0 and 1)
  data.frame(`0` = 1 - preds[, 1], `1` = preds[, 1])
}

# Create a lime explainer using training data (excluding the label column)
explainer <- lime(train_data[, c("X1", "X2")], mlp_model)
# Explain predictions for the first 5 test cases
explanation <- explain(test_data[1:5, c("X1", "X2")], explainer,
                       n_labels = 1, n_features = 2)
print(explanation[, 1:9])
This will output something like:
case label label_prob model_r2 model_intercept model_prediction feature feature_weight
1 1 1 0.992 0.75 0.500 0.950 X2<=0 0.445
2 1 1 0.992 0.75 0.500 0.950 X1<=0 0.005
3 2 0 0.998 0.67 0.500 0.020 X2<=0 0.367
4 2 0 0.998 0.67 0.500 0.020 X1 > 0 -0.847
...
This indicates, for example, that in Case 1 (with true label 1, predicted probability ~0.992 of class 1), LIME's local model had an \(R^2\) of 0.75 and predicted 0.950 for class 1 based on two rules: X2 <= 0 contributed +0.445 towards class 1 and X1 <= 0 contributed +0.005, which together with the intercept of 0.5 give the local prediction of 0.95. In other words, in that case both features being negative made the neural net lean towards class 1 (which matches the XOR pattern logic in our simulation: both negative means output 1). For Case 2 (true label 0, predicted probability ~0.002 of class 1), LIME shows X2 <= 0 contributing +0.367 towards class 1 but X1 > 0 contributing -0.847; with the intercept of 0.5, the local model predicts 0.02 for class 1 (i.e., a strong leaning towards class 0). These numbers are just illustrative, but the idea is that we get a human-readable explanation of each prediction in terms of the original features. In more realistic settings, we could request more features in the explanation (n_features = 5, etc.) and examine which features consistently show up as important.
In summary, while neural networks present challenges for interpretability, a variety of methods exist to extract insights from them. The level of explanation required depends on the use-case: for pure predictive tasks (like language translation or image tagging for internal research), a “black box” may be acceptable; for scientific inference or high-stakes decisions (like criminal justice or healthcare), interpretability is crucial. Social scientists should be aware of these tools and use them to ensure that when they do employ neural networks, they can explain and justify the findings to themselves and to others. Furthermore, such tools can help diagnose when a model might be exploiting undesirable patterns (e.g., proxies for protected attributes or dataset artifacts) and guide improvements.
8.7 Comparison with Traditional Methods
How do neural network models compare with more traditional statistical or machine learning methods commonly used in social science, such as linear/logistic regression or even tree-based models and support vector machines? We consider accuracy, flexibility, and transparency as key dimensions:
- Predictive Accuracy: Neural networks, when appropriately tuned and given sufficient data, often outperform simpler models on complex prediction tasks. Their ability to automatically model interactions and non-linear relationships means they can discover patterns that a linear model or a low-degree polynomial might miss. For example, in text or image analysis, linear models that rely on manually engineered features cannot match the accuracy of deep networks that learn features from raw data. In the conflict-forecasting example mentioned earlier, Beck, King, and Zeng (2000) found that a neural network approach improved prediction accuracy substantially over prior statistical models. However, the advantage is not universal: for many tabular datasets with limited samples and a strong signal-to-noise ratio, methods like gradient boosting machines (e.g., XGBoost) or even well-tuned logistic regressions can perform on par with neural nets. In fact, one reason neural nets have not displaced other methods in social science is that in low-data regimes, very deep models tend to overfit and may have no clear edge. Additionally, ensemble methods like random forests or boosted trees often yield strong performance with far less tuning.
- Flexibility: Neural networks are extremely flexible in terms of the data they can handle and the mappings they can learn. They can naturally incorporate unstructured data (images, text, audio) via CNNs, RNNs, etc., whereas traditional models often require a separate feature extraction step for such data. They can also be extended easily: e.g., one can create multi-task networks that simultaneously predict multiple outcomes, or networks that incorporate multiple input modalities (e.g., taking both text and numeric inputs by combining different subnetworks). Moreover, neural nets can learn internal representations that might be transferrable to other tasks (transfer learning) – something like a logistic regression doesn’t have internal layers to reuse. Traditional statistical models, on the other hand, are less flexible in structure – you often have to decide on interactions or transformations manually. That said, for purely structured data with defined features, tree-based models or linear models can be quite effective and are simpler to implement.
- Transparency and Interpretability: Here traditional methods have a clear advantage. A simple model like a linear regression provides coefficients that (under certain assumptions) directly quantify the effect of each predictor on the outcome. Decision trees yield human-readable rules (e.g., “IF income > $50k AND age < 30 THEN probability of voting = 0.8”). By contrast, a neural network with hundreds or thousands of weights does not provide a straightforward narrative of “X increases Y by Z units.” We must resort to the interpretability techniques discussed (SHAP, LIME, etc.) to get insight, and even then those are post-hoc explanations rather than an inherent part of the model. In many social science applications, explanation is part of the goal – we often care about understanding the social processes at work, not just predicting outcomes. If an algorithm is used in policy, being able to explain its decisions might be essential for it to be accepted. For this reason, interpretable models (or at least simpler proxy models) are often used alongside neural nets in studies: e.g., a researcher might report the results of a logistic regression for interpretability, even if a neural network was used as a robustness check or to validate that no nonlinear patterns were missed.
- Causal Inference and Theoretical Insight: Traditional methods, especially those in the econometrics toolkit, are closely linked to causal inference frameworks. Linear regressions with control variables, instrumental variables regression, difference-in-differences designs, etc., all have well-developed theoretical interpretations for causal estimation. Neural networks can be used as part of causal analysis (for example, to estimate propensity scores or conditional outcome models in a double ML approach), but the core causal identification strategy usually relies on the same old assumptions (ignorability, exclusion restrictions, parallel trends, etc.). Neural nets typically do not provide confidence intervals or significance tests out-of-the-box, whereas traditional methods often do (though one can bootstrap a neural net or use Bayesian versions to get uncertainty estimates). For a social scientist aiming to test a theory or estimate a specific effect, a neural network alone might not be ideal – but it could complement by capturing nuisance functions or suggesting new hypotheses. For example, Mullainathan and Spiess (2017) argued that machine learning is primarily about prediction, and its role in econometrics is to be used for tasks like prediction of counterfactuals or discovery of heterogeneity, while the core inference about causal parameters remains a separate step.
- Scalability: When it comes to very large datasets, neural networks (with GPU acceleration and mini-batch SGD) can scale to millions of examples and high-dimensional inputs. Traditional statistical models can also scale in their own ways (e.g., using stochastic gradient descent for logistic regression or large matrix solvers), but off-the-shelf implementations might struggle with really large data. However, training very large neural networks can be expensive and time-consuming, and they may require specialized hardware (GPUs/TPUs). In social science, datasets are rarely as large as those in commercial deep learning (like billions of tokens or millions of images), except perhaps some forms of text data or network data from the web. So scalability is usually not the limiting factor – data availability is.
In contexts with small data, one often finds that a neural network easily overfits and a regularized linear or tree model performs better. For example, if you have a survey with 500 respondents and 20 predictors, a carefully specified logistic regression (maybe with polynomial terms or interactions chosen based on theory) could outperform a 3-layer neural net that has no guidance and ends up overfitting the noise. A rule of thumb sometimes cited is that you need an order of magnitude more data (in terms of number of training examples) than you have parameters in your neural network to reliably avoid overfitting (though techniques like regularization and transfer learning complicate this simple picture). In many social science problems, we simply don’t have tens of thousands of examples, so simpler models are not only more interpretable but necessary to avoid overfitting.
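To make the rule of thumb concrete with some back-of-the-envelope arithmetic (the architecture here is purely illustrative): a fully connected network with 20 inputs, hidden layers of 64 and 32 ReLU units, and a single sigmoid output has \((20 \times 64 + 64) + (64 \times 32 + 32) + (32 \times 1 + 1) = 3{,}457\) trainable weights and biases, roughly seven times the 500 observations in the hypothetical survey, so without aggressive regularization (or a much smaller architecture) overfitting is almost guaranteed.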
However, in scenarios where you do have rich data (high-dimensional, possibly unstructured, or non-linear signals) and sufficient sample size, neural nets can shine. For instance, if analyzing text from thousands of political speeches to predict a rating of populist vs. technocratic rhetoric, a neural network that learns its own text features may outperform a bag-of-words SVM or a dictionary-based approach, because it can capture subtle phrasing differences and context. Similarly, for predicting policy outcomes from a combination of numeric indicators, social network metrics, and text sentiment, one could build a unified neural network that ingests all these data types, whereas a traditional approach might have to reduce everything to a set of summary indices first.
A pragmatic view is that neural networks complement rather than outright replace traditional methods in the social scientist’s toolkit. One might use neural nets to explore data or to validate that more rigid models aren’t missing something. For example, after running a neural network, you might inspect what variables it found important (via SHAP values) and realize that a certain interaction is important – you could then include that interaction explicitly in a logistic regression and confirm it’s significant and aligns with theory. Conversely, one might use a regression to summarize what a network is doing, as a way to communicate results in a familiar format (e.g., “Using a neural network, we find that the marginal effect of education on income is larger at higher levels of experience, consistent with a complementarity hypothesis”).
In high-stakes decision contexts (loans, criminal justice), there is an ongoing debate about using black-box models vs. interpretable models. Rudin (2019) strongly argues that for high-stakes decisions, one should use interpretable models whenever possible rather than relying on post-hoc explanations of black-boxes. Her point is that an inherently interpretable model (like a sparse rule list or a transparent scoring system) can often be built with little loss in accuracy, and it avoids the risk that the black box might be right for the wrong reasons (which explanations might not fully catch). On the other hand, proponents of black-box use (with explanation) claim that sometimes accuracy is paramount (say, diagnosing cancer from an MRI), and as long as we carefully check the model for bias, the improved accuracy can save lives or resources, even if the model isn’t fully interpretable.
For social scientists, the takeaway is: use the right tool for the job. If a simple model suffices and yields insight, there’s no need to complicate things with a deep network. If the problem involves data types or nonlinear patterns that simpler models can’t handle well, then consider a neural network, but accompany it with appropriate interpretation and validation. And in many cases, consider using both: a neural network for predictive performance or exploratory analysis, and a traditional model for confirmatory analysis or presentation. This way you get the benefits of both – the neural net can uncover patterns and provide a benchmark for maximum predictive power, while the simpler model can test hypotheses and communicate relationships clearly.
8.9 Conclusion
Neural networks offer powerful new tools for social scientists, enabling the modeling of complex patterns in data that were previously difficult or impossible to capture. In this chapter, we have covered the landscape of neural networks in social science applications, moving from theoretical foundations to practical implementation in R. We discussed feedforward neural networks (MLPs) and how they can model non-linear relationships and interactions; convolutional neural networks (CNNs) for handling structured inputs like images or spatial data; and recurrent networks (LSTMs) for sequence data and time series. The mathematical underpinnings – from activation functions to backpropagation – provide insight into how these models learn from data. With the R code examples, we demonstrated how to build and train these networks using modern libraries like Keras, showing that even relatively few lines of code can set up a sophisticated model.
Importantly, we tackled the distinction between predictive modeling and causal inference. Neural networks excel at prediction given enough data, and we showed scenarios where they clearly outperform traditional approaches in predictive accuracy (e.g., the XOR example, or citing improvements in conflict prediction). However, we also emphasized caution when it comes to interpreting these models for causal insights – often a direct causal interpretation of a deep model is not possible without further assumptions or methods. We described how one might integrate neural nets into causal analysis carefully (e.g., using them for propensity score or outcome modeling in a double ML framework, or using them to discover potential interactions which are then tested in a causal model). In practice, a judicious approach might use neural networks to improve certain components of an analysis (like imputation or proxy variable construction) while still relying on more interpretable models or established causal inference techniques for the core analysis.
We also delved into practical aspects of model training: how to choose loss functions, how gradient descent and its variants (SGD, Adam) work, and how to use regularization methods (dropout, weight decay, etc.) to prevent overfitting. These are essential for any applied work because a poorly trained network is no better than a random guess or a misleading curve fit. Through examples and discussion, we highlighted how to monitor training and tune hyperparameters, using validation data to guide decisions. We emphasized that data preprocessing (normalization, encoding) is often as important as model architecture in getting networks to train properly.
The section on interpretability and explainability addressed the black-box critique of neural nets. We presented methods such as LIME and SHAP for explaining individual predictions, and stressed the importance of transparency especially in policy-relevant applications. We gave an example of using LIME in R to interpret an MLP’s decisions, illustrating how even a complex model can be probed to yield understandable insights (like which features were driving a prediction). This is crucial: if neural networks are to be used in social science research, researchers must ensure they can interpret and validate what the model is doing, to avoid drawing false substantive conclusions or deploying biased algorithms. The array of XAI tools available today makes it feasible to open up the black box to a significant degree, though it requires extra effort.
In comparing neural networks with traditional methods, a theme emerged that each has its place. Neural networks bring flexibility and often better pure predictive power (especially with rich data), while traditional models bring simplicity and interpretability. We highlighted scenarios where neural networks add value (complex interactions, high-dimensional data, text/image analysis) and where they may not (very small datasets, where interpretability is paramount and patterns are linear enough). We also noted that these approaches can be combined – e.g., using a neural net for feature learning and a regression for the final analysis, or using regression to summarize a neural net model’s behavior. The social scientist’s goal is often to maximize insight, not just accuracy, and sometimes the insight comes from the combination of sophisticated algorithms and human interpretation/theory.
We covered practical challenges such as small sample sizes, class imbalance, computational constraints, and ethical issues. For each, we gave tips: e.g., use transfer learning for small data, class weighting for imbalance, GPU/cloud resources for heavy computation, and fairness checks for ethical considerations. These are the nuts-and-bolts issues one encounters when actually trying to use neural nets on social data, and addressing them is key to a successful project. As with any method, using neural networks responsibly means understanding their limitations and failure modes (like overfitting or bias) and proactively mitigating them.
A recurring message is that neural networks do not replace the need for theory and careful research design. Rather, they are tools that can uncover patterns we might otherwise miss, or improve predictions/measurements that feed into larger analyses. For example, a neural network might produce a better measure of ideology from text, which a political scientist can then use in a regression to test a hypothesis about legislative behavior (as in Knox, Lucas, & Cho, 2022’s discussion of learned proxies). The theory about legislative behavior remains critical – the neural network is just improving the measurement of one variable. Likewise, a neural network might predict protests, but a social scientist still needs to interpret why those factors matter and what it means for theories of collective action or political instability.
Looking forward, as social phenomena generate increasingly complex and large-scale data (from social media, sensors, digital trace data, etc.), neural networks and deep learning are likely to play a growing role in social science research. Areas like computational sociology, political text analysis, and economic applications of machine learning are already burgeoning. At the same time, the barriers to entry are falling: with high-level APIs and many pre-trained models available, one does not need a Ph.D. in computer science to apply these methods. What one does need is a strong grasp of research design and domain knowledge, so that the questions asked of the data are meaningful and the results are interpreted correctly. A danger with any powerful technique is the potential for misuse (data mining without theory, finding spurious “significant” patterns, etc.). By combining the strengths of neural networks (flexibility, performance) with the rigor of social science methodology (validity, theory-driven inquiry), researchers can unlock new insights while avoiding pitfalls.
Neural networks are a powerful addition to the social scientist’s analytic toolkit – but they should be used thoughtfully. They can uncover patterns and improve predictions in ways that open up new research questions and practical solutions (e.g., more accurate early warning systems for crises, better measurement of latent social traits, etc.). At the same time, they come with the responsibility to ensure interpretability, fairness, and robustness. The chapter has aimed to equip readers with both the how (implementation in R) and the when/why (appropriate use cases and limitations) of neural networks in social science research. The hope is that readers will feel empowered to experiment with these methods in their own work – whether it’s predicting an election, analyzing survey open-end texts, or modeling the evolution of a social network – while maintaining the critical perspective of a social scientist. With rigorous exposition and reproducible code examples, this chapter serves as a bridge between the exciting developments in deep learning and the rich, nuanced problems of social science, encouraging a fruitful interplay between the two.
References
Chollet, F., & Allaire, J. J. (2018). Deep learning with R. Manning.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Hochreiter, S., & Schmidhuber, J. (1997). Long short‐term memory. Neural Computation, 9(8), 1735–1780.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In F. Bach & D. Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning (pp. 448–456). PMLR.
Johansson, F., Shalit, U., & Sontag, D. (2016). Learning representations for counterfactual inference. In M. Balcan & K. Q. Weinberger (Eds.), Proceedings of the 33rd International Conference on Machine Learning (pp. 3020–3029). PMLR.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR 2015). arXiv:1412.6980.
Koch, B., Sainburg, T., Bastías, P. G., Jiang, S., Sun, Y., & Foster, J. (2024). A primer on deep learning for causal inference. Sociological Methods & Research, 54(2), 397–447.
Knox, D., Lucas, C., & Cho, W. K. T. (2022). Testing causal theories with learned proxies. Annual Review of Political Science, 25, 419–441.
Lam, O., Hughes, A., & Wojcik, S. (2019, January 30). How social scientists can use transfer learning to kick‑start a deep learning project. Pew Research Center: Decoded.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient‑based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Mullainathan, S., & Spiess, J. (2017). Machine learning: An applied econometric approach. Journal of Economic Perspectives, 31(2), 87–106.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135–1144). ACM.
Rudin, C. (2019). Stop explaining black box machine learning models for high‑stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206–215.
Shalit, U., Johansson, F. D., & Sontag, D. (2017). Estimating individual treatment effect: Generalization bounds and algorithms. In D. Precup & Y. W. Teh (Eds.), Proceedings of the 34th International Conference on Machine Learning (pp. 3076–3085). PMLR.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon et al. (Eds.), Advances in Neural Information Processing Systems, 30 (pp. 5998–6008). Curran Associates.
Yan, X., Zhao, J., Ding, W., & Luo, X. (2020). Estimating city‑scale passenger‑car fuel consumption using street‑view images. Computers, Environment and Urban Systems, 82, 101489.