8 Neural Networks in Social Science Research
In recent years, deep neural networks have achieved remarkable success across domains such as computer vision, speech recognition, and natural language processing – tackling problems that had resisted the best attempts of the AI community for many years. This rapid progress has been driven by increases in computing power and the availability of large datasets. The uptake of neural network methods in the social sciences, however, has been relatively slow. Social science researchers have begun to adopt machine learning techniques, often using them to construct proxy variables or to improve prediction in empirical studies. For example, in a review of top political science journals from 2018–2020, Knox et al. (2022) identified 48 papers that employed statistical learning or other computational methods, and in about 68% of those cases the machine learning was used to first estimate a proxy of a latent concept (which would then be used in subsequent analysis). Tellingly – and illustrating that recent breakthroughs in deep learning and the growing use of computational methods in social science have largely occurred in parallel rather than in tandem – only one of those 48 studies used a neural network (the rest used methods like random forests, SVMs, etc.).
Several factors may explain social scientists’ hesitance to embrace neural networks. Traditional statistical learning approaches (e.g., penalized regression, decision trees) have well-understood statistical properties, making it easier to quantify uncertainty and correct biases, whereas neural networks have been viewed as more of a “black box.” Additionally, until recently, neural networks demanded more computational resources and training data than many social science applications could readily supply. Social science datasets are often modest in size (hundreds or thousands of observations) compared to the massive datasets used to train state-of-the-art deep learning models. Concerns about overfitting, interpretability, and the complexity of model tuning further contribute to caution in applying neural nets to social data. In short, machine learning has largely been used in social science for prediction tasks (where black-box accuracy can be useful) rather than for modeling theoretical relationships.
Despite these challenges, there is a growing recognition that neural networks can complement and extend the toolkit of social scientists. Neural networks are universal function approximators, capable of learning complex non-linear relationships that might be missed by traditional linear or additive models. They excel at predictive tasks, especially with high-dimensional or unstructured data (such as text, images, or networks) where feature engineering is difficult. Indeed, when predictive accuracy is the primary goal (rather than estimating interpretable causal effects), the benefits of more flexible, non-linear models can outweigh the loss of interpretability. Early work in political science demonstrated that a neural network model could uncover structural patterns in international conflict data and substantially improve out-of-sample forecast accuracy over prior approaches. As data sources proliferate and computational barriers recede, neural networks are becoming increasingly viable for social science research problems.
This chapter provides an in-depth exploration of neural network models in the context of social science applications, written as an executable R Markdown document. We cover both predictive modeling (e.g., forecasting or classification tasks) and causal inference (estimating the effects of interventions or treatments) use cases for neural networks. We begin with the theoretical foundations of different neural architectures – including feedforward fully-connected networks, convolutional neural networks, and recurrent (sequence) networks – and discuss how these can be utilized in social science research. We then demonstrate implementation in R, using packages such as keras, tensorflow, and torch. Throughout, we address important practical topics: data preprocessing, choice of activation and loss functions, the backpropagation algorithm and training optimization (SGD, Adam, etc.), regularization strategies to prevent overfitting (dropout, L2 weight decay), and techniques for model evaluation. We also discuss strategies for interpretability and explainability, which are crucial for the adoption of neural nets in domains where understanding the basis of a prediction is as important as the prediction’s accuracy. Additionally, we compare neural networks with more traditional statistical models in terms of accuracy, flexibility, and transparency, highlighting scenarios where neural networks add value – and where they may not. Finally, we consider practical challenges such as small sample sizes and class imbalance that are often encountered in social science data, and how one can adapt or mitigate these issues (for example, through transfer learning, data augmentation, or specialized architectures).
The goal is to provide researchers and graduate students with a rigorous yet accessible guide to applying neural network models in the social sciences, blending theoretical insight with hands-on example code. All code chunks in this chapter are written in R and can be executed to reproduce the results (assuming the required packages and data are available). By the end of the chapter, readers should understand both how to implement neural network analyses in R and, importantly, when and why such methods can be beneficial in social science research.
8.1 Predictive Modeling vs. Causal Inference with Neural Networks
Social science research encompasses two broad analytical goals: predictive modeling and causal inference. Prediction focuses on accurately forecasting or classifying outcomes – for example, predicting election results, identifying individuals at risk of recidivism, or classifying topics in open-ended survey responses. Causal inference, on the other hand, is concerned with estimating the effect of some treatment or intervention on an outcome – for instance, what is the impact of a job training program on subsequent earnings, or how does exposure to misinformation affect political attitudes. These goals involve fundamentally different criteria for success. Prediction is about minimizing error on new data, whereas causal inference is about isolating a credible estimate of a causal effect (often requiring control of confounding and consideration of counterfactuals).
Machine learning methods, including deep neural networks, have primarily been developed as predictive tools. Their objective is to learn patterns that generalize well to new data. In fields like computer vision or language processing, complex neural architectures have achieved stunning predictive performance by fitting flexible functions to large datasets. Social scientists have begun harnessing this predictive power for tasks such as constructing proxies for latent theoretical constructs (e.g., using a text classifier to measure the sentiment or ideology in a speech) and for improving the accuracy of forecasts in policy and economics contexts. When the goal is prediction alone, the “black box” nature of neural networks is less of a concern – a highly accurate black box can be extremely useful for tasks like predictive policing or early warning systems for social unrest, even if its inner workings are not fully transparent.
Causal inference poses additional challenges for the use of neural networks. In causal analysis, we are typically interested not just in predicting \(Y\) from \(X\), but in understanding how \(Y\) would change under an intervention (e.g., setting \(X = \text{treatment}\) versus \(X = \text{control}\)). This requires disentangling correlation from causation and ensuring that the model properly accounts for confounding factors and biases. A model that is excellent at prediction may still yield biased estimates of causal effects if, for example, it leverages spurious correlations that do not reflect true causal relationships. As Pierce (2023) notes, “many of the most striking examples of recent machine learning progress entail neural networks learning complex correlations from a large data distribution for predictive purposes, whereas a lot of social science research is more interested in studying how those observed distributions would change under a causal intervention”. In other words, social scientists often seek to model the data-generating process and counterfactual outcomes, rather than just the observed joint distribution. This distinction means that directly applying a flexible non-linear model like a neural network to observational data can lead to overfitting to noise or confounding, threatening the validity of causal conclusions.
Despite these differences, there is a growing interface between deep learning and causal inference. Recent research in computational social science and econometrics has begun to adapt neural network architectures to estimate causal effects under the potential outcomes framework (see Koch et al., 2024 for a review). For example, neural networks have been used to learn balanced representations of covariates that make treated and control groups more comparable, improving estimation of treatment effects in observational studies. Specifically, Johansson, Shalit, and Sontag (2016) demonstrated how a neural network can transform covariates into a representation space where the distributions of treated vs. control units are closer, facilitating more accurate counterfactual prediction. Subsequent work (Shalit, Johansson, & Sontag, 2017) introduced the TARNet architecture – a simple two-headed feedforward network that learns potential outcomes for treatment and control – and extensions like DragonNet that add a propensity prediction head to enforce causal identification. Other researchers have integrated deep learning into propensity score estimation, instrumental variable analysis, and heterogeneous treatment effect estimation. For instance, methods in the “meta-learner” framework (T-learners, S-learners, X-learners) can incorporate neural networks as the base learners to capture complex response surfaces, and custom architectures have been proposed for causal forests or GAN-based IV estimation. While these methods are on the cutting edge, they highlight that neural networks can be used for causal inference – but usually with additional structure or assumptions to ensure identification of causal effects. Importantly, standard practices from the causal inference literature (such as holdout validation to avoid overfitting bias, and sensitivity analyses for unobserved confounding) remain crucial when using neural networks for causal questions.
In practice, a useful division of labor is emerging: one can leverage neural networks to improve predictive tasks within a larger causal analysis. For example, one might use a neural network to predict a proxy or to impute missing data, and then use those predictions in a more traditional causal model. Or one can use deep learning to estimate nuisance functions (like propensity scores or baseline outcome functions) within frameworks like double machine learning. Double/debiased ML methods allow flexible fitting of high-dimensional nuisance parameters (using ML algorithms) while retaining theoretical guarantees (Neyman orthogonality and cross-fitting) for the causal parameter of interest. The key is recognizing that for causal inference, accuracy in fitting the observed data is not the only goal – we also need interpretability and a model structure that connects to the counterfactual question at hand.
Throughout this chapter, we will highlight when we are focusing on pure prediction and when causal interpretation is (or is not) valid. The next sections delve into different neural network architectures and their mathematical foundations, laying the groundwork for applied examples in R that follow.
8.2 Feedforward Neural Networks (Multi-Layer Perceptrons)
Theory and Mathematical Foundations
Feedforward neural networks, also known as multi-layer perceptrons (MLPs) or dense neural networks, are the quintessential deep learning architecture. These models consist of layers of interconnected artificial neurons where information flows in one direction from input to output (hence “feedforward”). Feedforward networks are general function approximators: given enough hidden units, an MLP with at least one hidden layer can approximate any continuous function arbitrarily well under mild conditions (this is the Universal Approximation Theorem). This universal approximation property underlies the power of neural networks to model complex relationships.
Neurons and Layers: The basic unit in a feedforward network is the neuron (or node), which performs a weighted linear combination of its inputs and then applies a non-linear activation function. In mathematical form, for neuron \(j\) in a given layer, the computation is:
\[ z_j = b_j + \sum_{i} w_{ij} x_i, \]
\[ a_j = f(z_j), \]
where \(x_i\) are the inputs to the neuron (these could be raw features or outputs from neurons in a previous layer), \(w_{ij}\) are the weights, \(b_j\) is a bias term, \(z_j\) is the linear combination (sometimes called the logit or pre-activation), and \(f(\cdot)\) is the activation function producing the neuron’s output \(a_j\). Common choices for \(f\) include the sigmoid \(\sigma(z) = 1/(1+e^{-z})\), the hyperbolic tangent \(\tanh(z)\), and the now-ubiquitous ReLU (Rectified Linear Unit) \(f(z) = \max(0, z)\), among others. Non-linear activations are critical; without them, multiple layers would collapse into an equivalent single linear model.
A feedforward network is organized into layers: an input layer that takes the features (covariates) as inputs, one or more hidden layers of neurons that transform the inputs into intermediate representations, and an output layer that produces the final prediction. Each neuron in a layer typically connects to every neuron in the next layer (a fully-connected network). For example, a classic MLP architecture for a binary classification problem might have an input layer with \(d\) inputs, one hidden layer with \(h\) neurons, and an output layer with a single neuron producing a probability (via a sigmoid activation). This architecture would contain \((d \times h)\) weights between the input and hidden layer, plus \(h\) bias terms in the hidden layer, and \(h\) weights plus one bias in the output layer.
Forward Propagation: During a forward pass, data \(X\) is fed into the input layer and transformed layer by layer through these weighted sums and activations to produce an output \(\hat{y}\). This output could be a scalar (for regression or binary classification) or a vector of class probabilities (for multi-class classification, often obtained via a softmax activation in the output layer). The capacity of the network (its ability to fit complex functions) can be increased by adding more neurons or more layers, at the cost of greater computational demand and risk of overfitting.
Illustrative Example – XOR Problem: A simple example of why hidden layers are useful is the classic XOR problem. Consider two binary inputs \(x_1, x_2\) and an output \(y\) that should be 1 if exactly one of \(x_1, x_2\) is 1 (exclusive or), and 0 otherwise. A linear model (logistic regression) cannot capture this non-linear pattern – it will effectively draw a single separating line which cannot separate the XOR cases. However, a two-layer neural network with a few hidden neurons can learn this pattern by creating intermediate features (hidden layer outputs) that act like logical subfunctions. This demonstrates that even for small problems, an MLP can capture interactions that a linear model would miss.
Mathematically, an MLP with one hidden layer can be written as:
\[ \mathbf{h} = f(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}), \]
\[ \hat{\mathbf{y}} = g(\mathbf{W}^{(2)} \mathbf{h} + \mathbf{b}^{(2)}), \]
where \(\mathbf{x}\) is the input vector (length \(d\)), \(\mathbf{h}\) is the hidden layer activation vector (length \(h\)), and \(\hat{\mathbf{y}}\) is the output (for simplicity here, assume a vector of length \(o\) for possibly multiple outputs). \(\mathbf{W}^{(1)}\) is an \(h \times d\) weight matrix for the first layer, \(\mathbf{b}^{(1)}\) is a bias vector of length \(h\), \(\mathbf{W}^{(2)}\) is an \(o \times h\) weight matrix for the second layer, and \(\mathbf{b}^{(2)}\) is bias of length \(o\). The functions \(f(\cdot)\) and \(g(\cdot)\) are activation functions (they could be the same or differ; often \(f\) is a non-linear like ReLU or tanh, while \(g\) might be a sigmoid or softmax appropriate to the task). This formulation can be extended to additional layers in the obvious way.
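To make the matrix notation concrete, the following base-R sketch (with arbitrary dimensions and random weights, purely for illustration) carries out a single forward pass through a one-hidden-layer MLP with ReLU hidden units and a sigmoid output:
# Minimal forward pass of a one-hidden-layer MLP (illustrative values only)
set.seed(1)
d <- 3; h <- 4                         # input and hidden layer sizes
x  <- rnorm(d)                         # one input vector
W1 <- matrix(rnorm(h * d), nrow = h)   # h x d first-layer weights
b1 <- rnorm(h)                         # hidden-layer biases
W2 <- matrix(rnorm(h), nrow = 1)       # 1 x h output-layer weights
b2 <- rnorm(1)                         # output bias
relu    <- function(z) pmax(z, 0)
sigmoid <- function(z) 1 / (1 + exp(-z))
hid  <- relu(W1 %*% x + b1)            # hidden activations: f(W^(1) x + b^(1))
yhat <- sigmoid(W2 %*% hid + b2)       # output: g(W^(2) h + b^(2))
yhat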
Relationship to Traditional Models: It is helpful to note that many traditional statistical models are special cases of neural networks. For instance, a standard linear regression or logistic regression can be seen as a one-layer neural network (no hidden layer) with an identity or sigmoid output activation, respectively. In that sense, neural networks generalize these models by adding depth (hidden layers) and allowing more complex feature transformations. This also means that at small scales, an MLP can mimic a linear model. The advantage comes when there are complex non-linear interactions: an MLP can learn those from data, whereas a linear model would require manually adding interaction terms or nonlinear transformations of inputs. As an example, one study in political science posited that the effects of certain variables on conflict might only manifest in particular combinations – a neural network was able to capture such conditional relationships automatically, whereas a traditional logistic regression required the researcher to explicitly specify interaction terms (Beck, King, & Zeng, 2000).
Implementation Example: A Predictive MLP in R
To illustrate how a feedforward neural network can be applied to a social science problem, consider a hypothetical predictive task: we have a dataset of individuals with two features \(X_1\) and \(X_2\) (which could represent, say, two test scores or socio-economic indicators), and we want to predict a binary outcome \(Y\) (e.g., whether the person will graduate from college). Suppose the true relationship is that \(Y=1\) only if both \(X_1\) and \(X_2\) are above certain thresholds – in other words, the effect of \(X_1\) on the outcome is conditioned on \(X_2\) being high, a form of interaction. A linear model without an interaction term would struggle in this scenario, effectively averaging the effect of each \(X\) across all levels of the other. We will simulate such a scenario and then train both a logistic regression and a neural network to highlight the difference.
First, we simulate a dataset in R:
set.seed(123)
N <- 1000
X1 <- rnorm(N)
X2 <- rnorm(N)
# Define Y such that it's 1 if X1 * X2 > 0 (both positive or both negative)
Y <- ifelse(X1 * X2 > 0, 1, 0)
data <- data.frame(X1, X2, Y = factor(Y))
head(data)
In this simulated data, \(Y=1\) occurs when \(X_1\) and \(X_2\) have the same sign (both above or both below 0), which is a non-linear relationship. A logistic regression that uses \(X_1\) and \(X_2\) as additive terms (no interaction) will effectively find no predictive power (since \(P(Y=1)\approx0.5\) regardless of one variable alone). A neural network with a hidden layer can learn to multiply or otherwise combine \(X_1\) and \(X_2\) to capture this interaction.
We split the data into training and test sets and fit both models:
# Split into training and test sets
train_idx <- sample(1:N, size = 0.7 * N)
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]

# Fit a logistic regression without interaction
glm_model <- glm(Y ~ X1 + X2, data = train_data, family = binomial())

# Fit a neural network (MLP) with one hidden layer
library(keras)
# Define a simple sequential model
mlp_model <- keras_model_sequential() %>%
  layer_dense(units = 4, activation = 'relu', input_shape = 2) %>%   # 2 inputs -> 4 hidden units
  layer_dense(units = 1, activation = 'sigmoid')                     # output layer for binary classification

mlp_model %>% compile(
  optimizer = optimizer_adam(),
  loss = 'binary_crossentropy',
  metrics = 'accuracy'
)

history <- mlp_model %>% fit(
  as.matrix(train_data[, c("X1", "X2")]), as.numeric(train_data$Y) - 1,   # convert factor to {0,1}
  epochs = 50, batch_size = 32, verbose = 0,
  validation_split = 0.2
)
Here we used the Keras API for R to define a simple network: 2 input features, one hidden layer with 4 ReLU neurons, and an output sigmoid neuron for the probability of \(Y=1\). We compile the model with the binary cross-entropy loss (appropriate for binary classification) and the Adam optimizer (an efficient variant of stochastic gradient descent discussed later). We train for 50 epochs (iterations over the data) with a mini-batch size of 32.
Now, we evaluate both models on the test set:
# Predictions and accuracy on test set
glm_probs <- predict(glm_model, newdata = test_data, type = "response")
glm_preds <- ifelse(glm_probs > 0.5, 1, 0)

nn_probs <- mlp_model %>% predict(as.matrix(test_data[, c("X1", "X2")]))
nn_preds <- ifelse(nn_probs > 0.5, 1, 0)

glm_acc <- mean(glm_preds == (as.numeric(test_data$Y) - 1))
nn_acc <- mean(nn_preds == (as.numeric(test_data$Y) - 1))
sprintf("Test Accuracy - Logistic Regression: %.3f, Neural Network: %.3f", glm_acc, nn_acc)
If you run the above code, you will likely observe that the logistic regression achieves an accuracy around 50% (no better than chance), while the neural network is far more accurate (often >90% in this example). The neural network has learned the interaction between \(X_1\) and \(X_2\) implicitly, whereas the simple logistic model without an \(X_1 \times X_2\) term could not capture it. (For fairness, if we included the interaction term \(X1 \cdot X2\) in the logistic model, it would then perform well on this synthetic task – but the point is that the neural network discovered that interaction on its own from the data.)
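For completeness, the fairness check mentioned above can be run directly: refit the logistic regression with an explicit interaction term on the same simulated training data and evaluate it on the test set (a quick sketch, reusing the objects created earlier):
# Logistic regression with an explicit X1:X2 interaction term
glm_int   <- glm(Y ~ X1 * X2, data = train_data, family = binomial())
int_probs <- predict(glm_int, newdata = test_data, type = "response")
int_preds <- ifelse(int_probs > 0.5, 1, 0)
mean(int_preds == (as.numeric(test_data$Y) - 1))   # accuracy should now be far above 0.5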
This toy example mirrors real-world scenarios in social science where outcomes may depend on non-linear combinations of predictors. For instance, perhaps economic development and regime type interact in affecting civil conflict onset – high economic development might reduce conflict risk, but only for democracies and not autocracies. A researcher might not know the exact form of such interactions a priori. Neural networks offer a way to automatically model complex interactions and non-linearities, providing potentially better predictive performance than misspecified linear models (as demonstrated by Beck et al., 2000 in the context of conflict prediction). Of course, the trade-off is that the neural network’s model is less interpretable than a simple logistic regression with a few coefficients. Throughout this chapter, we will return to this trade-off between predictive power and interpretability, and discuss methods to peek inside the “black box” of neural networks.
Before moving on, it’s instructive to examine the learned neural network model. We can inspect the model architecture and number of parameters:
mlp_model %>% summary()
This prints a summary of the model:
Model: "sequential"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
dense_1 (Dense) (None, 4) 12
dense_2 (Dense) (None, 1) 5
================================================================================
Total params: 17
Trainable params: 17
Non-trainable params: 0
________________________________________________________________________________
We see that the model has 17 trainable parameters in total: 12 in the first dense layer (which corresponds to \(2 \times 4\) weights plus 4 biases) and 5 in the second layer (\(4 \times 1\) weight matrix plus 1 bias). Despite this small size, the model was expressive enough to fit the XOR-like pattern. In practice, we often use many more hidden units and possibly multiple hidden layers for harder problems. For example, if we had dozens of input features (survey responses, demographic variables, etc.), a single hidden layer with 4 units might be too limited to capture all interactions, whereas a deeper or wider network could perform better. Deciding on the network architecture (number of layers and units) is an important part of model design and typically requires some experimentation or cross-validation.
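As a sketch of such experimentation (the candidate widths and epoch budget here are arbitrary), one can refit the model from the previous example with different hidden-layer sizes and compare validation accuracy; note that the name of the stored validation metric (val_accuracy vs. val_acc) can differ across keras versions:
# Sketch: compare hidden-layer widths by validation accuracy
units_grid <- c(2, 4, 8, 16)
val_acc <- sapply(units_grid, function(h) {
  model <- keras_model_sequential() %>%
    layer_dense(units = h, activation = 'relu', input_shape = 2) %>%
    layer_dense(units = 1, activation = 'sigmoid')
  model %>% compile(optimizer = optimizer_adam(),
                    loss = 'binary_crossentropy', metrics = 'accuracy')
  hist <- model %>% fit(
    as.matrix(train_data[, c("X1", "X2")]), as.numeric(train_data$Y) - 1,
    epochs = 30, batch_size = 32, validation_split = 0.2, verbose = 0
  )
  tail(hist$metrics$val_accuracy, 1)   # validation accuracy at the final epoch
})
data.frame(hidden_units = units_grid, val_accuracy = round(val_acc, 3))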
8.3 Convolutional Neural Networks (CNNs)
Example: Image Classification with a CNN (in R)
As a demonstration, we will build a simple CNN in R using the keras package to classify images. While our example may use a generic image dataset for illustration (the MNIST dataset of handwritten digits, since it is readily available), one can imagine analogous social science uses – for instance, classifying satellite images of neighborhoods by poverty level, or detecting whether an online profile picture is of a real person vs. a bot.
First, we load image data. We’ll use MNIST (handwritten digits) for a quick example:
library(keras)
mnist <- dataset_mnist()
train_x <- mnist$train$x
train_y <- mnist$train$y
test_x <- mnist$test$x
test_y <- mnist$test$y

# Preprocess: reshape and rescale
train_x <- array_reshape(train_x, c(nrow(train_x), 28, 28, 1)) / 255
test_x <- array_reshape(test_x, c(nrow(test_x), 28, 28, 1)) / 255
train_y <- to_categorical(train_y, 10)
test_y <- to_categorical(test_y, 10)
The above prepares the data: we reshape the images to 28x28 with 1 channel (grayscale), and scale pixel values to [0,1]. The labels are one-hot encoded for 10 classes (digits 0-9).
Now, we define a simple CNN model:
cnn_model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 8, kernel_size = c(3,3), activation = 'relu',
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_conv_2d(filters = 16, kernel_size = c(3,3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_flatten() %>%
  layer_dense(units = 10, activation = 'softmax')

cnn_model %>% compile(
  optimizer = optimizer_adam(),
  loss = 'categorical_crossentropy',
  metrics = 'accuracy'
)

cnn_model %>% summary()
Our CNN has two convolutional layers: the first with 8 filters of size 3x3, the second with 16 filters of size 3x3. Each conv layer is followed by 2x2 max pooling to reduce dimensionality. After the conv layers, we flatten the feature maps and have a dense output layer with 10 units (softmax for multi-class probabilities). The model summary would look like:
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
conv2d_1 (Conv2D) (None, 26, 26, 8) 80
max_pooling2d_1 (MaxPooling2D) (None, 13, 13, 8) 0
conv2d_2 (Conv2D) (None, 11, 11, 16) 1168
max_pooling2d_2 (MaxPooling2D) (None, 5, 5, 16) 0
flatten_1 (Flatten) (None, 400) 0
dense_3 (Dense) (None, 10) 4010
================================================================================
Total params: 5258
Trainable params: 5258
Non-trainable params: 0
________________________________________________________________________________
(We see that the conv layers have relatively few parameters: 8 filters × (3×3 weights + 1 bias) = 80 for the first conv; 16 filters × (3×3×8 inputs + 1 bias) = 1,168 for the second conv. The dense layer has more: 400×10 + 10 biases = 4,010, because by the time we flatten we have 5×5×16 = 400 features.)
We can train this model on the MNIST data (for brevity, we use only 1 epoch here):
history <- cnn_model %>% fit(
  train_x, train_y,
  epochs = 1, batch_size = 128,
  validation_split = 0.2
)
Even with 1 epoch on MNIST, the model will likely achieve high accuracy (MNIST is an “easy” dataset for CNNs, often >90% after just one epoch). After training, we evaluate on the test set:
scores <- cnn_model %>% evaluate(test_x, test_y, verbose = 0)
cat("Test accuracy:", scores["accuracy"], "\n")
This should report an accuracy (likely around 0.95 with more training epochs on MNIST). The purpose of this example is to show the structure and code for a CNN.
In a social science context, one would rarely train a CNN from scratch on a small image dataset – instead, as mentioned, one would use transfer learning. In R, the keras package makes it easy to download a pretrained model (e.g., application_resnet50(weights="imagenet") gives a ResNet-50 model pretrained on ImageNet). You can then remove the top layer and add your own output layer for your specific classification, freeze the earlier layers, and fine-tune on your data. The Pew Research project referenced earlier did essentially this: they leveraged a ResNet model pretrained on millions of images, which had already learned to detect general features like edges and textures, and only had to train the final layers on their tens of thousands of labeled images. Their deep learning model achieved around 90% classification accuracy in distinguishing men vs. women in images, illustrating the practicality of CNNs even when a social science team has limited training data – by re-using “knowledge” from large-scale datasets.
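The sketch below outlines that transfer-learning workflow in keras, assuming 224x224 RGB images and a binary label; the data objects train_images and train_labels are placeholders rather than real data, and the head architecture is merely illustrative:
# Sketch: fine-tuning a pretrained ResNet-50 (placeholder data objects)
base <- application_resnet50(weights = "imagenet", include_top = FALSE,
                             input_shape = c(224, 224, 3))
freeze_weights(base)                               # keep the pretrained filters fixed

inputs  <- layer_input(shape = c(224, 224, 3))
outputs <- inputs %>%
  base() %>%
  layer_global_average_pooling_2d() %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')   # e.g., man vs. woman
model <- keras_model(inputs, outputs)

model %>% compile(optimizer = optimizer_adam(),
                  loss = 'binary_crossentropy', metrics = 'accuracy')
# model %>% fit(train_images, train_labels, epochs = 5, validation_split = 0.2)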
Beyond image classification, CNNs have also been used for tasks like text classification via 1D convolutions (which slide over sequences of words or characters). For example, one could build a model to classify tweets as containing hate speech or not. A simplified approach might convert each tweet into a sequence of word embeddings (vectors), then apply a convolutional filter that detects specific phrases or word combinations indicative of hate speech, followed by pooling and a dense layer for classification. Such models have been shown to perform well in natural language processing tasks and can be trained in R using keras or torch (with libraries like text2vec to obtain embeddings). In our context, a full example is beyond scope, but the implementation would mirror the structure we showed (just with 1D conv layers and an embedding layer for text).
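A sketch of such a text CNN in keras is shown below, assuming the tweets have already been tokenized into integer sequences of length 50 (padded_seqs) drawn from a 10,000-word vocabulary with binary labels (labels); all names and sizes are placeholders:
# Sketch: 1D CNN for text classification (placeholder inputs: padded_seqs, labels)
text_cnn <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 64, input_length = 50) %>%  # word embeddings
  layer_conv_1d(filters = 32, kernel_size = 5, activation = 'relu') %>%       # phrase detectors
  layer_global_max_pooling_1d() %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')                              # hate speech vs. not
text_cnn %>% compile(optimizer = optimizer_adam(),
                     loss = 'binary_crossentropy', metrics = 'accuracy')
# text_cnn %>% fit(padded_seqs, labels, epochs = 5, validation_split = 0.2)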
8.4 Recurrent Neural Networks (RNNs) and LSTM
Theory of Sequence Modeling
Many social science data have an inherent sequential or temporal structure: speeches composed of sequences of words, individuals’ life histories composed of sequences of events, longitudinal panel data on voters, or time series of economic indicators. Recurrent Neural Networks (RNNs) are neural architectures designed to handle sequence data by maintaining a form of memory of past inputs. Unlike feedforward nets that assume all inputs are independent, RNNs share parameters across time steps and have connections that form directed cycles (hence “recurrent”), allowing information to persist.
In a basic RNN (often called a “simple RNN” or Elman network), at each time step \(t\) the network takes an input vector \(x_t\) and the previous hidden state \(h_{t-1}\), and produces a new hidden state \(h_t\) as a function:
\[ h_t = f(W \, x_t + U \, h_{t-1} + b), \]
where \(W\) and \(U\) are weight matrices for the input and recurrent connections, and \(f(\cdot)\) is typically a non-linearity like tanh. The hidden state \(h_t\) can be thought of as a summary of all inputs seen up to time \(t\). If we want to produce an output (for example, predicting the next word in a sentence or labeling the sequence), an output \(y_t\) can be computed as
\[ y_t = g(V \, h_t), \]
for some output weight matrix \(V\). The key point is that the same weights \(W, U, V\) are used at every time step, enabling the network to generalize to sequence lengths beyond what it was trained on.
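The base-R sketch below (with arbitrary dimensions and random weights) makes this parameter sharing explicit: the same W, U, and b are applied at every time step while the hidden state carries information forward:
# Minimal simple-RNN forward pass in base R (illustrative weights)
set.seed(1)
p <- 2; k <- 3                          # input size and hidden size
W <- matrix(rnorm(k * p), k, p)         # input weights
U <- matrix(rnorm(k * k), k, k)         # recurrent weights
b <- rnorm(k)
rnn_step <- function(x_t, h_prev) tanh(W %*% x_t + U %*% h_prev + b)
x_seq <- matrix(rnorm(5 * p), nrow = 5) # a sequence of 5 input vectors
h <- rep(0, k)                          # initial hidden state
for (t in 1:nrow(x_seq)) {
  h <- rnn_step(x_seq[t, ], h)          # same weights reused at every step
}
h                                       # final hidden state summarizes the sequence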
However, simple RNNs suffer from vanishing and exploding gradient problems when dealing with long sequences – as we backpropagate the error through many time steps (a process known as Backpropagation Through Time), gradients can shrink or blow up, making it hard to learn long-range dependencies. To address this, more sophisticated recurrent architectures were developed, most notably the Long Short-Term Memory (LSTM) network (Hochreiter & Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) (Cho et al., 2014).
LSTM introduces an internal cell state \(c_t\) and a set of gating mechanisms that regulate information flow: an input gate, a forget gate, and an output gate. These gates (each implemented with a sigmoid activation) determine which information to add to the cell state, what to forget from it, and how much of it to output to the hidden state. In equations, for an LSTM unit one might define:
- Input gate: \(i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)\)
- Forget gate: \(f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)\)
- Output gate: \(o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)\)
- New memory candidate: \(\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)\)
- Updated cell state: \(c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\)
- Hidden state: \(h_t = o_t \odot \tanh(c_t)\)
(where \(\odot\) denotes elementwise multiplication). While the details can be intimidating, the intuitive idea is that the forget gate controls what information from the past to discard, the input gate controls what new information to store in the cell, and the output gate controls what information from the cell to send out to the next step. The cell state \(c_t\) acts as a conveyor of long-term information with linear interactions (just additions and multiplications by gates), which helps preserve gradients. LSTMs are thus capable of maintaining long-term dependencies – their internal design explicitly tackles the vanishing gradient problem by allowing gradients to flow unchanged where needed. In intuitive terms, an LSTM can learn to remember or forget. For example, if analyzing text, an LSTM might learn to “remember” a negation word like “not” and carry that influence until it encounters the word being negated, then “forget” it thereafter.
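The gate equations translate almost line by line into the base-R sketch below; the parameter matrices are bundled in a list p, and all dimensions and values are purely illustrative:
# One LSTM step in base R, mirroring the gate equations (illustrative parameters)
lstm_step <- function(x_t, h_prev, c_prev, p) {
  i <- plogis(p$W_i %*% x_t + p$U_i %*% h_prev + p$b_i)       # input gate
  f <- plogis(p$W_f %*% x_t + p$U_f %*% h_prev + p$b_f)       # forget gate
  o <- plogis(p$W_o %*% x_t + p$U_o %*% h_prev + p$b_o)       # output gate
  c_tilde <- tanh(p$W_c %*% x_t + p$U_c %*% h_prev + p$b_c)   # candidate memory
  c_t <- f * c_prev + i * c_tilde                             # updated cell state
  h_t <- o * tanh(c_t)                                        # hidden state
  list(h = h_t, c = c_t)
}
set.seed(1)
k <- 2; d <- 1                                                # hidden size and input size
mk <- function(r, c) matrix(rnorm(r * c), r, c)
p <- list(W_i = mk(k, d), U_i = mk(k, k), b_i = rnorm(k),
          W_f = mk(k, d), U_f = mk(k, k), b_f = rnorm(k),
          W_o = mk(k, d), U_o = mk(k, k), b_o = rnorm(k),
          W_c = mk(k, d), U_c = mk(k, k), b_c = rnorm(k))
lstm_step(x_t = 0.5, h_prev = rep(0, k), c_prev = rep(0, k), p)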
Diagram of a single LSTM cell, which maintains an internal cell state \(c_t\) and uses input (\(\sigma\)), forget (\(\sigma\)), and output (\(\sigma\)) gates (orange = learned neural layers, yellow = pointwise operations) to regulate information flow.
Sequence-to-Sequence and Other Variants: RNNs and LSTMs can be used not only for one-output-per-time-step tasks (like language modeling where you predict the next word given previous words), but also for sequence-to-sequence tasks. In sequence-to-sequence models (Seq2Seq), one RNN (the encoder) processes an input sequence into a final hidden state, and then another RNN (the decoder) generates an output sequence from that state. This is used in machine translation (encode a sentence in French, decode a sentence in English) and could likewise be used in social science for tasks like open-ended survey response summarization or modeling event sequences (encode the past trajectory of a country’s economic indicators, decode the future trajectory).
RNNs in Social Science: Applications include:
- Text analysis: RNNs (especially LSTMs or GRUs) have been widely used for text classification, sentiment analysis, or more complex tasks like stance detection in political texts. An LSTM can capture the sequence of words, which is important because word order matters for meaning. For instance, “X supports Y” versus “Y supports X” convey different relationships; a bag-of-words model might miss this distinction, but an RNN can learn it. Researchers have applied LSTMs to legislative speech transcripts, news articles, and social media posts to classify topics or sentiment while accounting for syntax and context over sentences.
- Event history analysis: One could use RNNs to model sequences of events (e.g., a sequence of protest events in different cities, or a sequence of legislative actions over time). The RNN can potentially pick up patterns like “after event A, event B tends to occur within 3 days” or detect seasonal trends in event data. For example, a study might model a country’s monthly conflict events as a sequence and use an LSTM to forecast future conflict risk from the history.
- Time-series prediction: In economics, demography, or sociology, we often have time series data (e.g., monthly unemployment rates, yearly population counts, daily counts of COVID cases). RNNs or LSTMs can be trained to forecast these series, possibly capturing non-linear patterns or regime shifts that traditional ARIMA models might not. They can also incorporate multiple input series (multivariate time series) and learn complex joint dynamics.
- Panel data: Panel data (repeated observations of many units over time) can also be approached with RNNs by treating each unit’s data as a sequence. For example, an LSTM could be used to predict an individual’s future health status from their longitudinal medical history, or to predict a country’s future GDP given its yearly economic indicators, capturing unit-specific temporal dependencies. (There is also research on sequence embedding where each unit’s sequence is converted to a fixed-length vector via an RNN encoder, which can then be used as features in downstream analyses.)
It is worth noting that in recent years, Transformer models have overtaken RNNs in many sequence modeling tasks (especially in NLP) due to their efficiency in capturing long-range dependencies via self-attention. Transformers (Vaswani et al., 2017) dispense with recurrence entirely and instead use parallelizable attention mechanisms to achieve superior performance on language tasks. While they are beyond our current scope, they represent an advanced tool that social scientists may explore for text analysis or other sequence tasks (for example, using pretrained language models like BERT or GPT to obtain rich text embeddings for survey responses). That said, RNNs and LSTMs remain useful and are easier to train on smaller datasets, so we focus on them here as foundational tools.
Example: Sequence Prediction with LSTM in R
For a concrete example, we will use an LSTM to model a simple time series. Consider a scenario in social science where we have a monthly indicator (say, an index of social unrest intensity) and we want to predict future values based on past values. We’ll simulate a pattern (for illustration, a sine wave with noise could represent a seasonal oscillation in unrest).
set.seed(123)
T <- 200   # length of series
t <- 1:T
y <- sin(0.1 * t) + rnorm(T, sd = 0.1)   # base sine wave plus noise

# Prepare training sequences for LSTM
timesteps <- 10
X <- array(0, dim = c(T - timesteps, timesteps, 1))
Y <- array(0, dim = c(T - timesteps))
for (i in 1:(T - timesteps)) {
  X[i, , 1] <- y[i:(i + timesteps - 1)]
  Y[i] <- y[i + timesteps]   # next value to predict
}

# Split into train and test (e.g., first 160 for training, last 30 for testing)
train_size <- 160
X_train <- X[1:train_size, , , drop = FALSE]   # keep the 3D shape expected by the LSTM
Y_train <- Y[1:train_size]
X_test <- X[(train_size + 1):(T - timesteps), , , drop = FALSE]
Y_test <- Y[(train_size + 1):(T - timesteps)]
We created overlapping sequences of length 10 (each sequence is the past 10 time points) and the target is the next value. Now we define an LSTM model to predict the next value from the past 10:
lstm_model <- keras_model_sequential() %>%
  layer_lstm(units = 16, input_shape = c(timesteps, 1)) %>%
  layer_dense(units = 1)

lstm_model %>% compile(
  optimizer = 'adam',
  loss = 'mse'
)

lstm_model %>% summary()
The model has an LSTM layer with 16 units, followed by a dense layer. The summary will show something like:
________________________________________________________________
Layer (type) Output Shape Param #
================================================================
lstm_1 (LSTM) (None, 16) 1152
dense_4 (Dense) (None, 1) 17
================================================================
Total params: 1169
Trainable params: 1169
We see 1,169 parameters, consistent with the formula for LSTM parameters: for an LSTM with \(k\) units and input size \(p\), the parameter count is \(4k(k + p + 1)\) (because of the four sets of weights and biases for the input, forget, output, and cell pathways). Here \(k=16\) and \(p=1\), so \(4 \times 16 \times (16 + 1 + 1) = 4 \times 16 \times 18 = 1{,}152\) for the LSTM layer, plus 17 for the dense layer.
Now we train the model:
history <- lstm_model %>% fit(
  X_train, Y_train,
  epochs = 30, batch_size = 16,
  validation_split = 0.1, verbose = 0
)
After training, let’s evaluate its performance on the test set and inspect a few predictions vs actual:
preds <- lstm_model %>% predict(X_test)
# Compare first 5 predictions with actual values
print(round(head(cbind("Predicted" = preds[, 1], "Actual" = Y_test), 5), 3))
If the model has learned the pattern, the predictions should roughly follow the sine wave trend. Even if not perfect (due to noise), the LSTM likely captures the oscillation better than a trivial baseline. In a real social science application, this could correspond to forecasting something like monthly protest counts given the past 10 months of data, implicitly capturing temporal dependencies and seasonality.
Remark on Interpretation: While the LSTM can model such sequences, interpreting what exactly it has learned (which patterns in the sequence trigger an increase or decrease) is not straightforward. There are techniques such as examining the learned cell states or using sequence saliency methods (to see which parts of the input sequence most influenced the prediction), but these are more specialized. For many pure prediction tasks, a black-box forecast might be acceptable. However, if policy decisions depend on understanding why the model predicts a surge in unrest, one might need to combine these models with more interpretable approaches or incorporate domain knowledge to validate the patterns detected.
In R, one might also consider the torch package for sequence models, which provides an R interface to the PyTorch library and can be used to build custom RNNs, LSTMs, or Transformers with more low-level control. The high-level Keras API, as used above, is often sufficient for many applications, but torch can be useful for advanced research requiring custom architectures.
8.5 Training Neural Networks: Key Concepts
Having introduced the main architectures (MLP, CNN, RNN/LSTM), we now turn to how neural networks are trained and optimized, and how to ensure they generalize well. Training a neural network means finding parameters (weights and biases) that minimize a certain loss function on the training data. This is a high-dimensional optimization problem, typically solved by gradient-based methods. We will discuss:
- Data Preprocessing – preparing inputs for effective training.
- Loss Functions – what objective we optimize.
- Backpropagation and Gradient Descent – how we optimize the objective.
- Optimization Algorithms – variants like SGD, Momentum, Adam.
- Regularization Techniques – methods to prevent overfitting (dropout, L2, etc.).
- Monitoring and Tuning – using validation sets, avoiding overfitting, adjusting hyperparameters.
Data Preprocessing
Neural networks can be sensitive to the scale and encoding of input data. It is generally important to standardize or normalize features before feeding them into the network. For example, continuous variables are often standardized to mean 0 and standard deviation 1, or scaled to [0,1]. If features are on very different scales, the network may have trouble learning (the gradients for one feature might dominate). In our earlier examples, we normalized images to [0,1], and one would likewise scale numerical covariates in tabular data. Categorical variables need to be encoded – typically via one-hot encoding if unordered, or possibly via embedding layers if there are many categories and we want the model to learn a dense representation for each.
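For tabular data, a short sketch of this kind of preprocessing might look as follows, assuming a data frame df with numeric columns age and income and an unordered factor region (all placeholder names); in practice the scaling means and standard deviations should be computed on the training split only and then reused for the test data:
# Sketch: standardize numeric features and one-hot encode a factor (placeholder df)
df$age_z    <- scale(df$age)                               # mean 0, sd 1
df$income_z <- scale(df$income)
region_dummies <- model.matrix(~ region - 1, data = df)    # one-hot encoding
X <- cbind(df$age_z, df$income_z, region_dummies)          # numeric matrix ready for keras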
For text data, preprocessing involves tokenization (breaking text into words or subwords), handling of vocabulary (perhaps limiting to the top N most frequent words or using pre-trained word embeddings), and padding/truncating sequences to a fixed length for batch processing. For network or spatial data, one might need to construct adjacency matrices or coordinate grids. In all cases, careful preprocessing is crucial; poor handling can significantly degrade performance or make training unstable.
Loss Functions and Evaluation Metrics
The loss function (also called cost function) quantifies the error of the model’s predictions against the true values. The choice of loss depends on the task:
- For binary classification, the typical loss is binary cross-entropy (also known as log loss). If \(\hat{p}_i\) is the predicted probability for instance \(i\) belonging to class 1 (and \(y_i \in \{0,1\}\) is the true label), the binary cross-entropy loss for that instance is \(-[\,y_i \log \hat{p}_i + (1-y_i) \log(1-\hat{p}_i)\,]\). The model training aims to minimize this, which pushes \(\hat{p}_i\) close to 1 for \(y_i=1\) and close to 0 for \(y_i=0\). (We used this loss in the MLP example via binary_crossentropy.)
- For multi-class classification, we generalize to categorical cross-entropy. If the model outputs a probability distribution \(\hat{\mathbf{p}}_i\) across classes for instance \(i\), and the true label is a one-hot vector \(\mathbf{y}_i\), then the loss is \(- \sum_{c} y_{i,c} \log \hat{p}_{i,c}\). Typically we use a softmax on the output layer so that \(\hat{\mathbf{p}}_i\) sums to 1.
- For regression (predicting a continuous outcome), a common loss is the mean squared error (MSE) or sometimes mean absolute error (MAE). MSE, \((\hat{y}_i - y_i)^2\), has nice mathematical properties (differentiable everywhere) and is related to assuming Gaussian noise on the output; MAE, \(|\hat{y}_i - y_i|\), is more robust to outliers (but the absolute value is less smooth at 0 for optimization).
- There are more specialized losses for specific purposes: e.g., hinge loss for SVM-like training, cosine proximity for certain similarity tasks, or custom losses for imbalanced data (like adding class weights or using focal loss in object detection tasks).
It’s important to distinguish the loss used for training from the evaluation metrics we care about. For instance, we might train a model with cross-entropy loss but evaluate it with accuracy, F1-score, AUC, etc., to judge its performance. The training process directly optimizes the loss, not necessarily the metric (though often lowering the loss improves the metric). Sometimes there is a trade-off; for example, accuracy is not sensitive to class probabilities unless they cross 0.5, whereas cross-entropy heavily penalizes overconfident wrong answers. Thus, one might get higher accuracy but still have room to improve calibration as reflected in cross-entropy.
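This distinction can be seen by computing both quantities by hand. In the sketch below (hypothetical predicted probabilities), the two prediction vectors have identical accuracy, but the overconfident wrong prediction is penalized much more heavily by cross-entropy:
# Same accuracy, different cross-entropy
y      <- c(1, 0, 1, 0)                  # true labels
p_mild <- c(0.7, 0.4, 0.6, 0.55)         # wrong on the last case, but not confident
p_bold <- c(0.7, 0.4, 0.6, 0.99)         # wrong on the last case, very confident
cross_entropy <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))
accuracy      <- function(y, p) mean((p > 0.5) == y)
c(acc_mild = accuracy(y, p_mild), acc_bold = accuracy(y, p_bold))           # both 0.75
c(ce_mild = cross_entropy(y, p_mild), ce_bold = cross_entropy(y, p_bold))   # second is much larger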
Backpropagation and Gradient Descent
Backpropagation is the core algorithm for training neural networks. It is a method for computing the gradient of the loss function with respect to all the network’s parameters efficiently, by propagating the error backward through the network. Conceptually:
- Forward pass: Take an input \(x\) and compute the network’s output \(\hat{y}\) and loss \(L(\hat{y}, y)\) for the true target \(y\).
- Backward pass: Compute gradients of the loss with respect to the output (using the chain rule of calculus), then with respect to the parameters of the last layer, then the previous layer, and so on, moving backwards. Each layer’s gradients are computed based on the gradients from the layer above (its output).
- Update weights: Use these gradients in an optimization step to adjust the parameters in the direction that most reduces the loss.
For a single weight \(w\) in the network, backprop gives us \(\frac{\partial L}{\partial w}\), the direction in which changing \(w\) would increase or decrease the loss. Once we have these gradients, we use an optimization algorithm (like gradient descent) to update the weights in the opposite direction of the gradient (to reduce the loss).
In practice, we use stochastic gradient descent (SGD) or its variants. Rather than computing gradients on the entire dataset (which would be standard gradient descent), we compute on mini-batches of data. For example, with a batch size of 32, we take 32 examples, do forward passes, compute the average loss and gradients, then update weights, and move to the next batch. This stochastic approach introduces noise into the gradient estimates but is much faster and often helps escape shallow local minima. One epoch is one full pass through the training data in mini-batches.
Mathematically, a weight update with vanilla SGD looks like:
\[ w := w - \eta \, \frac{\partial L}{\partial w}, \]
where \(\eta\) is the learning rate, a small positive scalar that controls the step size. The learning rate is a crucial hyperparameter – too large and training may diverge or oscillate, too small and convergence will be very slow or get stuck in a suboptimal point.
Batch size is another important hyperparameter: smaller batches give noisier gradients but can generalize better (and use less memory), while larger batches give more precise gradient estimates but require more memory and can sometimes get stuck in sharp minima. A common heuristic is to use the largest batch size that fits in GPU memory (for efficiency), but in some cases small batches work better. Practitioners often try a few values (32, 64, 128, etc.) to see what yields good validation performance.
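To make these mechanics concrete, the following base-R sketch (simulated data, purely illustrative) runs mini-batch SGD for a simple logistic regression, applying exactly the update rule above once per mini-batch:
# Mini-batch SGD for logistic regression: w := w - eta * dL/dw
set.seed(1)
n <- 500
x <- cbind(1, rnorm(n))                          # intercept plus one feature
y <- rbinom(n, 1, plogis(1 - 2 * x[, 2]))        # true coefficients: 1 and -2
w <- c(0, 0)                                     # initialize weights
eta <- 0.1                                       # learning rate
batch_size <- 32
for (epoch in 1:20) {
  idx <- sample(n)                               # shuffle each epoch
  for (b in seq(1, n, by = batch_size)) {
    rows <- idx[b:min(b + batch_size - 1, n)]
    p    <- plogis(x[rows, ] %*% w)              # forward pass on the mini-batch
    grad <- t(x[rows, ]) %*% (p - y[rows]) / length(rows)   # gradient of cross-entropy loss
    w    <- w - eta * grad                       # SGD update
  }
}
round(w, 2)                                      # should approach the true coefficients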
Advanced Optimization Algorithms
Several enhancements to basic SGD have been developed to improve convergence speed and stability:
- Momentum: This technique helps accelerate SGD in the right directions and dampen oscillations in the wrong directions. It does so by maintaining a velocity vector that is an exponentially decaying average of past gradients. The update becomes: \(v := \alpha v + \eta \nabla_w L\), and then \(w := w - v\), where \(\alpha \in [0,1)\) is the momentum coefficient (e.g., 0.9). Momentum accumulates gradient contributions in persistent directions, effectively allowing the optimizer to build up speed on gentle slopes and not get stuck oscillating on steep but narrow ravines (a common issue when gradients in one dimension are much larger than in another).
- Adaptive Learning Rates: Optimizers like AdaGrad, RMSProp, Adam etc., adjust the learning rate for each parameter individually, based on the history of gradients for that parameter. AdaGrad (2011) accumulates the squared gradients for each parameter and divides the learning rate by the sqrt of this accumulated sum. This means parameters that have large gradients (and thus large accumulated squared gradients) get their effective learning rate reduced over time, which is good for dealing with sparse features but can lead to excessive decay of learning rates. RMSProp (Hinton, circa 2012) modifies AdaGrad by using a moving average of squared gradients (to forget very old gradients), maintaining a per-parameter learning rate that adapts to recent gradient magnitudes. Adam (Adaptive Moment Estimation, Kingma & Ba, 2015) combines the ideas of momentum and RMSProp – it keeps an exponentially decaying average of past gradients (like momentum) and of past squared gradients (like RMSProp). Adam computes parameter updates as \(m_t / (\sqrt{v_t} + \epsilon)\) where \(m_t\) is the first moment (gradient mean) and \(v_t\) the second moment (uncentered variance), with bias correction for initial timesteps. Adam has become one of the most popular optimizers due to its generally good performance and ease of use (it often requires less tuning of the learning rate compared to plain SGD).
In R’s keras or torch, one can specify optimizer_adam() or others to use these algorithms. A typical workflow is to start with Adam (with default settings), as it usually works out of the box, and perhaps later experiment with SGD+momentum for possibly better generalization or to reproduce results from the literature. Some recent research suggests that for very large datasets, plain SGD with momentum might yield slightly better generalization than Adam (which can sometimes overfit), but in the moderate-data regime of many social science problems, Adam’s robustness is a boon.
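For example, switching between these optimizers in keras is just a matter of changing the compile() call; the sketch below uses a generic model object and typical starting values (the learning-rate argument is named learning_rate in recent keras versions, lr in older ones):
# Same model, two optimizer choices (illustrative settings)
model %>% compile(optimizer = optimizer_adam(),       # adaptive; usually works with defaults
                  loss = 'binary_crossentropy', metrics = 'accuracy')
model %>% compile(optimizer = optimizer_sgd(learning_rate = 0.01, momentum = 0.9),
                  loss = 'binary_crossentropy', metrics = 'accuracy')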
Regularization and Overfitting
Neural networks are highly flexible models with a potentially huge number of parameters. This flexibility means they can easily overfit – i.e., memorize the training data – especially when data are limited. Regularization refers to techniques that constrain the model to improve generalization performance on unseen data.
Key regularization techniques in neural networks include:
- Penalty Terms (Weight Decay): The most common is \(L_2\) regularization, which adds a term \(\lambda \sum w^2\) to the loss (summing over all weights), discouraging large weights. This is equivalent to a Gaussian prior on weights and is known as weight decay in the neural network context. In practice, one specifies a weight decay parameter \(\lambda\). A smaller \(\lambda\) means little regularization; a larger \(\lambda\) forces weights towards 0 (simpler models). Weight decay tends to make the network weights smaller in magnitude, which often improves generalization by keeping the model closer to linear in behavior.
- Dropout: Dropout is a popular and very effective regularization trick introduced by Srivastava et al. (2014). The idea is to randomly “drop out” (set to zero) a fraction of the units in a layer during each training iteration. For example, with dropout rate 0.5, each hidden neuron is independently dropped with probability 0.5 at each update. This prevents the network from relying too much on any single feature or from co-adapting neurons too tightly, essentially forcing a form of ensemble of sub-networks. At test time, all units are used but their activations are scaled down by the dropout rate (to account for the missing ones during training). Dropout often significantly reduces overfitting and has become standard, especially in fully connected layers of networks.
- Early Stopping: This is a simple yet powerful regularization approach: monitor performance on a validation set during training and stop training when performance on validation data stops improving (or starts worsening). The idea is that at the point of minimal validation loss, the model is optimally generalized, whereas training longer would just make it fit noise in the training set. The weights at that point are then taken as the final model. Early stopping essentially treats the number of training epochs as a hyperparameter to tune (automatically, during one run). In Keras, one can use callback_early_stopping(patience=...) to implement this (see the combined sketch after this list).
- Batch Normalization: Though primarily introduced to help optimization by normalizing layer inputs, batch normalization (Ioffe & Szegedy, 2015) can also have a regularizing effect. It reduces internal covariate shift by normalizing the activations of each layer for each mini-batch, then scaling and shifting them by learned parameters. This can allow higher learning rates and often reduces the need for dropout in some architectures (e.g., CNNs). Batch norm adds a bit of noise due to mini-batch estimation, acting as a regularizer.
- Data Augmentation: Especially relevant for image and text data, augmenting the training data with label-preserving transformations acts as regularization by injecting more variety. For images, this could be random rotations, flips, crops, color jitters (commonly used in computer vision to expand training sets). For text, one might replace words with synonyms, or slightly perturb sentences (though one must be careful to preserve meaning). In social science contexts, augmentation can sometimes be domain-specific (e.g., adding noise to economic indicators to simulate measurement error or using bootstrapping on small datasets).
- Others: There are other approaches like \(L_1\) regularization (which encourages sparsity in weights, leading some weights to exactly zero), max-norm constraints (bounding the norm of incoming weights for each neuron), early removal of overfit neurons (pruning), or more recent techniques like DropConnect (dropout on weights rather than activations), label smoothing (smoothing the hard 0/1 labels to soft targets to prevent overconfidence), and so on. But the ones above are usually sufficient in practice.
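The sketch below combines several of these techniques – an L2 penalty, dropout, and early stopping – in a single keras model, reusing the simulated train_data from the MLP example; the layer sizes, penalty, and dropout rate are illustrative rather than recommended values:
# Sketch: L2 weight decay, dropout, and early stopping combined
reg_model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = 'relu', input_shape = 2,
              kernel_regularizer = regularizer_l2(0.001)) %>%   # L2 penalty on weights
  layer_dropout(rate = 0.5) %>%                                 # randomly drop half the units during training
  layer_dense(units = 1, activation = 'sigmoid')
reg_model %>% compile(optimizer = optimizer_adam(),
                      loss = 'binary_crossentropy', metrics = 'accuracy')
reg_model %>% fit(
  as.matrix(train_data[, c("X1", "X2")]), as.numeric(train_data$Y) - 1,
  epochs = 100, batch_size = 32, validation_split = 0.2, verbose = 0,
  callbacks = list(callback_early_stopping(patience = 5, restore_best_weights = TRUE))
)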
The balance between underfitting and overfitting is often visualized by plotting the training and validation loss over epochs. Initially, both go down as the model fits general patterns. At some point, the validation loss reaches a minimum and then starts to increase even as training loss keeps decreasing – that’s classic overfitting setting in. Regularization aims to delay or reduce that gap. With strong regularization, the model might underfit (both losses high, or validation never goes down much), so one must tune the amount.
Model Training and Hyperparameter Tuning
Training neural networks is as much an art as a science. One often needs to experiment with:
- Architecture hyperparameters: number of layers, number of units per layer, filter sizes, etc.
- Training hyperparameters: learning rate (most crucial), batch size, number of epochs, choice of optimizer, learning rate schedules (reducing the learning rate after plateauing, etc.).
- Regularization hyperparameters: dropout rate, weight decay coefficient, etc.
- Initialization: Modern libraries handle weight initialization well (e.g., Glorot/Xavier initialization for symmetric activations), but occasionally one might adjust initialization if using certain activations like sigmoid (to avoid saturation at start).
It is essential to use a validation set to tune these (or techniques like cross-validation if data are extremely scarce, though cross-validating deep nets is computationally heavy). In R's keras, one can specify validation_split or supply a validation_data argument to the fit() function to automatically track validation metrics. The keras API also provides callbacks, such as callback_early_stopping() to implement early stopping and callback_reduce_lr_on_plateau() to reduce the learning rate if validation loss stalls, which help automate some of this tuning.
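As a minimal sketch of how these pieces fit together (assuming a compiled model mlp_model and numeric training arrays x_train and y_train, which are placeholder names here):
# Fit with a validation split and tuning callbacks
# (`mlp_model`, `x_train`, `y_train` are placeholders for objects created earlier)
history <- mlp_model %>% fit(
  x_train, y_train,
  epochs = 100,
  batch_size = 32,
  validation_split = 0.2,   # hold out 20% of the training data to track validation metrics
  callbacks = list(
    callback_early_stopping(monitor = "val_loss", patience = 10,
                            restore_best_weights = TRUE),        # stop when val_loss stops improving
    callback_reduce_lr_on_plateau(monitor = "val_loss",
                                  factor = 0.5, patience = 5)    # halve the learning rate on plateaus
  )
)
plot(history)   # inspect training vs. validation curves over epochs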
A common strategy is:
- Start with a relatively simple model and get it to “learn something” (ensure the training loss decreases and it beats a trivial baseline on validation).
- If underfitting (validation and training loss both high), increase capacity (more layers/units) or train longer or adjust learning rate.
- If overfitting (training loss much lower than validation), add regularization (dropout, etc.) or reduce capacity.
- Adjust the learning rate carefully: it often needs to be tuned on a log scale (e.g., 0.1, 0.01, 0.001, …). Sometimes a learning rate that is too low will cause extremely slow convergence, giving the impression of underfitting, whereas a slightly higher rate would converge nicely.
- Use learning rate schedules: often you can start with a relatively high learning rate and reduce it as training progresses, either manually or via a schedule such as exponential decay or step decay when validation performance plateaus (a minimal example follows this list).
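If a custom schedule is preferred over the plateau-based callback, a simple decay rule can be passed to callback_learning_rate_scheduler() (the cutoff epoch and decay factor below are arbitrary assumptions):
# Manual schedule sketch: keep the initial rate for 10 epochs, then shrink it by 10% per epoch
lr_schedule <- function(epoch, lr) {
  if (epoch < 10) lr else lr * 0.9
}
# Pass it to fit() via: callbacks = list(callback_learning_rate_scheduler(lr_schedule))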
Monitoring metrics like accuracy alongside loss can also be informative: sometimes the loss might decrease while accuracy plateaus (indicating the model is getting more confident on the same predictions), or vice versa.
In summary, training a neural network is an iterative process of configure → train → evaluate → adjust. Modern deep learning frameworks significantly lower the barrier to trying different configurations quickly, which is a big reason for the rapid progress in the field. For social scientists, this means one can be empirically guided – try a network, see how it performs, and iterate – much as one might do with choosing specifications in a regression (though the “specifications” space is much larger for a neural net!). The final model chosen should ideally be the one that performs best on held-out data. As a sanity check, one should also compare it with simpler models; if a straightforward logistic regression or random forest is performing just as well, the added complexity of a neural net might not be warranted.
8.6 Interpretability and Explainability
One of the major concerns in applying neural networks to social science problems is the interpretability of the models. Social scientists are typically interested not only in making accurate predictions, but also in understanding the relationships between variables, uncovering latent patterns, and providing explanations that are convincing to stakeholders or policymakers. Traditional statistical models (like linear regression or decision trees) offer transparent relationships – e.g., coefficients or splits that can be directly interpreted. Neural networks, in contrast, are often criticized as “black boxes”: their predictions result from complex, layered computations that do not yield simple, direct explanations.
However, a growing field of eXplainable AI (XAI) has developed tools to interpret and explain neural network predictions. Here we outline approaches to interpretability that can make neural network results more transparent in a social science context:
- Feature Importance and Attribution: Methods like SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) provide ways to estimate the importance of input features for a given prediction. SHAP values are based on cooperative game theory (the concept of Shapley values) and represent the contribution of each feature to the difference between the model’s prediction and a baseline expectation. LIME, on the other hand, fits a simple interpretable model (like a sparse linear model) locally around the prediction to approximate the neural network’s behavior. For example, if a neural net predicts that a certain individual will have a high income, SHAP values could tell us that the individual’s education level and years of experience were strong positive contributors, while the local unemployment rate was a negative contributor, aligning the explanation with domain expectations. LIME could show a small linear model for that individual where, say, education=Master’s contributes +15% chance of high income, experience > 5 years contributes +10%, and high local unemployment contributes -5%, etc., illustrating in simple terms why the neural net made its prediction.
- Saliency and Input Sensitivity: For image or text models, one can compute saliency maps or attention weights that highlight what parts of the input the network focused on. In images, saliency maps (essentially the gradient of the output w.r.t. input pixels) can show which regions of an image influenced the prediction most. For example, a CNN predicting “protest” vs “non-protest” in an image might focus on areas with crowds or protest signs. In text, some sequence models (and certainly Transformers with attention mechanisms) allow extraction of attention weights to see which words were most attended to for a classification. If an LSTM classified a speech as containing hate speech, we might examine which words in the speech contributed heavily to that decision – perhaps identifying specific derogatory terms. This can be important for both understanding and justifying the model’s decisions (and for identifying when the model might be keying off of problematic biases, such as associating certain topics with certain groups unfairly).
- Interpretable Model Surrogates: Another approach is to train an interpretable surrogate model on the predictions of the neural network. For instance, one could use decision trees or rule-based models to approximate the behavior of the neural network in certain regions of the feature space. This is related to LIME but can be done globally: e.g., train a decision tree on the dataset where the “labels” are the neural network’s predictions. The tree might then provide a set of rules that roughly mimic the network. Caution is needed, since an approximation may not faithfully represent the true model in all cases, but it can sometimes reveal broad patterns. For example, a surrogate tree might yield rules like “IF (income > $50k) AND (age < 30) THEN predict high credit score” which could approximate what a neural net has learned, even if the net itself is not a decision tree (a short code sketch of a global surrogate follows this list).
- Network Dissection and Concept Analysis: Research by Bau et al. (2017) on network dissection shows that some neurons in CNNs learn to detect human-interpretable concepts (like “tree” or “door”) even without supervision on those concepts. They developed a method to systematically test each hidden unit in a vision model against a large set of concepts (objects, textures, colors) and found, for example, units that reliably activate for images of doors, or units for certain textures. This kind of analysis can be done to see if particular hidden units correspond to meaningful factors. In social science applications, one could imagine analyzing a network trained on survey data to see if any neuron’s activation correlates strongly with known indices (like an SES index or an ideology score), which might indicate the net internally constructed a similar concept.
- Causal Explainability: Recently, there is interest in going beyond correlational explanations to more causal ones. For example, counterfactual explanations try to answer: “what would need to change in this input for the model’s prediction to change in a desired way?” In a recidivism risk model, a counterfactual explanation might be: “If this individual had one fewer prior offense, the predicted risk score would drop below the threshold for detention.” This gives a more actionable explanation (it points to a change in input that alters output) and connects with ideas of fairness and algorithmic recourse. Some methods formulate this as an optimization problem: find the minimal change to the input features that yields a different outcome from the model.
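In its simplest form, a counterfactual explanation can be framed as searching for the closest input \(x'\) whose prediction differs from that of the original \(x\): minimize a distance \(d(x, x')\) subject to \(f(x') \neq f(x)\). In practice this constraint is usually relaxed into a penalized objective, roughly \(\lambda\,(f(x') - y^{*})^2 + d(x, x')\) for a desired outcome \(y^{*}\), which can be minimized by gradient descent; this is a common formulation in the counterfactual-explanation literature rather than a single canonical method.
Returning to the surrogate idea above, the following is a minimal sketch of a global surrogate in R. It assumes the MLP from earlier (mlp_model) and a data frame train_df holding the original features; these names, and the depth limit, are placeholders to adapt to your own objects:
# Global surrogate sketch: fit a shallow decision tree to mimic the network's predictions
# (`mlp_model` and `train_df` are placeholder names for objects created earlier)
library(rpart)
nn_class <- as.numeric(predict(mlp_model, as.matrix(train_df)) > 0.5)  # network's hard class predictions
surrogate <- rpart(factor(nn_class) ~ ., data = train_df, method = "class",
                   control = rpart.control(maxdepth = 3))              # shallow tree stays readable
print(surrogate)   # human-readable splits that approximate the network
# Fidelity check: how often does the surrogate agree with the network?
mean(predict(surrogate, train_df, type = "class") == factor(nn_class))
The depth limit trades fidelity for readability; the final line reports how often the surrogate reproduces the network's predictions, which should be checked before interpreting the surrogate's rules.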
In the social sciences, the need for transparency is not just academic – it’s often ethical or legal. For instance, if an algorithm is used in criminal justice or in allocation of social services, one must often provide reasons for decisions and ensure there is no hidden bias against protected groups. Neural networks themselves do not inherently avoid bias – in fact, if the training data reflect societal biases, the model can perpetuate them. Techniques like feature importance can help identify if certain sensitive attributes (like race or gender, or proxies thereof) are unduly influencing predictions, which might prompt retraining the model with fairness constraints or interpreting results with caution. In some cases, interpretable models (or post-hoc explanations) can uncover problematic patterns that were not apparent during training. For example, one might discover via LIME or SHAP that a resume-screening network was effectively using an applicant’s address as a proxy for race in decisions (because location strongly correlates with demographics in the data) – a red flag that would need addressing.
It is also possible to impose some interpretability at training time. For example, one could use a smaller network or add penalties that encourage sparse activations or use attention mechanisms that are inherently interpretable (in some cases, attention weights can be interpreted as a measure of importance of each part of the input). Another approach is building hybrid models – e.g., use a neural network to generate features or scores and then feed those into a traditional regression model, thereby capturing non-linearities in the feature generation but retaining an interpretable final model. An example of this might be using a CNN to score images of neighborhoods for “disorder level” and then using that score as a variable in a regression predicting crime rates. The regression is interpretable, and the CNN’s output can be interpreted as a single meaningful index (even if internally the CNN is complex).
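As a minimal sketch of that hybrid design in R (the data frame df and the variables crime_rate, disorder_score, median_income, and population_density are illustrative placeholders, with disorder_score assumed to have been produced beforehand by the CNN):
# Hybrid model sketch: a CNN-derived index enters an ordinary, interpretable regression
# (`df` and all variable names are placeholders; `disorder_score` comes from the CNN step)
hybrid_fit <- lm(crime_rate ~ disorder_score + median_income + population_density, data = df)
summary(hybrid_fit)   # coefficients on the learned index and the controls are directly interpretable
The regression output can then be reported and interpreted in the usual way, while the CNN remains a measurement device whose single output enters the substantive model.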
To illustrate one technique in R: we can use the lime package to explain a neural network's predictions on tabular data. For brevity, here is a conceptual mini-example using the MLP we trained earlier (treating it as a black box):
# Install lime if not already installed
# install.packages("lime")
library(lime)

# Our model expects a numeric matrix input, so we define model_type and
# predict_model methods for LIME
model_type.keras <- function(x, ...) 'classification'
predict_model.keras <- function(x, newdata, type, ...) {
  # newdata will be a data.frame; convert to matrix and get predictions
  preds <- x %>% predict(as.matrix(newdata))
  # Return a data frame of class probabilities (two classes: 0 and 1)
  data.frame(`0` = 1 - preds[, 1], `1` = preds[, 1])
}

# Create a lime explainer using training data (excluding the label column)
explainer <- lime(train_data[, c("X1", "X2")], mlp_model)
# Explain predictions for the first 5 test cases
explanation <- explain(test_data[1:5, c("X1", "X2")], explainer,
                       n_labels = 1, n_features = 2)
print(explanation[, 1:9])
This will output something like:
case label label_prob model_r2 model_intercept model_prediction feature feature_weight
1 1 1 0.992 0.75 0.500 0.950 X2<=0 0.445
2 1 1 0.992 0.75 0.500 0.950 X1<=0 0.005
3 2 0 0.998 0.67 0.500 0.020 X2<=0 0.367
4 2 0 0.998 0.67 0.500 0.020 X1 > 0 -0.847
...
This indicates, for example, that in Case 1 (with true label 1, predicted probability ~0.992 of class 1), LIME's local model had an \(R^2\) of 0.75 and predicted 0.950 for class 1 based on two rules: X2 <= 0 contributed +0.445 towards class 1 and X1 <= 0 contributed +0.005, which together with the intercept of 0.5 give the local prediction of 0.95. In other words, in that case both features being negative made the neural net lean towards class 1 (which matches the XOR pattern logic in our simulation: both negative means output 1). For Case 2 (true label 0, predicted probability ~0.002 of class 1), LIME shows X2 <= 0 contributing +0.367 towards class 1 but X1 > 0 contributing -0.847; with the intercept of 0.5, the local model predicts 0.02 for class 1 (i.e., a strong leaning towards class 0). These numbers are just illustrative, but the idea is that we get a human-readable explanation of each prediction in terms of the original features. In more realistic settings, we could request more features in the explanation (n_features = 5, etc.) and examine which features consistently show up as important.
In summary, while neural networks present challenges for interpretability, a variety of methods exist to extract insights from them. The level of explanation required depends on the use-case: for pure predictive tasks (like language translation or image tagging for internal research), a “black box” may be acceptable; for scientific inference or high-stakes decisions (like criminal justice or healthcare), interpretability is crucial. Social scientists should be aware of these tools and use them to ensure that when they do employ neural networks, they can explain and justify the findings to themselves and to others. Furthermore, such tools can help diagnose when a model might be exploiting undesirable patterns (e.g., proxies for protected attributes or dataset artifacts) and guide improvements.
8.7 Comparison with Traditional Methods
How do neural network models compare with more traditional statistical or machine learning methods commonly used in social science, such as linear/logistic regression or even tree-based models and support vector machines? We consider accuracy, flexibility, and transparency as key dimensions:
- Predictive Accuracy: Neural networks, when appropriately tuned and given sufficient data, often outperform simpler models on complex prediction tasks. Their ability to automatically model interactions and non-linear relationships means they can discover patterns that a linear model or a low-degree polynomial might miss. For example, in text or image analysis, linear models that rely on manually engineered features cannot match the accuracy of deep networks that learn features from raw data. In the conflict-forecasting example mentioned earlier, Beck, King, and Zeng (2000) found that a neural network approach improved prediction accuracy substantially over prior statistical models. However, the advantage is not universal: for many tabular datasets with limited samples and a strong signal-to-noise ratio, methods like gradient boosting machines (e.g., XGBoost) or even well-tuned logistic regressions can perform on par with neural nets. In fact, one reason neural nets have not displaced other methods in social science is that in low-data regimes, very deep models tend to overfit and may have no clear edge. Additionally, ensemble methods like random forests or boosted trees often yield strong performance with far less tuning.
- Flexibility: Neural networks are extremely flexible in terms of the data they can handle and the mappings they can learn. They can naturally incorporate unstructured data (images, text, audio) via CNNs, RNNs, etc., whereas traditional models often require a separate feature extraction step for such data. They can also be extended easily: e.g., one can create multi-task networks that simultaneously predict multiple outcomes, or networks that incorporate multiple input modalities (e.g., taking both text and numeric inputs by combining different subnetworks). Moreover, neural nets can learn internal representations that might be transferrable to other tasks (transfer learning) – something like a logistic regression doesn’t have internal layers to reuse. Traditional statistical models, on the other hand, are less flexible in structure – you often have to decide on interactions or transformations manually. That said, for purely structured data with defined features, tree-based models or linear models can be quite effective and are simpler to implement.
- Transparency and Interpretability: Here traditional methods have a clear advantage. A simple model like a linear regression provides coefficients that (under certain assumptions) directly quantify the effect of each predictor on the outcome. Decision trees yield human-readable rules (e.g., “IF income > $50k AND age < 30 THEN probability of voting = 0.8”). By contrast, a neural network with hundreds or thousands of weights does not provide a straightforward narrative of “X increases Y by Z units.” We must resort to the interpretability techniques discussed (SHAP, LIME, etc.) to get insight, and even then those are post-hoc explanations rather than an inherent part of the model. In many social science applications, explanation is part of the goal – we often care about understanding the social processes at work, not just predicting outcomes. If an algorithm is used in policy, being able to explain its decisions might be essential for it to be accepted. For this reason, interpretable models (or at least simpler proxy models) are often used alongside neural nets in studies: e.g., a researcher might report the results of a logistic regression for interpretability, even if a neural network was used as a robustness check or to validate that no nonlinear patterns were missed.
- Causal Inference and Theoretical Insight: Traditional methods, especially those in the econometrics toolkit, are closely linked to causal inference frameworks. Linear regressions with control variables, instrumental variables regression, difference-in-differences designs, etc., all have well-developed theoretical interpretations for causal estimation. Neural networks can be used as part of causal analysis (for example, to estimate propensity scores or conditional outcome models in a double ML approach), but the core causal identification strategy usually relies on the same old assumptions (ignorability, exclusion restrictions, parallel trends, etc.). Neural nets typically do not provide confidence intervals or significance tests out-of-the-box, whereas traditional methods often do (though one can bootstrap a neural net or use Bayesian versions to get uncertainty estimates). For a social scientist aiming to test a theory or estimate a specific effect, a neural network alone might not be ideal – but it could complement by capturing nuisance functions or suggesting new hypotheses. For example, Mullainathan and Spiess (2017) argued that machine learning is primarily about prediction, and its role in econometrics is to be used for tasks like prediction of counterfactuals or discovery of heterogeneity, while the core inference about causal parameters remains a separate step.
- Scalability: When it comes to very large datasets, neural networks (with GPU acceleration and mini-batch SGD) can scale to millions of examples and high-dimensional inputs. Traditional statistical models can also scale in their own ways (e.g., using stochastic gradient descent for logistic regression or large matrix solvers), but off-the-shelf implementations might struggle with really large data. However, training very large neural networks can be expensive and time-consuming, and they may require specialized hardware (GPUs/TPUs). In social science, datasets are rarely as large as those in commercial deep learning (like billions of tokens or millions of images), except perhaps some forms of text data or network data from the web. So scalability is usually not the limiting factor – data availability is.
In contexts with small data, one often finds that a neural network easily overfits and a regularized linear or tree model performs better. For example, if you have a survey with 500 respondents and 20 predictors, a carefully specified logistic regression (maybe with polynomial terms or interactions chosen based on theory) could outperform a 3-layer neural net that has no guidance and ends up overfitting the noise. A rule of thumb sometimes cited is that you need an order of magnitude more data (in terms of number of training examples) than you have parameters in your neural network to reliably avoid overfitting (though techniques like regularization and transfer learning complicate this simple picture). In many social science problems, we simply don’t have tens of thousands of examples, so simpler models are not only more interpretable but necessary to avoid overfitting.
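To make the rule of thumb concrete with some back-of-the-envelope arithmetic (the architecture here is purely illustrative): a fully connected network with 20 inputs, hidden layers of 64 and 32 ReLU units, and a single sigmoid output has \((20 \times 64 + 64) + (64 \times 32 + 32) + (32 \times 1 + 1) = 3{,}457\) trainable weights and biases, roughly seven times the 500 observations in the hypothetical survey, so without aggressive regularization (or a much smaller architecture) overfitting is almost guaranteed.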
However, in scenarios where you do have rich data (high-dimensional, possibly unstructured, or non-linear signals) and sufficient sample size, neural nets can shine. For instance, if analyzing text from thousands of political speeches to predict a rating of populist vs. technocratic rhetoric, a neural network that learns its own text features may outperform a bag-of-words SVM or a dictionary-based approach, because it can capture subtle phrasing differences and context. Similarly, for predicting policy outcomes from a combination of numeric indicators, social network metrics, and text sentiment, one could build a unified neural network that ingests all these data types, whereas a traditional approach might have to reduce everything to a set of summary indices first.
A pragmatic view is that neural networks complement rather than outright replace traditional methods in the social scientist’s toolkit. One might use neural nets to explore data or to validate that more rigid models aren’t missing something. For example, after running a neural network, you might inspect what variables it found important (via SHAP values) and realize that a certain interaction is important – you could then include that interaction explicitly in a logistic regression and confirm it’s significant and aligns with theory. Conversely, one might use a regression to summarize what a network is doing, as a way to communicate results in a familiar format (e.g., “Using a neural network, we find that the marginal effect of education on income is larger at higher levels of experience, consistent with a complementarity hypothesis”).
In high-stakes decision contexts (loans, criminal justice), there is an ongoing debate about using black-box models vs. interpretable models. Rudin (2019) strongly argues that for high-stakes decisions, one should use interpretable models whenever possible rather than relying on post-hoc explanations of black-boxes. Her point is that an inherently interpretable model (like a sparse rule list or a transparent scoring system) can often be built with little loss in accuracy, and it avoids the risk that the black box might be right for the wrong reasons (which explanations might not fully catch). On the other hand, proponents of black-box use (with explanation) claim that sometimes accuracy is paramount (say, diagnosing cancer from an MRI), and as long as we carefully check the model for bias, the improved accuracy can save lives or resources, even if the model isn’t fully interpretable.
For social scientists, the takeaway is: use the right tool for the job. If a simple model suffices and yields insight, there’s no need to complicate things with a deep network. If the problem involves data types or nonlinear patterns that simpler models can’t handle well, then consider a neural network, but accompany it with appropriate interpretation and validation. And in many cases, consider using both: a neural network for predictive performance or exploratory analysis, and a traditional model for confirmatory analysis or presentation. This way you get the benefits of both – the neural net can uncover patterns and provide a benchmark for maximum predictive power, while the simpler model can test hypotheses and communicate relationships clearly.
8.9 Conclusion
Neural networks offer powerful new tools for social scientists, enabling the modeling of complex patterns in data that were previously difficult or impossible to capture. In this chapter, we have covered the landscape of neural networks in social science applications, moving from theoretical foundations to practical implementation in R. We discussed feedforward neural networks (MLPs) and how they can model non-linear relationships and interactions; convolutional neural networks (CNNs) for handling structured inputs like images or spatial data; and recurrent networks (LSTMs) for sequence data and time series. The mathematical underpinnings – from activation functions to backpropagation – provide insight into how these models learn from data. With the R code examples, we demonstrated how to build and train these networks using modern libraries like Keras, showing that even relatively few lines of code can set up a sophisticated model.
Importantly, we tackled the distinction between predictive modeling and causal inference. Neural networks excel at prediction given enough data, and we showed scenarios where they clearly outperform traditional approaches in predictive accuracy (e.g., the XOR example, or citing improvements in conflict prediction). However, we also emphasized caution when it comes to interpreting these models for causal insights – often a direct causal interpretation of a deep model is not possible without further assumptions or methods. We described how one might integrate neural nets into causal analysis carefully (e.g., using them for propensity score or outcome modeling in a double ML framework, or using them to discover potential interactions which are then tested in a causal model). In practice, a judicious approach might use neural networks to improve certain components of an analysis (like imputation or proxy variable construction) while still relying on more interpretable models or established causal inference techniques for the core analysis.
We also delved into practical aspects of model training: how to choose loss functions, how gradient descent and its variants (SGD, Adam) work, and how to use regularization methods (dropout, weight decay, etc.) to prevent overfitting. These are essential for any applied work because a poorly trained network is no better than a random guess or a misleading curve fit. Through examples and discussion, we highlighted how to monitor training and tune hyperparameters, using validation data to guide decisions. We emphasized that data preprocessing (normalization, encoding) is often as important as model architecture in getting networks to train properly.
The section on interpretability and explainability addressed the black-box critique of neural nets. We presented methods such as LIME and SHAP for explaining individual predictions, and stressed the importance of transparency especially in policy-relevant applications. We gave an example of using LIME in R to interpret an MLP’s decisions, illustrating how even a complex model can be probed to yield understandable insights (like which features were driving a prediction). This is crucial: if neural networks are to be used in social science research, researchers must ensure they can interpret and validate what the model is doing, to avoid drawing false substantive conclusions or deploying biased algorithms. The array of XAI tools available today makes it feasible to open up the black box to a significant degree, though it requires extra effort.
In comparing neural networks with traditional methods, a theme emerged that each has its place. Neural networks bring flexibility and often better pure predictive power (especially with rich data), while traditional models bring simplicity and interpretability. We highlighted scenarios where neural networks add value (complex interactions, high-dimensional data, text/image analysis) and where they may not (very small datasets, where interpretability is paramount and patterns are linear enough). We also noted that these approaches can be combined – e.g., using a neural net for feature learning and a regression for the final analysis, or using regression to summarize a neural net model’s behavior. The social scientist’s goal is often to maximize insight, not just accuracy, and sometimes the insight comes from the combination of sophisticated algorithms and human interpretation/theory.
We covered practical challenges such as small sample sizes, class imbalance, computational constraints, and ethical issues. For each, we gave tips: e.g., use transfer learning for small data, class weighting for imbalance, GPU/cloud resources for heavy computation, and fairness checks for ethical considerations. These are the nuts-and-bolts issues one encounters when actually trying to use neural nets on social data, and addressing them is key to a successful project. As with any method, using neural networks responsibly means understanding their limitations and failure modes (like overfitting or bias) and proactively mitigating them.
A recurring message is that neural networks do not replace the need for theory and careful research design. Rather, they are tools that can uncover patterns we might otherwise miss, or improve predictions/measurements that feed into larger analyses. For example, a neural network might produce a better measure of ideology from text, which a political scientist can then use in a regression to test a hypothesis about legislative behavior (as in Knox, Lucas, & Cho, 2022’s discussion of learned proxies). The theory about legislative behavior remains critical – the neural network is just improving the measurement of one variable. Likewise, a neural network might predict protests, but a social scientist still needs to interpret why those factors matter and what it means for theories of collective action or political instability.
Looking forward, as social phenomena generate increasingly complex and large-scale data (from social media, sensors, digital trace data, etc.), neural networks and deep learning are likely to play a growing role in social science research. Areas like computational sociology, political text analysis, and economic applications of machine learning are already burgeoning. At the same time, the barriers to entry are falling: with high-level APIs and many pre-trained models available, one does not need a Ph.D. in computer science to apply these methods. What one does need is a strong grasp of research design and domain knowledge, so that the questions asked of the data are meaningful and the results are interpreted correctly. A danger with any powerful technique is the potential for misuse (data mining without theory, finding spurious “significant” patterns, etc.). By combining the strengths of neural networks (flexibility, performance) with the rigor of social science methodology (validity, theory-driven inquiry), researchers can unlock new insights while avoiding pitfalls.
Neural networks are a powerful addition to the social scientist’s analytic toolkit – but they should be used thoughtfully. They can uncover patterns and improve predictions in ways that open up new research questions and practical solutions (e.g., more accurate early warning systems for crises, better measurement of latent social traits, etc.). At the same time, they come with the responsibility to ensure interpretability, fairness, and robustness. The chapter has aimed to equip readers with both the how (implementation in R) and the when/why (appropriate use cases and limitations) of neural networks in social science research. The hope is that readers will feel empowered to experiment with these methods in their own work – whether it’s predicting an election, analyzing survey open-end texts, or modeling the evolution of a social network – while maintaining the critical perspective of a social scientist. With rigorous exposition and reproducible code examples, this chapter serves as a bridge between the exciting developments in deep learning and the rich, nuanced problems of social science, encouraging a fruitful interplay between the two.
References
Chollet, F., & Allaire, J. J. (2018). Deep learning with R. Manning.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Hochreiter, S., & Schmidhuber, J. (1997). Long short‐term memory. Neural Computation, 9(8), 1735–1780.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In F. Bach & D. Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning (pp. 448–456). PMLR.
Johansson, F., Shalit, U., & Sontag, D. (2016). Learning representations for counterfactual inference. In M. Balcan & K. Q. Weinberger (Eds.), Proceedings of the 33rd International Conference on Machine Learning (pp. 3020–3029). PMLR.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR 2015). arXiv:1412.6980.
Koch, B., Sainburg, T., Bastías, P. G., Jiang, S., Sun, Y., & Foster, J. (2024). A primer on deep learning for causal inference. Sociological Methods & Research, 54(2), 397–447.
Knox, D., Lucas, C., & Cho, W. K. T. (2022). Testing causal theories with learned proxies. Annual Review of Political Science, 25, 419–441.
Lam, O., Hughes, A., & Wojcik, S. (2019, January 30). How social scientists can use transfer learning to kick‑start a deep learning project. Pew Research Center: Decoded.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient‑based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Mullainathan, S., & Spiess, J. (2017). Machine learning: An applied econometric approach. Journal of Economic Perspectives, 31(2), 87–106.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135–1144). ACM.
Rudin, C. (2019). Stop explaining black box machine learning models for high‑stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206–215.
Shalit, U., Johansson, F. D., & Sontag, D. (2017). Estimating individual treatment effect: Generalization bounds and algorithms. In D. Precup & Y. W. Teh (Eds.), Proceedings of the 34th International Conference on Machine Learning (pp. 3076–3085). PMLR.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon et al. (Eds.), Advances in Neural Information Processing Systems, 30 (pp. 5998–6008). Curran Associates.
Yan, X., Zhao, J., Ding, W., & Luo, X. (2020). Estimating city‑scale passenger‑car fuel consumption using street‑view images. Computers, Environment and Urban Systems, 82, 101489.