7 Machine Learning in Action – KNN and Lessons Learned
In the previous chapter, we introduced fundamental concepts of machine learning and contrasted them with traditional statistical modeling. We saw, for example, how a method like logistic regression can be used for classification, and we discussed the pitfalls of evaluating models only on the data used to fit them. In this chapter, we build on that foundation by putting machine learning into practice with a concrete example. Our goal is to illustrate the typical ML workflow and highlight why this approach is a significant improvement over naive or one-off analyses. We will use the K-Nearest Neighbors (KNN) algorithm as a simple but powerful case study. Through this example, we’ll derive general lessons about model validation, the bias–variance trade-off, and the importance of process in ML. Finally, we will examine several real-world case studies that demonstrate both the power and the potential pitfalls of machine learning, thus underscoring why this chapter’s lessons are relevant—especially as a prelude to the next chapter on AI and ethics.
In earlier chapters, we explored the theory behind machine learning and how it promises better generalization than some traditional approaches. Now it’s time to see ML in action. Machine Learning (ML) can be broadly defined as the field of enabling computers to learn patterns from data and make predictions or decisions without being explicitly programmed for each scenario. ML algorithms can perform tasks such as regression (predicting continuous quantities) or classification (assigning labels or categories), among others. A key aspect that distinguishes ML practice is an emphasis on model generalization: rather than just fitting a model to one dataset and stopping, we train models on a training dataset and then evaluate them on separate validation or test datasets. This approach ensures that the learned patterns actually apply to new, unseen data and are not just artifacts of the training set. In the last chapter, we discussed how relying solely on training performance (like reporting a high \(R^2\) or accuracy on the training data) can be misleading, because a model might simply be overfitting—memorizing the noise or quirks in the training set rather than learning the true underlying signal. By holding out data for testing, we get a more realistic measure of performance on future data.
To make these ideas concrete, let’s delve into a practical example of ML using one of the simplest classification methods: K-Nearest Neighbors (KNN). We choose KNN for this demonstration because it’s easy to understand and visualize, yet it embodies many aspects of the ML approach (like the need for proper validation). Moreover, KNN offers a nice contrast to the parametric models (such as logistic regression) that we covered previously. Unlike logistic regression, which learns a set of coefficients and produces a formula for prediction, KNN is a non-parametric, instance-based method: it makes predictions by directly using the training data, without developing an explicit formula for the relationship between features and the target. This will illustrate a different way of “learning from data” and help us discuss when such flexible approaches might be advantageous.
7.1 What is K-Nearest Neighbors?
K-Nearest Neighbors is an intuitive algorithm for classification (and it can be used for regression too, though here we focus on classification). The premise is simple: to predict the class of a new observation, look at the “nearest” data points in the training set and use their classes to make your prediction. In other words, KNN assumes that cases with similar features are likely to have the same label. The algorithm has just one key parameter, \(K\), which is the number of neighbors (training points) to consult.
Here’s how KNN classification works step by step:
- Choose \(K\): This is a user-specified number of neighbors (usually a small odd integer like 3, 5, 7, etc., to avoid ties in binary classification).
- Measure distance: When a new data point (with features) needs to be classified, the algorithm computes the distance between this point and every point in the training dataset. In the simplest case of numeric features, the distance is often the Euclidean distance in feature space. For instance, if we have features \(x_1, x_2, \ldots, x_p\), and our new point has features \((x_1^*, x_2^*, \ldots, x_p^*)\), then the distance to a training point \((x_1, x_2, \ldots, x_p)\) could be \(\sqrt{\sum_{j=1}^p (x_j - x_j^*)^2}\). Other distance metrics (like Manhattan distance) can also be used depending on the context.
- Find nearest neighbors: Identify the \(K\) training points that are closest (have the smallest distance) to the new point.
- Majority vote: Look at the classes of these \(K\) neighbors. Whichever class is most common among them is the class that KNN will predict for the new point. (For example, if \(K=5\) and among the 5 nearest neighbors 3 are labeled “Up” and 2 are labeled “Down,” the predicted class would be “Up.”)
- Tie-break (if needed): If there’s a tie (which can happen if \(K\) is even, or if there are multiple classes and the top counts are equal), there are various tie-breaking strategies. Often one can resolve a tie by taking the class of the closest neighbor among the tied group, or simply choose at random if needed. In practice, one usually picks an odd \(K\) for binary classification to reduce the chance of a tie.
The beauty of KNN lies in its simplicity—there’s no complex mathematical model being fit; the “training” process essentially boils down to storing all the training data. The heavy lifting happens at prediction time, when distances are computed and neighbors are found. This is fundamentally different from, say, logistic regression, which in training fits a weight to each feature and then uses that fixed equation for all future predictions. KNN doesn’t assume any particular form for the decision boundary; it can form very irregular boundaries if the data demand it. In fact, in the limit of infinite data, 1-nearest neighbor is guaranteed to perform at worst with twice the error rate of the optimal classifier for the problem (the so-called Bayes optimal classifier) – a classic theoretical result by Cover and Hart (1967). However, this flexibility comes at a cost: KNN can be sensitive to the choice of \(K\) and the notion of distance, and it can be computationally expensive to compute distances to all points for each prediction if the dataset is large.
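To make the mechanics concrete, here is a minimal from-scratch sketch in R of the classifier described above (the Euclidean distance, the random tie-break, and the toy data are illustrative assumptions, not the implementation we use later in the chapter):

```r
# Predict the class of one new point by majority vote among its K nearest
# training points (Euclidean distance on numeric features).
knn_predict_one <- function(x_new, X_train, y_train, k = 5) {
  dists <- sqrt(rowSums(sweep(X_train, 2, x_new)^2))  # distance to every training point
  nn <- order(dists)[seq_len(k)]                      # indices of the K closest points
  votes <- table(y_train[nn])                         # count the neighbors' labels
  winners <- names(votes)[votes == max(votes)]
  sample(winners, 1)                                  # random tie-break if needed
}

# Toy illustration: 20 training points with two features and two classes
set.seed(1)
X_toy <- matrix(rnorm(40), ncol = 2)
y_toy <- factor(rep(c("Up", "Down"), each = 10))
knn_predict_one(c(0.2, -0.1), X_toy, y_toy, k = 5)
```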
Choosing the Right \(K\)
The parameter \(K\) determines how “localized” our prediction is:
- If \(K\) is very small (like \(K=1\)), we are essentially looking at the single nearest neighbor. The classifier will predict that every new point has whatever label its closest training neighbor has. This can lead to very wiggly, highly detailed decision boundaries that chase individual training points. In the extreme case of \(K=1\), the training error is zero (each training point is its own nearest neighbor, so it always "predicts" its own class), but the variance of the model is high: small changes in the input or differences in the training set can drastically change predictions. In other words, \(K=1\) risks overfitting—capturing noise or outliers in the training data as if they were important patterns.
- If \(K\) is very large (imagine \(K\) equals the total number of training points \(N\)), then the prediction for a new point will be the majority class of the entire training set, essentially ignoring the specific features of the new case. That yields a very smooth, simple decision function (in fact, if \(K=N\) in a binary classification, the model predicts the same class for all inputs – whichever class is global majority). This has high bias (it is too simplistic to capture the true pattern if the true decision boundary is complex) but low variance (if we changed one training point, it hardly matters because so many others are voting). Large \(K\) can lead to underfitting—failing to capture meaningful local structure by averaging with too many unrelated points.
Thus, selecting \(K\) is a classic example of managing the bias–variance trade-off (a fundamental concept in machine learning and statistics). A smaller \(K\) gives a more flexible model (low bias, high variance), and a larger \(K\) gives a more constrained model (high bias, low variance). We need to find a sweet spot that generalizes well. As a rule of thumb, one typically wants enough neighbors to smooth out noise, but not so many that local patterns get washed out by global averaging.
How do we choose \(K\) in practice? This is where the machine learning mindset comes in: we use data-driven validation. Instead of guessing or relying on theory alone, we can let the data inform the best \(K\). Typically, one would split off a validation set or use cross-validation (as we’ll describe shortly) to try different values of \(K\) and see which yields the best predictive performance on data not used for training. This empirical approach to tuning hyperparameters like \(K\) is a hallmark of machine learning, differentiating it from some traditional modeling where such choices might be made based on analytical convenience or prior assumptions.
Before we dive into tuning \(K\), let’s set up a concrete example so we have a playground for experimentation.
A Practical Example: Predicting Stock Market Direction with KNN
To illustrate KNN, we will use an example adapted from the book An Introduction to Statistical Learning (ISLR). Suppose we have daily stock market data and we want to predict whether the stock market will go Up or Down on a given day based on recent trends. The Smarket dataset from ISLR contains records for 1,250 trading days, including the daily percentage returns for the previous five days (Lag1 through Lag5), the trading volume, and the Direction (whether the market went up or down on that day).
In the previous chapter, we analyzed this dataset using a logistic regression model as a classical statistical approach. We might recall that logistic regression on all features didn’t perform very well on out-of-sample data (it had about 52% accuracy in predicting direction for the year 2005, as reported in the ISLR lab). Now, we’ll apply KNN to the same task, but to keep things simple and avoid the “curse of dimensionality,” we will use only two features: Lag1 and Lag2 (the returns from the previous 1 and 2 days). Using fewer features makes it easier to visualize what KNN is doing and also reduces the risk that distance calculations become dominated by irrelevant dimensions.
1. Data Preparation: We first split the data into a training set and a test set. A common practice in time series or any scenario with a chronological order is to train on older data and test on newer data. Here, we’ll use the data from 2001–2004 (998 days) as our training set, and we’ll test on the data from 2005 (252 days). This mimics a real-world situation where we train a model on past data and then make predictions on the future. Crucially, we do not peek at 2005 during training or model tuning—that will remain untouched until the very end when we evaluate our final model. Along with splitting the data, we separate the features from the labels. That is, we have a matrix of features X_train (containing Lag1 and Lag2 for 2001–2004) and a vector of labels y_train (containing “Up” or “Down” for each of those days). Similarly, X_test and y_test hold the features and labels for 2005.
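A minimal sketch of this split in R, assuming the ISLR package is installed (the object names X_train, y_train, X_test, and y_test are our own and simply mirror the description above):

```r
library(ISLR)  # provides the Smarket data set

train_idx <- Smarket$Year < 2005                    # 2001-2004 training period
X_train <- Smarket[train_idx,  c("Lag1", "Lag2")]
X_test  <- Smarket[!train_idx, c("Lag1", "Lag2")]
y_train <- Smarket$Direction[train_idx]
y_test  <- Smarket$Direction[!train_idx]            # held out until final evaluation

nrow(X_train)  # 998 training days
nrow(X_test)   # 252 test days (all of 2005)
```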
2. Feature Scaling (Pre-processing): An important step before running KNN (or any distance-based method) is to ensure that features are on comparable scales. If one feature were measured in, say, dollars (ranging from 0 to thousands) and another in percentages (ranging from 0 to maybe 5), then the distance calculation would be dominated by the feature with the larger numeric range, effectively giving it more “weight” in the neighbor calculation. In our case, Lag1 and Lag2 are both percentage returns (often small values like +0.1%, –0.5%, etc.), and they’re roughly on the same scale already. Even so, it’s good practice to standardize features: typically by centering (subtracting the mean) and scaling (dividing by the standard deviation). This transforms each feature to have mean 0 and standard deviation 1 (based on the training set). For example, if the mean of Lag1 in the training data is \(\mu_{\text{Lag1}}\) and the standard deviation is \(\sigma_{\text{Lag1}}\), we transform each value \(x\) to \((x - \mu_{\text{Lag1}})/\sigma_{\text{Lag1}}\). This is known as z-score standardization. Another approach is min-max normalization (scaling values to a [0,1] range), but z-scores are more common for KNN. In R, one can do this scaling easily (using the scale() function on the training features, and then using the obtained means and SDs to scale the test features). In Python’s scikit-learn, you would use a StandardScaler. The key point is to scale the test data using the training mean and SD, not its own, to avoid information leaking from test to train.
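Continuing the sketch above, the standardization might look like this in R (again an illustrative sketch; scale() stores the training means and SDs as attributes, which we then reuse for the test set):

```r
# Standardize the training features (center and scale each column)
X_train_scaled <- scale(X_train)
train_center   <- attr(X_train_scaled, "scaled:center")  # training means
train_scale    <- attr(X_train_scaled, "scaled:scale")   # training SDs

# Apply the *training* means and SDs to the test features: no leakage
X_test_scaled  <- scale(X_test, center = train_center, scale = train_scale)
```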
3. Choosing \(K\) with Cross-Validation: Now we train the KNN model. But “training” for KNN is trivial (just storing the data), so the real question is: what \(K\) to use? Rather than arbitrarily picking, we use cross-validation on the training set to guide us. For instance, we might try \(K = 1, 3, 5, 7, \ldots, 15\) and see which works best. A simple approach is to perform a leave-one-out cross-validation for each \(K\) (though that can be computationally heavy), or a 5-fold cross-validation: split the 2001–2004 training data into 5 folds, for each fold train on the other 4 and test on that fold, compute accuracy, and average the results for a given \(K\). Suppose we do this and find that the average validation accuracy is highest at, say, \(K=5\). (Indeed, in the ISLR lab, using \(K=5\) or \(K=7\) tends to be near optimal for similar data.) This process of tuning \(K\) via cross-validation ensures we’re choosing a model that is likely to perform well on unseen data, rather than one that just got lucky on the training set.
Let’s say the cross-validation suggests \(K=5\) is the best. We then finalize our model as KNN with \(K=5\), trained on the entire 2001–2004 dataset (with the understanding that this choice of \(K\) was informed by an internal validation process on the training data).
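One way to run this search in R is with knn.cv() from the class package, which performs leave-one-out cross-validation on the training set (the grid of candidate values and the choice of leave-one-out rather than 5-fold are illustrative; knn() breaks ties at random, so we set a seed):

```r
library(class)   # knn() and knn.cv()

set.seed(1)
k_grid <- c(1, 3, 5, 7, 9, 11, 13, 15)

cv_accuracy <- sapply(k_grid, function(k) {
  # Leave-one-out CV: each training day is predicted from all the others
  pred <- knn.cv(train = X_train_scaled, cl = y_train, k = k)
  mean(pred == y_train)
})

data.frame(k = k_grid, accuracy = cv_accuracy)
best_k <- k_grid[which.max(cv_accuracy)]   # the K we would carry forward
```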
4. Making Predictions on Test Data: Now comes the moment of truth: we take our finalized KNN(5) model and make predictions for each day in 2005 (252 observations). For each test observation, the algorithm finds the 5 closest points among the 998 training points and assigns the majority vote. The output is a predicted Direction (“Up” or “Down”) for each day in 2005. We then compare these predictions to the actual Direction in 2005 (which, remember, we held out and did not use at all in modeling or validation).
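In R, this step is a single call to knn() from the class package, followed by a cross-tabulation of predictions against the 2005 labels (a sketch continuing the objects defined above):

```r
set.seed(1)  # knn() resolves ties randomly
knn_pred <- knn(train = X_train_scaled,
                test  = X_test_scaled,
                cl    = y_train,
                k     = 5)

# Confusion matrix: predicted direction vs. actual 2005 direction
table(Predicted = knn_pred, Actual = y_test)
mean(knn_pred == y_test)   # overall test accuracy
```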
5. Evaluating the Results – Confusion Matrix and Metrics: It’s important not just to get an accuracy number, but to really understand how the model is performing. We use a confusion matrix to summarize the results, tallying correct vs. incorrect predictions for each class:
- True Positives (TP): Cases where the market went Up and our model predicted Up.
- True Negatives (TN): Cases where the market went Down and our model predicted Down.
- False Positives (FP): Cases where the actual outcome was Down but our model wrongly predicted Up (a “false alarm”).
- False Negatives (FN): Cases where the actual outcome was Up but our model predicted Down (a “miss”).
For example, suppose in 2005 there were 252 trading days, of which 130 were “Up” days and 122 were “Down” days. The KNN model with \(K=5\) makes its predictions and we tally the outcomes. We might get a confusion matrix like this (hypothetically):
            | Predicted Up | Predicted Down
Actual Up   |      60      |       70
Actual Down |      30      |       92
This would mean: out of 130 actual Up days, it correctly predicted 60 (TP = 60) and missed 70 (FN = 70). Out of 122 actual Down days, it correctly predicted 92 (TN = 92) and falsely predicted Up on 30 (FP = 30).
From this matrix, we can compute various metrics:
- Accuracy: \((TP + TN) / \text{total predictions}\). In the hypothetical numbers above, accuracy = \((60+92)/252 \approx 0.603\) or 60.3%. Accuracy is the simplest metric—how often you’re right overall. However, accuracy can be misleading if the classes are imbalanced. (In our case, classes are roughly balanced. But imagine 90% of days are Up; then a trivial “always predict Up” model gets 90% accuracy, which sounds good but is useless for catching Down days.)
- Recall (Sensitivity or True Positive Rate): For the positive class (let’s define “Up” as the positive class here), recall = \(TP / (TP + FN)\). In our example, recall for Up days = \(60/(60+70) \approx 0.462\), or 46.2%. This means the model caught less than half of the actual Up movements. Recall answers: “Out of the days the market actually went up, how many did the model correctly predict as up?”
- Specificity: True Negative Rate = \(TN / (TN + FP)\). For Down days, specificity = \(92/(92+30) \approx 0.754\), or 75.4%. That means when the market actually went down, the model correctly predicted a down 75% of the time. (Specificity is essentially recall for the negative class, or 1 – False Positive Rate.)
- Precision (Positive Predictive Value): \(TP / (TP + FP)\). In our example, precision on predicting “Up” = \(60/(60+30) \approx 0.667\), or 66.7%. Precision answers: “When the model says ‘Up’, how often is it correct?” This is important in contexts where false positives have a cost. If our model frequently predicts “Up” but is often wrong, that might not be useful for making trading decisions.
- F1 Score: The harmonic mean of precision and recall: \(F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\). This gives a single-number metric that balances the trade-off between precision and recall. In the example, F1 for the Up class would be \(2 \cdot \frac{0.667 \cdot 0.462}{0.667+0.462} \approx 0.546\) (54.6%). F1 is useful when you care about balancing false negatives and false positives (like in many classification tasks where class imbalance is an issue).
- False Positive Rate (FPR): \(FP/(FP+TN)\). In our example, FPR = \(30/(30+92) \approx 0.246\), or 24.6%. This is 1 – specificity, the fraction of actual Down days that were incorrectly predicted as Up. It helps in contexts where you want to explicitly consider the trade-off with TPR (recall) via ROC analysis. (The short code sketch after this list computes all of these metrics directly from the confusion matrix.)
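Here is that sketch in R, plugging in the hypothetical counts from the table above (TP = 60, FN = 70, FP = 30, TN = 92):

```r
TP <- 60; FN <- 70; FP <- 30; TN <- 92   # hypothetical counts from the example

accuracy    <- (TP + TN) / (TP + TN + FP + FN)               # 0.603
recall      <- TP / (TP + FN)                                # 0.462 (sensitivity / TPR)
specificity <- TN / (TN + FP)                                # 0.754
precision   <- TP / (TP + FP)                                # 0.667
f1          <- 2 * precision * recall / (precision + recall) # 0.545 (0.546 if precision and recall are rounded first)
fpr         <- FP / (FP + TN)                                # 0.246 (= 1 - specificity)

round(c(accuracy = accuracy, recall = recall, specificity = specificity,
        precision = precision, F1 = f1, FPR = fpr), 3)
```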
In practice, for the actual stock market KNN model (as reported in ISLR with the Lag1 and Lag2 features), the accuracy was only slightly above 50%. Our hypothetical 60% was optimistic; indeed KNN with just two lag variables is only modestly better than coin flipping for this task. That’s an important lesson: not all problems have easily exploitable patterns, and short-term stock movements are notoriously hard to predict. Markets are almost random in the short run, at least with such simple features.
6. (Optional) ROC Curve and AUC: Even though KNN gives discrete class votes, we can get a sort of “confidence” by looking at the fraction of the \(K\) neighbors that voted Up. For instance, if 4 out of 5 neighbors are “Up”, we might translate that to an 80% estimated probability of Up (this isn’t a calibrated probability, but it’s a score). By varying a threshold on this score (e.g., not necessarily using majority 0.5, but saying things like “predict Up only if at least 4 of 5 neighbors are Up (i.e., ≥80% neighbor vote)”), we can plot a Receiver Operating Characteristic (ROC) curve. The ROC curve shows the trade-off between True Positive Rate (Recall) and False Positive Rate as we vary that threshold. The Area Under the ROC Curve (AUC) gives a threshold-independent measure of performance: 0.5 means no better than random, 1.0 means perfect separation of classes. If we were evaluating models in depth, we might compare AUCs. For our KNN model, since it’s basically at ~50-55% accuracy, the AUC might be only a bit above 0.5, indicating limited predictive power. We won’t go deeper into ROC for this example, but it’s good to know the tool exists, particularly in fields like medicine or finance where you might tune models to be more sensitive or more specific depending on the need.
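If you want to try this, one hedged sketch in R uses the vote shares returned by knn(..., prob = TRUE) as a score and the pROC package for the curve and AUC (pROC is just one of several packages that could do this; with \(K=5\) the score takes only a few distinct values, so the curve will be coarse):

```r
library(class)
library(pROC)   # roc() and auc()

set.seed(1)
knn_pred <- knn(X_train_scaled, X_test_scaled, cl = y_train, k = 5, prob = TRUE)

# attr(, "prob") is the vote share of the *winning* class for each test day;
# convert it into an estimated score for "Up"
vote_share <- attr(knn_pred, "prob")
p_up <- ifelse(knn_pred == "Up", vote_share, 1 - vote_share)

roc_obj <- roc(response = y_test, predictor = p_up, levels = c("Down", "Up"))
auc(roc_obj)    # values near 0.5 would mean little real predictive power
plot(roc_obj)   # TPR vs. FPR as the vote threshold varies
```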
7. Result Interpretation: Summarizing the KNN example reinforces some key points:
- KNN “learns by example.” It literally keeps the training dataset around and bases predictions on it. This is in contrast to a model like logistic regression, which condenses the information in the training data into a fixed set of parameters (coefficients) and then discards the detailed training data.
- Distance calculations mean that feature scaling matters. We took care to standardize Lag1 and Lag2. If we hadn’t and, say, Lag1 had a much larger variance than Lag2, our distance measure would unintentionally give more weight to Lag1. In KNN (and many ML methods), proper preprocessing can significantly affect results.
- KNN has a hyperparameter (\(K\)) that we tuned using data. This underscores the ML approach of using validation: rather than relying on theory or guesswork alone, we empirically found a good setting that balances overfitting and underfitting for our problem.
- The reasoning of KNN is somewhat transparent: if someone asks “Why did the model predict the market will go Up on this day?”, we can say “Because the 5 most similar days in the past (in terms of recent returns) were mostly Up days.” This is more interpretable than some models (like a large neural network), though less so than a simple logistic regression coefficient (“because the model has a positive weight on yesterday’s return, meaning a higher return yesterday makes it predict Up”).
- KNN’s simplicity is a double-edged sword. It’s easy to implement and explain. However, computationally, KNN requires comparing to every training point to classify one new point (unless you use clever data structures or approximations), so it can be slow if the training set is very large (millions of points) or if the feature space is high-dimensional. In contrast, logistic regression or neural networks take time to train, but once trained, making a prediction is fast (just plug into the formula). So KNN shifts the burden to query time. Additionally, KNN doesn’t handle categorical features or missing values naturally without some modifications (like defining distances for categorical variables or imputation), whereas some other methods can incorporate those more directly.
Having walked through KNN, we’ve effectively seen a microcosm of the machine learning workflow: data prep, model fitting, validation (choosing \(K\)), testing (evaluating on new data), and interpretation. Next, we will step back and generalize these steps into a broader view of the ML process, extracting lessons that apply not just to KNN but to all machine learning methods.
7.2 The Machine Learning Process and Key Lessons
The KNN example above wasn’t just about a specific algorithm; it was also about how we approach building a predictive model. One of the biggest takeaways from modern machine learning is that the process by which you develop a model is as important as the model itself. In fact, many of the advances in ML have come not only from new algorithms, but from a better understanding of how to train and evaluate models properly.
Let’s outline the typical machine learning workflow (often called the ML pipeline). You may notice it formalizes and extends the steps we implicitly took in the KNN example:
Problem Definition and Data Collection: Everything starts with understanding the problem you’re trying to solve and obtaining relevant data. If the question is predicting stock market direction, you gather historical market data. If the task is image recognition, you need a dataset of labeled images. This stage might involve defining the target variable (what you want to predict) and the features (what you will base predictions on). It also involves ensuring you have a representative sample of data. One lesson here is “garbage in, garbage out”: even the fanciest ML algorithm won’t help if your data are badly flawed or not relevant. In many real projects, data collection is a huge effort: logging events in software, scraping data from the web, integrating databases, etc. Moreover, if there are biases or gaps in the data (for example, an HR dataset mostly containing male applicants), those carry through to the model. Often, domain experts are involved here to guide what data should be collected and what might be useful.
Data Preparation (Cleaning and Pre-processing): Raw data is seldom clean. This step includes handling missing values (maybe filling them in or discarding incomplete records), removing or correcting outliers, and dealing with inconsistencies (like different units or category encodings). It might also include combining features or creating new ones (feature engineering). For example, for our stock data we might create a feature that is the average of Lag1 and Lag2, or a volatility measure over the past week. Pre-processing also includes splitting the data into training/validation/test sets, as we emphasized. An important principle is to avoid data leakage: any information that would not actually be available at prediction time should not be used in training. A classic mistake is accidentally using future data to predict the past (which could happen in time series if you shuffle data or don’t time-split properly). Another leakage example is computing a feature using the whole dataset rather than just the training subset—like scaling the test data using the test set’s own mean, which would leak knowledge of the test distribution. Good ML practitioners are very careful about these boundaries.
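As a minimal illustration of the time-split point (reusing the Smarket data from earlier; the "leaky" version is shown only as the thing to avoid):

```r
# Leaky (wrong) for time-ordered data: a random split lets "future" days
# end up in the training set used to predict "past" days
set.seed(1)
shuffled    <- sample(nrow(Smarket))
train_leaky <- Smarket[shuffled[1:998], ]
test_leaky  <- Smarket[shuffled[999:1250], ]

# Correct: split chronologically, training on 2001-2004 and testing on 2005
train_ok <- Smarket[Smarket$Year <  2005, ]
test_ok  <- Smarket[Smarket$Year == 2005, ]
```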
Exploratory Data Analysis (EDA): This is not always highlighted in ML workflows, but it’s something good practitioners do. They’ll plot data, look at distributions, visualize correlations or trends, to get intuition about what features might be predictive and what idiosyncrasies the data might have. EDA helps inform feature engineering and model selection. For instance, you might discover that one feature is mostly missing and decide to drop it, or notice that the relationship between a feature and the target looks nonlinear (suggesting a nonlinear model or a need for transformation). EDA can also reveal class imbalances or data quality issues that need to be addressed (e.g., maybe you find out that in stock data, the distribution of returns has heavy tails, which might affect certain models).
Model Selection and Training: Here’s where we pick a model or a set of models to try. It could be as simple as choosing between KNN and logistic regression, or as complex as designing a deep neural network architecture from scratch. We then train the model on the training dataset. “Training” means using an algorithm to adjust the model’s parameters to fit the data. In KNN there are no tunable parameters aside from \(K\), but in a neural network there could be millions of parameters to optimize via gradient descent. A key part of this step is also hyperparameter tuning. These are the settings of the model that aren’t learned from the data but set by the practitioner (like the number of neighbors \(K\), or the maximum depth of a decision tree, or the learning rate in a neural network). We typically use cross-validation on the training set to find good hyperparameters. In our KNN example, \(K\) was the hyperparameter we tuned by trying different values and evaluating on a validation fold. For a decision tree, a hyperparameter might be the maximum depth of the tree or minimum leaf size. We often automate this search (e.g., grid search, random search, or more sophisticated Bayesian optimization) across hyperparameters, guided by the validation performance. The model that performs best on validation data (after tuning) is then selected to move forward.
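As a sketch of what an automated search can look like in R, the caret package (mentioned again below) wraps the tune-by-cross-validation loop for KNN in a few lines; the 5-fold setting and the grid of odd \(K\) values are illustrative choices:

```r
library(caret)

set.seed(1)
ctrl <- trainControl(method = "cv", number = 5)      # 5-fold cross-validation

knn_tuned <- train(Direction ~ Lag1 + Lag2,
                   data       = Smarket[Smarket$Year < 2005, ],
                   method     = "knn",
                   preProcess = c("center", "scale"), # scaling refit inside each fold
                   tuneGrid   = data.frame(k = seq(1, 15, by = 2)),
                   trControl  = ctrl)

knn_tuned$bestTune   # the K with the best cross-validated accuracy
```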
Model Evaluation: After selecting the best model (and hyperparameters) using the training data (and its internal validation process), we finally test the model on the hold-out test set that we kept aside at the beginning (and never touched until now). This gives an unbiased estimate of how the model will perform on completely new data. We then look at metrics like accuracy, precision/recall, F1, AUC, etc., as appropriate for the problem. In many cases, multiple metrics are examined to get a full picture. For example, in an imbalanced classification (like detecting fraud where 0.1% of transactions are fraudulent), overall accuracy might be high for trivial reasons (predicting “not fraud” always), so you’d focus on recall and precision for the fraud class instead. The test evaluation is crucial; it’s the report card of how well we have generalized. If the performance on the test set is much worse than on the validation set, it could mean we overfit the validation process (perhaps by trying too many models or peeking at test early – a form of “evaluation leakage”). In practice, some workflows have an explicit train/validation/test split (the validation was used for tuning, and the test is a final evaluation), while others use train/test with cross-validation on train. Regardless, it’s vital that the test data provides a fresh perspective that the model has never seen during development.
Deployment of the Model: If the model passes the evaluation stage (i.e., it’s accurate enough to be useful, and perhaps other considerations like speed or interpretability are acceptable), the model can be deployed into the real world. Deployment could mean putting the model in an application (e.g., a web service that given input features returns a prediction), embedding it in a device (like a mobile app with an on-device model), or using it to inform decisions in an organization. Deployment brings its own challenges: the model needs to handle real-world data which might be slightly different from your training data, the system needs to be robust (no crashes, reasonable response time), and there may be integration with other software. One lesson here is that engineering considerations matter; a model that is too slow or too large might be impractical even if it’s accurate. For example, a huge neural network that takes 5 seconds to classify an image might be unusable in a real-time app that needs instant results, in which case a slightly less accurate but much faster model might be chosen instead. Sometimes, model compression or distillation is done to make models more deployable.
Monitoring and Maintenance: The story doesn’t end once the model is deployed. Data and environments change over time—a phenomenon often referred to as data drift or concept drift. For instance, if we deployed our stock market model, the market dynamics in 2025 might be different from those in our training data (2001–2004), especially after events or regime changes. Over time, the model’s performance may deteriorate. Therefore, it’s important to continuously monitor how the model is doing on new data. Are the error rates creeping up? Are there certain segments of data where it’s failing more often? Many organizations set up automated monitoring: dashboards for metrics, alerts if performance falls below a threshold. When drift is detected or after some time, the model should be retrained or updated with fresh data. This could be as simple as re-running the training process on the last N months of data, or as complex as an online learning algorithm that updates continuously. Maintenance also includes fixing any issues (bugs in data pipelines, etc.) and adjusting the model if requirements change. Essentially, an ML model in production is a product that requires upkeep – it’s not a one-off analysis that you can forget about.
Ethical and Human-in-the-loop Considerations: This step is increasingly recognized as essential in the ML pipeline (we will dive deep into it in Chapter 13 on AI and Ethics). At every stage, we should assess if the model could be creating or perpetuating unfair biases, if its decisions can be understood and justified, and if there are appropriate checkpoints for human oversight. For example, in a hiring algorithm scenario, one might decide that the algorithm’s recommendations are reviewed by a human recruiter rather than used blindly. Or one might implement bias mitigation strategies during data preparation (like re-sampling data to balance across demographic groups) or add constraints during model training (like equalizing error rates across groups). Documentation of the model’s intended use, performance, and limitations is also part of responsible deployment (there’s a concept of “model cards” in ML to summarize these attributes). These considerations ensure that machine learning serves human goals and values, rather than undermining them. It’s a reminder that an accurate model is not automatically a good model if it operates in a socially harmful way.
Looking at this pipeline, one lesson shines through: machine learning is a process, not just a single algorithm or analysis. In traditional statistics, one might have emphasized the final fitted model and its properties (p-values, residual diagnostics, etc.), often using the same dataset for everything (fitting and evaluation). In ML, the credibility of a model comes from this rigorous process of training and testing on separate data, and from iteratively improving through validation. It’s a more empirical, experiment-driven approach.
Let’s highlight a few key improvements that the ML approach provides (these are things we saw hints of with KNN and discussed earlier, but now we’ll make them explicit):
Out-of-Sample Validation is King: In the previous chapter, we noted the danger of overfitting and the folly of judging a model solely by how well it fits the training data. Machine learning’s insistence on evaluating on unseen data is a paradigm shift from earlier practices. For example, decades ago, one might publish a regression model that had an \(R^2\) of 0.9 on the dataset at hand and claim it’s great—only to later find it performs terribly on new data. The ML community learned (often the hard way) that what matters is predictive performance on new data. That’s why we do train/test splits and cross-validation. One famous statement summarizing this is: “If your training set performance is much better than your test set performance, you’re overfitting.” The discipline of using hold-out validation has been called a “gigantic improvement” over the old ways, because it simulates the model’s behavior on future data while you still have the chance to adjust things. Essentially, it injects a dose of reality into model assessment.
Bias–Variance Trade-off and Model Complexity: Every model has some level of flexibility or complexity. For KNN it’s \(K\), for a polynomial regression it’s the degree of the polynomial, for a decision tree it’s how deep it can grow, for a neural net it’s the number of layers and neurons, etc. Machine learning teaches us to tune complexity to the data via techniques like cross-validation. If a model is too simple (high bias), it won’t capture the signal; if it’s too complex (high variance), it will capture noise. For instance, a 1-neighbor classifier is high-variance; a 101-neighbor classifier might be high-bias. The optimal complexity yields the lowest generalization error. This perspective helps us avoid both underfitting and overfitting. It’s interesting to note that traditional statistical techniques often didn’t provide an easy way to adjust complexity—linear regression has a fixed form (though one could add polynomial terms, etc.), whereas ML algorithms often come with tunable complexity. Additionally, ML popularized methods like regularization (e.g., Ridge or Lasso regression) which explicitly add a penalty for complexity to combat overfitting (these methods were developed in statistics, but became mainstream through machine learning practice). The concept of balancing model complexity with available data – the bias-variance trade-off – is now central to how we think about model development.
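For the regression setting with squared-error loss, this trade-off has a standard algebraic form (stated here for context; the classification story is analogous but messier). The expected prediction error at a point \(x_0\) decomposes as

\[
\mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\big)^2}_{\text{bias}^2}
+ \underbrace{\operatorname{Var}\big[\hat{f}(x_0)\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}.
\]

Increasing \(K\) in KNN shrinks the variance term (more neighbors are averaged) but inflates the bias term (the neighborhood stretches farther from \(x_0\)); validation is how we locate the value of \(K\) that minimizes the sum.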
Many Algorithms, Unified Approach: In the ML mindset, whether you use a linear model or a more complex model, you still follow the same workflow of training/validation/testing. This unification is powerful. We treat algorithms as swappable tools and empirically test which one works better for our problem. For example, we might try logistic regression, KNN, and a decision tree on the stock data, all under the same protocol of cross-validated tuning and test set evaluation. We don’t have to commit to one upfront; we can compare models objectively on their validation or test performance. This also means that classical statistical models are not thrown away; rather, they’re often the baseline in ML experiments. In fact, a well-tuned logistic regression or Naive Bayes classifier can sometimes beat more complex models if the data is limited or the true pattern is linear. The difference is that in ML we evaluate logistic regression by its predictive accuracy on test data, not by the fact that its coefficients are significant on the training data. Thus, ML is in some ways a repackaging of statistical modeling with a stronger emphasis on prediction and empirical performance. For instance, you can take an ordinary least squares linear regression and run it within a cross-validation loop to select features or polynomial degree – turning what was once a mostly analytical exercise into a data-driven one. Modern ML libraries make it easy to plug in different algorithms. In R, the tidymodels framework (Kuhn & Wickham, 2020) provides a unified interface to dozens of algorithms, as does the older caret package. In Python, scikit-learn’s estimator API does the same. This abstraction encourages treating models as components in the pipeline that you can optimize empirically.
From Hand-Crafting to Automation: Traditional modeling often relied on the analyst to do a lot of the work (deciding which variables to include, what transformations to apply, based on theory or intuition). Machine learning, especially with techniques like decision trees or deep learning, shifted the paradigm to let algorithms automatically discover patterns, interactions, and non-linear relationships in the data. This automation of feature discovery means ML can uncover complex signals that humans might miss or might not think to test. For example, a decision tree might find that a certain combination of features within specific ranges is highly predictive – something that would be hard to guess manually. Of course, automation can also latch onto spurious patterns (again, why validation is needed), but when done right, it has led to breakthroughs in fields like computer vision and natural language processing where manually crafted features were quickly outclassed by learned features. Leo Breiman, in a famous 2001 paper, argued that statisticians should embrace this more algorithmic, predictive approach rather than relying solely on data models. That sentiment in many ways forecasted the rise of ML methods in all sorts of applications.
Scalability and Big Data: Machine learning techniques are designed to handle large datasets and often improve with more data. In statistics, having too many data points could be a burden for computational or theoretical reasons. In ML, more data is usually a benefit (the mantra “there’s no data like more data”). Algorithms like stochastic gradient descent and distributed computing frameworks (Hadoop, Spark, etc.) were developed to train models on datasets with millions of examples or features. With more data, even simpler models can perform better because they get to see a wider variety of scenarios and can average out noise. For instance, a nearest-neighbor classifier with a very large and diverse training set can be extremely powerful, because for almost any new input it can find a close example in its memory. A striking illustration of this was the ImageNet competition in 2012: a deep neural network trained on millions of images achieved a dramatic leap in accuracy for image recognition, a success that hinged on both the algorithm and the sheer scale of data available (around 1.2 million labeled images for training). Today’s ML is inseparable from “big data” – they fuel each other.
Continuous Improvement and Iteration: In ML practice, it’s rare to get the best model on the first try. You typically iterate: maybe your first model’s performance is not good enough, so you try adding features, or you realize some data cleaning is needed, or you try a more sophisticated model. This iterative refinement is encouraged by the framework of validation: you can keep improving as long as you evaluate each change on new validation splits (or use techniques like cross-validation properly without leaking test info). This is different from a one-and-done analysis – it’s more experimental. Data scientists often train dozens of models in the process of finding one that is suitable. Tools (and even competitions like Kaggle) have exemplified this trial-and-error approach, where the best solution often comes from many small tweaks and experiments guided by feedback from validation scores.
Better Guardrails Against Self-Deception: Classical modeling could inadvertently overfit due to analyst degrees of freedom (for instance, including many predictors and selecting those with low p-values after the fact, or tweaking the model until the fit looked good). The ML approach, with a strict train/test separation, means you detect when you’re overfitting because your test performance will be poor relative to train. It imposes a kind of discipline: you can’t fool yourself that your model is great just because it fits your existing data; ML forces you to confront how it does on new data. Thus, it reduces the optimism bias in model assessment (bias here in the statistical sense, not the ethical sense). Techniques like cross-validation also help use data efficiently and avoid the luck of a single split. However, ML can also ingest biases present in data if not carefully addressed (which we will talk about soon). The point is that, methodologically, ML improved our ability to gauge true performance, which is a huge win for the reliability of models.
Now, while ML has these advantages, it also introduced new challenges. Complex models like ensemble methods or deep neural networks can be harder to interpret than a simple linear regression or a decision tree. This is sometimes acceptable if prediction accuracy is paramount (like a recommendation system or a speech recognizer), but in other cases (like deciding who gets a loan or parole), stakeholders might require explanations. So, there is often a trade-off between maximizing accuracy and maintaining interpretability. Another challenge is that automation can inadvertently perpetuate biases present in the training data (a theme we’ll explore in the case studies below).
It’s also worth noting that in many practical projects, a combination of approaches works best. One might start with a simple model to establish a baseline and gain insight (because simpler models are easier to debug and interpret), and then move to more complex ML models to try to boost performance. Or one might use domain knowledge to engineer better features and then let the ML algorithm determine the best way to weight them. So ML doesn’t remove the need for human insight; it complements it with computational power and rigorous validation.
To conclude this section, let’s distill a few big lessons:
- Always validate on data the model hasn’t seen. This single lesson cannot be emphasized enough. It’s the guardrail against fooling yourself. In practice, if someone reports a model’s performance, the first question savvy people ask is “Was that on a held-out test set?” If not, be skeptical.
- Use cross-validation to tune and compare models. This makes the most of your data and gives a robust sense of how models will perform on average. Cross-validation is generally more reliable than a single train/test split, especially with limited data, and helps in model selection.
- Be aware of the bias–variance trade-off. Too simple is bad (underfit), too complex is bad (overfit). Use techniques like regularization or ensemble averaging to mitigate variance, and use feature engineering or more complex models to mitigate bias – all guided by validation feedback. As one source put it, “the price to pay for achieving low bias is high variance”, so we seek a balance.
- Data preparation is part of modeling. How you treat missing data, outliers, encode categorical variables, and scale features can dramatically affect model performance. In our KNN, if we hadn’t scaled features, results could differ. If we had a categorical feature (not in this stock example, but say “Day of Week”), we’d need to encode it (e.g., one-hot encoding) to use it in a distance calculation. Always consider preprocessing as a first-class step, not an afterthought.
- Keep it simple (at first). It’s often wise to start with simpler models (they train faster, are easier to interpret) and only move to more complex ones if needed. Simpler models are also easier to troubleshoot. For example, if a linear regression or decision tree isn’t working at all, maybe your features have no signal or data is corrupt; if a giant neural network isn’t working, it could be many things. Simpler models give quick feedback and set reasonable expectations.
- Leverage existing tools and libraries. You typically don’t write algorithms from scratch in practice (unless developing new methods). Instead, use well-tested libraries (tidymodels, scikit-learn, TensorFlow, etc.), which also often provide high-level functionalities for cross-validation, hyperparameter search, and pipeline building. This removes boilerplate and lets you focus on the problem specifics.
- Think about context and downstream use. A model is only good relative to an application. If the cost of false positives is high (e.g., flagging a legitimate transaction as fraud and annoying a customer), you might favor a model with higher precision at the expense of some recall. If missing a positive is critical (e.g., a cancer screening test), you’ll aim for high recall and can tolerate more false alarms, perhaps with a human review step. Thus, the “best” model isn’t just the one with highest overall accuracy, but the one that best meets the business or societal objectives. This often means looking at the whole confusion matrix and considering costs/benefits, not just one scalar metric. It may also mean incorporating fairness or transparency requirements as part of what makes a model “best” for the situation.
By approaching problems with this ML process, we typically end up with models that perform better on real-world data than ad-hoc approaches, and we have greater confidence in their ability to generalize. This is why machine learning has largely supplanted older modeling workflows in many fields where prediction is key. However, as we deploy models widely, we encounter a new set of issues: those involving fairness, ethics, and trust, which simple train/test metrics don’t capture. To segue into that, we will now look at some case studies where machine learning (or automated decision systems) were applied in real-world scenarios. These stories illustrate why the lessons we’ve learned must be applied very carefully, and why an understanding of the context and potential unintended consequences is crucial.
In short, machine learning can be immensely powerful, but with great power comes great responsibility. The next section presents real cases of ML in action—some triumphant, some cautionary—to underscore the importance of ethical considerations (which we’ll formally discuss in the following chapter).
7.3 Real-World Case Studies: The Power and Pitfalls of ML
Having learned about the ML workflow and its benefits, it’s time to examine what happens when these techniques are deployed outside the classroom and in society at large. Machine learning is a double-edged sword: it can automate decisions and discover patterns that bring efficiency or new insights, but it can also amplify biases or make mistakes at scale that affect people’s lives. The following case studies illustrate why the seemingly technical lessons from ML also translate into important societal lessons. Each case provides an example of an ML or AI system in a high-stakes setting, highlighting both the promise of the technology and the pitfalls when things go wrong or are not handled carefully. These will set the stage for our next chapter on AI ethics by showing that ethical issues aren’t just theoretical—they emerge naturally from real scenarios.
Machine Learning in Hiring – The Amazon Recruiting Tool and Bias in HR
The Promise: Human Resources (HR) is a domain where many hoped AI could help remove human biases. Résumé screening and initial interviews are labor-intensive and subject to human prejudices or inconsistencies. If we could train a model on past hiring decisions to identify promising candidates, perhaps we could both speed up hiring and even reduce bias (the thinking being that an algorithm might ignore irrelevant details like name or gender and focus on qualifications). Companies receive thousands of applications, so an automated tool that flags the top candidates could be very valuable.
Amazon’s Experiment: In the mid-2010s, Amazon – known for automating and optimizing processes – tried to do exactly this. They developed an experimental recruiting tool that they hoped could evaluate resumes and applications with machine learning. The idea was to give each candidate a score (e.g., from one to five stars, similar to how products are rated on Amazon) indicating how well they fit the profile of successful hires, thus allowing recruiters to prioritize those candidates. The model was trained on 10 years of past hiring data: resumes submitted and the outcomes (whether the person was hired, how they performed, etc.). In essence, it was supposed to learn what factors in a resume or application correlated with a good hire, based on Amazon’s historical hiring decisions.
What Went Wrong: By around 2015, as the team examined the model’s recommendations, they noticed a disturbing pattern: the AI was downgrading resumes that included the word “women’s” (as in “captain of women’s chess club”) and generally giving lower scores to candidates from women’s colleges or with certain women-oriented keywords. In effect, the system had taught itself that male candidates were preferable. Why did this happen? The training data reflected 10 years of Amazon’s hiring, and in the male-dominated tech industry, that data was itself skewed: a majority of applicants (and hires) were men. The AI picked up on correlated signals of maleness and associated those with successful hires. In other words, it absorbed the historical bias present in the company’s hiring practices. As one report succinctly put it, “Amazon’s system taught itself that male candidates were preferable.” The model was not explicitly told to discriminate, but given the patterns in data, it inferred that being male (or having attributes correlated with being male) was a positive indicator for tech hiring. This is an example of algorithmic bias: the algorithm’s output was biased because of biased input data.
Amazon’s engineers, upon discovering this behavior, tried to “de-bias” the model. They adjusted the program to ignore explicit gendered terms like “women’s”. However, this proved insufficient because the underlying patterns were more subtle. The model could find other proxies for gender. For example, if certain all-women colleges appeared in resumes and historically those resumes weren’t hired often (perhaps due to bias or pipeline issues), the model could still learn to discount those. It might also infer from a combination of features (say, sports or activities or even first names) something about gender. Removing a few obvious keywords was like a game of whack-a-mole – the bias would pop up elsewhere because it was systemic in the data. Short of overhauling the entire approach and dataset, there was no easy fix to ensure the model was gender-neutral.
Ultimately, by 2017, Amazon realized they couldn’t trust the system. They disbanded the team and scrapped the project. Amazon said the tool “was never used by Amazon recruiters to evaluate candidates,” implying it never became an official gatekeeper for job applicants (fortunately). Yet, this stands as a cautionary tale: just because a model finds a pattern in historical data doesn’t mean that pattern is fair or desirable to replicate. Here, the model was accurately detecting a pattern (men were hired more often in the past), but reinforcing that pattern would clearly be discriminatory. This is a prime example of how an ML model can inadvertently perpetuate and even amplify biases present in training data (Lambrecht & Tucker, 2019 report a similar phenomenon in online advertising, where an algorithm showed STEM career ads less often to women, even without intentional bias).
Key Takeaways:
- Biased Training Data → Biased Model: The Amazon case shows that “algorithmic AI is only as good as the data it’s trained on.” If the historical data is biased (in this case, skewed against women in tech roles), an ML model will likely reproduce that bias. Importantly, removing the protected attribute (like gender) from the input doesn’t guarantee fairness because the model can use proxy variables. In Amazon’s tool, gender wasn’t an explicit input, but other features served as proxies. This underscores that achieving fairness often requires more than just excluding obvious sensitive features; it may need rethinking the data collection or applying algorithmic fairness techniques.
- Lack of Transparency: Part of the problem with complex models (Amazon’s was reportedly a proprietary ensemble or neural network) is that it’s hard to fully understand why they make certain decisions. It wasn’t immediately obvious why certain resumes were downgraded until patterns like the word “women’s” were spotted. Ensuring explainability in AI decisions is crucial, especially in domains like hiring where decisions deeply affect lives and there are legal implications. As one ML professor noted, when it comes to AI hiring tools, “How to ensure the algorithm is fair, how to make it interpretable and explainable – that’s still quite far off.” In other words, current ML models, if not carefully constrained, can behave in ways even their creators have trouble explaining or predicting (this is a research frontier in ML – explainable AI).
- Correction is Hard: Even after identifying a specific bias and trying to correct it (e.g., telling the algorithm to ignore certain terms), Amazon couldn’t be confident the model was unbiased. This shows that fairness isn’t a one-shot fix; it often requires a systemic approach. Sometimes the best solution is to collect better data (for example, one could retrain on a dataset of resumes with equal representation of genders and successful outcomes, if such data were obtainable). There are also algorithmic ways to impose fairness constraints (for instance, ensure the model’s predictions have equal error rates for men and women), but these techniques were less mature at the time. Amazon’s decision to scrap the project suggests they found it infeasible to guarantee fairness with the approaches they tried. It highlights that bias mitigation in AI is a non-trivial task.
- Human Oversight: Amazon ultimately kept human recruiters in charge. This aligns with a broader lesson: in high-stakes domains, AI should often be used to assist rather than fully replace human decision-makers – at least until we have strong assurances of fairness and accuracy. John Jersin, a VP at LinkedIn (which also uses algorithms in recruiting), said he “would not trust any AI system today to make a hiring decision on its own,” seeing such tools as aids for humans, not replacements. The Amazon case strongly supports that view: AI can sort and highlight resumes, but final decisions (and the judgment of what truly makes a candidate qualified) are perhaps best left with humans who can be accountable and nuanced in their reasoning.
- Bias in Other HR AIs: Amazon’s story is high-profile, but it wasn’t unique. Other companies have developed AI for hiring and faced scrutiny. For instance, the company HireVue offered video interview analysis where an AI would score candidates based on not just their spoken answers but also facial expressions and tone of voice. This raised alarms among AI ethicists: could such a system be inadvertently biased against non-native speakers, or those with certain disabilities, or simply mis-evaluate traits like “enthusiasm” through a narrow technical lens? In Illinois, a law was passed in 2019 requiring employers to inform candidates when AI is used in video interviews and to obtain consent, reflecting the discomfort with “black box” assessments in hiring. The lesson is that social acceptance of AI in sensitive areas like jobs requires transparency and fairness; otherwise, there will be pushback from the public or regulators.
- Academic Evidence of Bias: The issues in AI hiring are also backed by research. Lambrecht and Tucker (2019) found that an algorithm delivering online STEM career ads displayed them less often to women – about 20% fewer women saw the ads than men, even though the ad was intended to be gender-neutral. The culprit was the optimization objective: the platform showed ads in the most cost-effective way, and since women aged 25–34 are a highly sought demographic in advertising (hence more expensive to reach), the algorithm showed the STEM ads (with a limited budget) more to men. This is a case of indirect bias: optimizing one metric (cost-effective delivery) led to a demographic disparity. It illustrates how even without intent, AI can yield discriminatory outcomes due to correlations and market dynamics. A toy simulation just after this list shows how a budget-constrained, cost-minimizing delivery rule can produce exactly this kind of disparity.
- The “Coded Gaze”: AI researcher Joy Buolamwini coined the term “the coded gaze” to describe how the biases of creators and datasets are reflected in AI systems. In her words, it’s the algorithmic bias that leads to exclusion and discrimination, essentially the priorities and prejudices of those who shape the technology, embedded in code. The Amazon tool reflected a coded gaze that valued male candidates more, because it was created from data and practices that did so. Buolamwini’s work in evaluating commercial facial recognition has similarly shown racial and gender biases (e.g., higher error rates for dark-skinned women). The takeaway is that we must be vigilant that AI does not unintentionally perpetuate societal biases under the guise of objectivity. Being aware of the “coded gaze” means actively checking and correcting our models for bias.
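To see how this kind of indirect bias can arise mechanically, here is a toy R simulation. It is emphatically not Lambrecht and Tucker’s actual model: the cost ranges, budget, and audience size below are invented purely for illustration, and the only thing the code optimizes is cost.

```r
# Toy simulation of indirect bias in ad delivery (not Lambrecht & Tucker's model).
# Assumption for illustration: impressions shown to women cost more than those
# shown to men, and the platform greedily buys the cheapest impressions it can
# afford until a fixed budget is exhausted.
set.seed(42)
n_slots <- 10000
gender  <- sample(c("female", "male"), n_slots, replace = TRUE)
cost    <- ifelse(gender == "female",
                  runif(n_slots, 0.8, 1.2),   # pricier demographic to reach
                  runif(n_slots, 0.4, 0.8))   # cheaper demographic to reach
audience <- data.frame(gender, cost)

budget <- 2000
# Cost-minimizing delivery: buy impressions from cheapest to most expensive.
ordered <- audience[order(audience$cost), ]
bought  <- ordered[cumsum(ordered$cost) <= budget, ]

table(bought$gender)
# The ad is delivered overwhelmingly to men, even though gender never
# appears in the objective; only cost does.
```

Nothing in the objective mentions gender, yet the outcome is heavily skewed: the disparity rides in on the cost structure of the ad market, which is precisely the mechanism the study identified.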
In summary, ML in hiring, as seen with Amazon’s experiment, demonstrates both the power of AI (scanning and learning from thousands of resumes) and its pitfalls (codifying past biases). The lesson is not that “ML is bad” for hiring, but that it must be approached with great care: carefully curated training data, bias testing, interpretability, and human oversight are all needed if such tools are to be used responsibly. Many companies have since taken these lessons to heart, conducting “bias audits” of their hiring algorithms (Raghavan et al., 2020) and restricting AI to less sensitive parts of the pipeline (such as scheduling interviews or sourcing candidates, rather than final hiring decisions). This case underlines why understanding the context and ethics around ML is just as important as understanding how to get a high cross-validation accuracy.
Algorithms in Education – The 2020 UK Exam Grading Fiasco
The Scenario: In 2020, the COVID-19 pandemic disrupted schools and exams worldwide. In the UK, A-level exams (taken by students typically at age 18, crucial for university admissions) were canceled for safety reasons. But universities still needed grades to decide on admissions. The authorities in England decided to use an algorithmic approach to moderate or determine students’ grades, since the usual exams couldn’t take place. Teachers were asked to submit predicted grades for each student (called “Centre Assessed Grades” or CAGs) and to rank their students within each subject. The exam regulator Ofqual then designed a statistical model to standardize these grades across schools and maintain overall consistency with prior years.
The Intention: The main goal of the algorithm was to prevent grade inflation and ensure fairness across schools. If every teacher’s optimistic predictions were used, the fear was that 2020 would see a huge jump in top grades (because teachers, in doubt, might err on the side of generosity). That could be seen as unfair to past or future cohorts and could overwhelm universities (which have limited seats). So the algorithm’s job was to standardize results: to adjust the teacher-submitted grades in line with historical distributions. Essentially, if a school historically only had a certain percentage of A’s, the algorithm would cap the number of A’s this year around that number, and similarly for other grades, regardless of what the teachers predicted.
How the Algorithm Worked (in brief): It took into account several factors:
- Each school’s historical grade distribution in each subject (how many A’s, B’s, C’s, and so on it usually awarded in, say, Chemistry at School X).
- The teachers’ rank order of students and their initially predicted grades.
- The overall national distribution of grades expected (so that roughly the same proportion of students get each grade as in a normal year).
- Some incorporation of a student’s past performance (e.g., GCSE results at age 16) as an indicator.
In practice, for large classes, the algorithm basically said: “If this school usually has, for example, 5% A’s in this subject, and they have 100 students this year, we’ll only allow about 5 students to get an A, the next certain percent to get B, etc., according to the historical profile.” It would then allocate those grades to the top-ranked students in the teacher’s list. For small classes (below a certain size, like 15 students), the algorithm relied more on teacher predictions (because statistical moderation is less reliable with small numbers).
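To make that allocation step concrete, here is a minimal R sketch. It is a deliberate simplification of what Ofqual actually did (the real model also drew on prior attainment and handled small classes differently, as noted above), and the function and grade distribution below are hypothetical.

```r
# A minimal, hypothetical sketch of the standardization step described above.
# Assumptions: we use only a school's historical grade proportions and the
# teacher's rank order; the real Ofqual model had more components.
allocate_grades <- function(ranked_students, historical_props) {
  n <- length(ranked_students)
  # Turn historical proportions (best grade first) into counts for this cohort.
  counts <- round(historical_props * n)
  counts[length(counts)] <- n - sum(counts[-length(counts)])  # absorb rounding error
  # Hand out grades down the teacher's ranking: top-ranked students get the top grades.
  data.frame(
    student = ranked_students,
    grade   = rep(names(historical_props), times = counts)
  )
}

# Example: a school that historically awards 5% A*, 20% A, 40% B, 25% C, 10% D.
hist_props <- c("A*" = 0.05, "A" = 0.20, "B" = 0.40, "C" = 0.25, "D" = 0.10)
ranked     <- paste0("student_", 1:100)   # teacher's rank order, best first
head(allocate_grades(ranked, hist_props), 10)
# However strong student_6 is individually, they cannot receive an A*,
# because the school's history only "allows" five of them.
```

Written out like this, the core design choice is easy to see: the binding constraint is the school’s past distribution, not anything about the individual student.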
The Outcome: When the algorithmically determined grades were released (August 2020), chaos ensued. About 40% of students saw their teacher-predicted grades lowered by at least one grade by the algorithm. For example, a student predicted to get a B might have received a C or D after standardization. Many students – and their teachers and parents – were shocked and outraged. There were numerous stories of top-performing students being downgraded severely because of their school’s past results. For instance, if an exceptional student attended a historically low-performing school, the algorithm might have capped them from getting an A* (the highest grade) because no one from that school ever got an A* before. On the flip side, students at historically high-performing (often private) schools sometimes kept the high grades their teachers gave or even got bumped up if a teacher was pessimistic.
Protests erupted across the country, with students gathering outside the Department of Education and in city centers holding signs like “#FuckTheAlgorithm” (which bluntly captured the public sentiment). The uproar was not just from those downgraded; it was a broad backlash against what was seen as an unfair and opaque system. Within days, the government made a U-turn: they scrapped the algorithm’s results entirely and announced that students would receive the original teacher-assessed grades (or could opt to take a later exam). The algorithmic grades were essentially thrown out due to the public outcry and perceived injustice.
Let’s analyze what went wrong in this case:
- Collective Fairness vs Individual Fairness: The algorithm was arguably fair in aggregate – it made sure the overall distribution of grades and each school’s outcomes looked like a normal year, preventing unusually high averages. However, it was unfair to individuals. A hard-working student who might have genuinely achieved top marks was downgraded simply because of their school’s past performance. The algorithm treated students as statistics, not as individuals. One commentary noted, “Fair for the group can be shockingly unfair for the individual.” The U.K. algorithm prioritized what you might call procedural fairness (everyone gets processed by the same formula) over substantive fairness (each student gets what they personally earned or deserved). This sparked an ethical debate: what does it mean to be fair in education? Students felt (rightly) that they should be judged on their own merits, not the historical average of their school. Indeed, analysis showed that among the highest-achieving students, those from smaller classes (often independent schools) were less likely to be downgraded, whereas those in large state college cohorts were heavily standardized down. The net effect was that existing inequalities were amplified by the algorithm’s design.
- Biases and Inequities: The data showed that the proportion of top grades (A*/A) awarded to students at independent (private) schools increased by 4.7 percentage points in 2020 – more than double the increase seen at state comprehensive schools. In other words, private school students benefited most from the system. Why? Two reasons: many private schools have smaller class sizes (where the algorithm deferred to teacher grades more often), and their historical performance was already high (so the algorithm didn’t need to adjust much, and could even nudge grades upward where teachers had been relatively pessimistic). Conversely, state schools and particularly sixth-form colleges (larger institutions) saw much smaller increases (or even decreases) in top grades. Disadvantaged students were hit hardest: one analysis showed that high-achieving students from lower socio-economic backgrounds were far more likely to be downgraded than those from affluent backgrounds. The algorithm effectively baked in socio-economic and regional disparities – areas or schools that historically underperformed (often due to lack of resources or other disadvantages) had their students’ grades pulled down, even if those particular cohorts might have done better. This clearly violates the principle of avoiding discrimination by characteristics such as socio-economic status. It caused a lot of anger because it felt like the system was stacked against students who were already in tougher circumstances, through no fault of their own.
- Transparency and Trust: Initially, the details of how grades were calculated were not well explained to the public. Students just saw that their result didn’t match what their teachers predicted, and there was confusion. Ofqual had published a technical report, but it was dense, and individual students couldn’t easily trace how their grade was decided. The lack of transparency violated a key AI ethics principle: people have a right to an explanation for decisions affecting them. The hashtag #FuckTheAlgorithm captured how people felt treated by an unfeeling black box. Even though Ofqual is an authority, the algorithm made it seem like a machine was arbitrarily messing with lives. This episode shows that when algorithms make personal decisions, the opacity can lead to mistrust and even public fury.
- Feedback and Appeal: The standard process to appeal a grade was not suited to an algorithmic system of this scale. Originally, appeals could be made if there was evidence of a data error or if a student’s mock exam was higher, etc., but not simply because “I think I would have done better.” And you usually had to pay a fee to appeal. With almost 40% of grades altered, this was untenable. The system didn’t have a good individual redress mechanism. Any automated decision system in high stakes should ideally have a human-in-the-loop appeal process where a case can be reviewed on its own merits. In the absence of a practical appeals process, the only option was to scrap the whole thing. This is a lesson in accountability: if an algorithm makes a mistake or an unfair outcome, how can it be corrected? In this case, the answer was only by a blanket policy reversal.
- Context and Constraints: It’s important to acknowledge that 2020 was an emergency, and the authorities were under pressure to deliver results quickly and fairly. The algorithm was built in a few months. One could argue the intentions were reasonable – avoiding grade inflation and trying to maintain standards. However, this context doesn’t excuse the oversight of fairness to individuals. If more educational experts, statisticians, and even student representatives had been involved in testing the approach, they might have foreseen these issues. In fact, in Scotland a similar standardization had been attempted and then rolled back a week earlier after backlash, which should have been a warning. This case underscores that policy algorithms need interdisciplinary oversight – not just data scientists and officials, but ethicists and domain experts who can examine assumptions. For instance, the algorithm assumed a school’s past performance is a good predictor of every future student’s performance, which is not necessarily true, especially for outliers. It also treated a teacher’s ranking as sacrosanct (which introduced its own issues, as some teachers might rank inconsistently).
- Systemic Bias vs Individual Merit: The fundamental issue here was the algorithm prioritized systemic consistency (no grade inflation, each school gets roughly the same results as before) at the expense of individual justice. This raised a philosophical question: should an algorithm strive to correct systemic biases and uplift students who beat the odds, or just mirror the past? The Ofqual algorithm chose the latter. As one academic analysis noted, it essentially encoded “fairness = maintaining historical standards,” which in a society with educational inequality means “fairness = repeating existing inequalities”. That is a flawed definition of fairness if your goal is equality of opportunity.
The end result was a policy failure. The government’s U-turn meant the feared grade inflation happened (teachers’ grades were on average higher than previous years), but given the extraordinary circumstances, that was deemed preferable to the unfairness of the algorithmic approach. The fiasco became a case study in AI ethics classes and discussions about algorithmic governance. It demonstrated that even relatively simple algorithms (this was essentially a statistical moderation, not a complex ML model) can have pernicious effects if they’re not aligned with social values.
Lessons Learned:
- Involve Ethics and Stakeholders Early: If Ofqual had consulted more with educators, students, and ethicists while designing the system, they might have balanced standardization with other notions of fairness. For example, they might have allowed more flexibility for students with exceptional performance relative to their school, or at least prepared an appeals route for them. This speaks to value alignment – the algorithm should have aligned with the value that each student deserves a fair chance.
- Beware of Proxy Discrimination: Using a school’s average as a proxy for a student’s ability is inherently discriminatory against those in poorer schools. This is analogous to redlining in banking (denying loans to people from certain neighborhoods). It shows how algorithms can codify structural biases (here, the class-based disparities in education). Regulators and designers must check for disparate impacts. In the UK case, the impact on disadvantaged groups was evident in the data (for example, students eligible for free school meals), and analyzing that impact before results were released could have signaled a big problem. Going forward, fairness assessments (for instance, removing the direct school effect, or capping how far a grade can be moved down) would be crucial.
- Agility in Policy: When the problem became apparent, the willingness to scrap the algorithm was actually a good thing – it showed that human judgment and democratic accountability stepped in to override a failing algorithm. Some argued that the hashtag activism and protests functioned as a “human in the loop” in a broad sense, forcing a course correction. It’s a reminder that while algorithms can be powerful, they operate within human society and we can choose to override them if they conflict with our principles.
- Documentation and Explanation: If an algorithm must be used, it should come with clear documentation about how it works and why, accessible to the people affected. Ofqual’s technical report wasn’t user-friendly. Better communication might not have fixed the injustice, but it could have lessened confusion and allowed for informed debate. In algorithmic systems, transparency (to the extent possible without enabling gaming the system) is key to trust.
- Future Guardrails: This incident led to discussions in the UK about when and how to use algorithms in governance. There were calls for impact assessments and bias audits for any such algorithm, and for citizens to have a say in their design. Essentially, it put algorithmic accountability on the public agenda. It also underscored that sometimes simpler or more human approaches might be better. In 2021, when exams were again cancelled, teacher-assessed grades were used with school-level quality checks, and nothing as harsh as the 2020 standardization was attempted.
The UK grading fiasco is a vivid example of an algorithm colliding with public values. It shows that technical accuracy (in terms of matching historical patterns) is not the same as social acceptability or justice. When deploying ML or statistical models in social domains, one must consider the various definitions of “fairness” and choose carefully – or risk a public revolt. The episode ultimately reinforces our earlier point: an ML or AI system’s success is not just measured in predictive accuracy, but in how well it aligns with human expectations of fairness and how well it can be integrated into the social context. This is precisely why AI ethics has become a crucial field, which we will delve into in the next chapter.
Customer-Facing AI – Google Duplex and the Ethics of Deception
Now let’s switch to a very different kind of case – one about human-AI interaction and deception. Google Duplex is an AI system announced by Google in 2018 as an extension of their Google Assistant. Duplex’s purpose is to make phone calls on behalf of the user to accomplish specific tasks like booking a restaurant reservation or a hair salon appointment, all via natural conversation.
The Dazzling Demo: At Google’s I/O 2018 developer conference, CEO Sundar Pichai demoed Duplex to the world. He played back phone call recordings of Duplex in action. In one call, the AI scheduled a hair salon appointment, in another it attempted to reserve a table at a restaurant. What astonished people was how human-like the AI sounded. It used speech disfluencies like “um” and “mm-hmm” exactly as a human would. It responded to questions and even misunderstandings fluidly. In the restaurant call, when told that reservations weren’t needed for small parties, the AI seamlessly shifted to asking about wait times – eliciting a laugh from the audience at how naturally it navigated the conversation. The people on the other end of the line did not seem to realize they were speaking to a machine. Pichai hailed this as a huge leap in AI’s ability to understand and speak, and positioned it as a feature that would save users time (no more calling yourself – just tell the Assistant to do it).
The initial reaction was awe at the technical achievement. Google had essentially passed a mini Turing Test in a narrow domain – the human listeners were fooled. But almost immediately, ethicists and commentators raised an alarm: the people receiving those calls were not informed that they were talking to an AI. Google had made an AI that can deceive humans into thinking it’s human, which crosses an ethical line for many.
The Ethical Outrage: There was widespread concern that Duplex was demonstrating a form of deceptive AI. A chorus of voices (in articles, social media, etc.) asked: Is it ethical to have AI impersonate a human without disclosure? Many felt it was not. One scholar, Dr. Thomas King of the Oxford Internet Institute, commented that Google’s experiment “appears to have been designed to deceive” – noting that Google’s own framing was like a test of whether people could tell, which means deception was a goal. King argued that instead of asking “Can the AI fool someone into thinking it’s human?” they could have focused on usability without deception. If the aim was just to make a convenient assistant, the AI could say up front, “Hi, I’m an automated assistant making a booking for a client.” That would still get the job done without trickery. By not doing so, Google seemed to prioritize the wow factor over transparency.
Several issues were identified:
- Transparency & Consent: The recipients of the calls did not consent to interact with an AI. They weren’t even aware. This violates a basic notion of informed consent in interaction. If you’re talking to a computer, you should have the right to know. Deceiving people, even in a low-stakes context like a salon appointment, was seen as disrespectful and potentially harmful. It could make people feel manipulated or betrayed when they find out.
- Precedent for Abuse: If Google normalized AI that can pass as human, malicious actors could use similar tech for scams. People imagined robocalls that sound human tricking people into giving up information or money. There’s already a scourge of phone scams; human-sounding AI could turbocharge that problem. Even beyond crime, it could lead to a general erosion of trust in phone communication – if you can’t be sure whether a caller is real, you might become more guarded or even rude (as a defense).
- Guidelines Ignored: AI ethics guidelines long existed that caution against exactly this. For example, the IEEE’s Ethically Aligned Design (a 2017 framework) and the British Standard BS 8611:2016 on robot ethics both emphasize transparency. The BSI standard explicitly says: “Avoid deception due to the behavior and/or appearance of the robot and ensure transparency of robotic nature.” It warns that deception can erode trust in technology. It even advises against excessive anthropomorphism unless absolutely necessary. Google either overlooked these guidelines or chose to ignore them for the sake of a flashy demo. Critics pointed out that Google, as a leader in AI, should be setting ethical standards, not flouting them for applause.
- Public Backlash: Indeed, the media and public reaction was negative enough that Google quickly announced changes. Within a couple of months, Google said that Duplex calls would identify themselves at the start of the call. The assistant would say, roughly, “Hi, I’m Google’s automated booking service, calling for [Name].” This was a direct response to the ethical backlash. It was a win for the argument that AI systems should self-identify when interacting with humans. In essence, the public forced Google to align with the principle of transparency. This incident set an important precedent: it signaled to all companies that if you deploy a human-mimicking AI, you better disclose it, or face public relations and possibly regulatory consequences.
- Human Interaction Norms: As King noted, “who or what you’re interacting with shapes how we interact”. We have different social expectations with humans versus machines. If that line blurs, it can “sow mistrust in all kinds of interactions”. For instance, if you get accustomed to the idea you might be speaking to a bot, you might treat unknown callers with suspicion or even hostility (“Is this really a person? Prove it!”). This could degrade everyday social trust. Society operates on a baseline of trust in communication; deceptive AI threatens that.
- Missed Creative Opportunity: Some commentators, like the TechCrunch author Natasha Lomas, argued that Google showed a lack of imagination by going the deception route. They could have designed Duplex to be clearly synthetic yet charming or efficient in its own way – for example, using a pleasant but obviously computerized voice or a scripted intro that sets context. People can be very receptive to non-human agents if they’re presented honestly (think of how we accept cartoon characters or voice assistants like Siri that don’t hide being software). By chasing realism for its own sake, Google neglected a richer design space of human-AI interaction that could be novel and engaging without pretending to be human. This critique is about creativity in AI design – we shouldn’t just aim to mimic humans exactly; sometimes a new form that signals “I am AI” but is still helpful could be better.
Google Duplex did roll out later for a limited set of tasks and regions, and indeed when it did, it included the required disclosure. The furor died down after Google’s change, but the event left a lasting impression in AI ethics discussions. It prompted many to call for laws requiring bots to self-identify. In California, for instance, a “bot transparency” law (passed in 2018, taking effect in 2019) already requires that bots interacting commercially or politically with people disclose they are not human. Duplex arguably would have run afoul of that if it hadn’t been tweaked.
Wider Implications: This case touches on the Turing Test idea – traditionally a measure of AI advancement, but here we see that passing as human isn’t an unequivocal win outside of a laboratory game. It can be a failure of ethical design if done without consent. It reframed the Turing Test: maybe the goal isn’t to trick people, but to collaborate with people. Many AI researchers now emphasize user experience and trust over pure imitation.
Additionally, Duplex foreshadowed a now exploding concern: deepfakes and voice clones (which we’ll address next). If an AI can sound like a generic human, what about sounding like a specific person? We are now grappling with AI-generated voices and videos that can impersonate individuals. Duplex was benign in content, but it showed the technical capability. It underscored that society needs norms and perhaps regulations around AI impersonation and manipulation. After Duplex, AI companies became a bit more cautious. For example, Amazon’s Alexa team and others have studied adding subtle signals (like a brief tone) to indicate a voice is synthetic. There’s also research on making synthesized speech watermarked or detectable as AI by certain methods, to prevent misuse.
For Google, Duplex was a learning moment that likely influenced their later AI principles (in 2018, Google released a set of AI Principles, including that AI should be accountable to people, uphold high standards of privacy, and be socially beneficial). One of those principles is to avoid creating AI that contravenes widely accepted principles of ethics – arguably Duplex v1 did, by violating the norm of truthfulness in interaction.
To boil it down, the Duplex case study teaches:
- Just because we can make AI act indistinguishably from humans doesn’t mean we should in production use. Authenticity and honesty are important in human-AI interactions.
- Ethics can’t be an afterthought. It must be considered at design time. Google had to retrofit ethics due to backlash, which is suboptimal compared to building it in from the start.
- Society will likely reject AI that tries to deceive. People generally don’t like being duped, even by something as trivial as a booking call. Trust in AI is fragile; design choices that preserve trust (like transparency) are crucial for long-term adoption.
- There’s a difference between a clever demo and a socially acceptable product. The Duplex demo wowed engineers but worried the public. Tech companies learned that they have to evaluate not just “Will it work?” but “How will people feel about this technology being used on them or around them?”
- Regulation and standards are inevitable in this domain. The fact that standards like BS 8611 explicitly addressed this means experts foresaw it. If companies don’t self-regulate to abide by such norms, government regulation will step in. In the EU, for instance, the draft AI Act considers misleading behavior by AI as a risk to be mitigated. So future AI systems might legally have to disclose themselves.
Duplex, in essence, was a story of misalignment: the system performed as intended technically, but misaligned with societal expectations of honesty. Correcting that misalignment (by adding disclosure) fortunately was easy in this case. In others, it might not be so straightforward, which is why forethought is critical.
Deepfakes – When Seeing (and Hearing) Isn’t Believing
Our final case study is about deepfakes, which ties together themes of deception, trust, and the potential large-scale impact of AI. Deepfakes are AI-generated media – typically video or audio – that convincingly mimics real people. The term comes from a Reddit user “deepfakes” who around 2017 posted porn videos with celebrity faces swapped in using deep learning. Now it refers broadly to synthetic media where someone’s likeness is realistically replaced or altered.
How Deepfakes Work: Deepfakes often use techniques like generative adversarial networks (GANs) or autoencoders. To make a deepfake of a person, you need a bunch of footage or images of them (to learn their face or voice). For a video deepfake, one approach is to train a model to map one person’s facial movements to the target person’s face. Essentially, the model learns to generate frames of the target’s face in various expressions and angles. Then given a source video of another actor doing or saying something, the model can synthesize the target’s face doing the exact same movements, often blending it seamlessly onto the source’s body. For audio, similar principles apply: a model can learn a person’s voice characteristics and then generate new speech (text-to-speech in their voice). Early deepfakes were rough (glitchy artifacts, unstable faces), but they have improved dramatically in a short time. Now, with enough data and proper techniques, it’s possible to create fake videos that many viewers would not immediately recognize as fake.
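For readers who want the adversarial idea in symbols, the standard GAN training objective pits a generator \(G\) (which turns random noise \(z\) into synthetic samples) against a discriminator \(D\) (which scores how likely a sample is to be real):

\[
\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
\]

The discriminator is rewarded for telling real from fake, and the generator is rewarded for fooling it; face-swapping systems typically combine this adversarial pressure with autoencoder-style reconstruction losses, which is part of why the fakes keep improving as the detectors do.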
Positive or Harmless Uses: Not all deepfakes are malicious. For example, filmmakers have used deepfake-like tech to de-age actors or resurrect deceased ones for cameos (with permission). There are apps that let users put themselves into movie clips for fun, or to swap faces in GIFs. In art, some use deepfakes for satire or creative expression (e.g., making political figures sing a song as a parody, where it’s obviously not real but humorous). Another potential use is privacy: researchers have proposed using deepfake tech to anonymize people in videos (changing their face to a synthetic one) while preserving expressions – so you could release footage for analysis without revealing identities. These benign uses show the tech itself is dual-use.
However, the risks and harms have garnered the most attention:
- Non-consensual Pornography: This was the first big alarm. People (mainly on shady forums) used deepfakes to put women’s faces (often celebrities, but also sometimes acquaintances or ex-partners) onto pornographic videos. This is a severe violation of privacy and can be a form of harassment or revenge. Victims can suffer emotional distress, reputational damage, and it’s a form of sexual exploitation. By 2019, studies found that an overwhelming majority of deepfake videos online were pornographic and nearly 100% of those were of women who did not consent. This is a stark example of AI being weaponized against women (Chesney & Citron (2019) note women often are the early victims of such tech, likening them to “canaries in the coal mine”). It’s very difficult to get such videos taken down across the internet, making it a persistent harm.
- Political Misinformation: Deepfakes raise the possibility of fake news on steroids: a video of a politician or public figure saying or doing something they never did. This could be used to sway elections, incite conflict, or sabotage reputations. For a while, this was more theoretical because making a truly convincing deepfake was hard and time-consuming. But the worry was always “what happens when someone releases a realistic fake at a critical moment?” In March 2022, during Russia’s invasion of Ukraine, we saw a real instance: a deepfake video of Ukrainian President Volodymyr Zelenskyy was circulated, in which he appeared to surrender and tell Ukrainian troops to lay down arms. The video was not very convincing to a discerning eye – the face looked off (skin tone mismatched, etc.), the voice had an accent, and it was quickly labeled fake. TV stations and social media removed it within hours. Zelenskyy himself put out a statement calling it a “childish provocation”. While that particular attempt failed (and indeed was ridiculed), it illustrated the threat. Experts said this could be the tip of the iceberg, and that more sophisticated deepfakes could be used in the future. Imagine one that isn’t so obviously bad – it could cause confusion at least for a while. Even a few hours of uncertainty can be dangerous during a crisis (people might panic or make decisions based on false info).
- The “Liar’s Dividend”: This term, coined by Chesney and Citron (2019), describes an indirect harm of deepfakes. The idea is that the knowledge that deepfakes exist can be exploited by dishonest people to deny reality. If a compromising real video surfaces, the person in it can claim “It’s a deepfake!” and some portion of the public might doubt the authenticity of the real evidence. Essentially, liars get a “get out of jail free” card by leveraging the mere possibility of deepfakes. We’ve already seen hints of this: when the Access Hollywood tape (of Donald Trump) came out in 2016, it was obviously real and he admitted it at first, but a year later, reports say he mused that maybe it wasn’t real. In another instance, when a recording of a politician’s comments emerged, they suggested it might be doctored. As deepfakes become better known, any video/audio can be challenged. Chesney & Citron warn that this erosion of trust in authentic media is a serious societal risk. It leads to what some call “reality apathy” or “truth decay” – a world where people don’t believe evidence and everything can be dismissed as fake. That obviously benefits corrupt or bad actors who get caught on camera.
- Fraud and Impersonation: On a more everyday level, deepfakes can enable new scams. For instance, in 2019, there was a report of criminals using AI-generated voice to impersonate a CEO and call a subordinate, tricking them into wiring money (about $240,000) to a fraudulent account. Voice deepfake tools are now relatively accessible; imagine getting a voicemail from your family member saying they’re in trouble and need money – but it’s not really them. This is happening: authorities have warned of scams where a parent gets a call from someone sounding like their child claiming to be kidnapped, etc. Video calls could be next (real-time face swapping). Another angle is political propaganda: a deepfake could impersonate a world leader and give a fake speech that could move markets or start conflicts until debunked.
- Undermining Public Discourse: Even aside from liar’s dividend, a flood of fake videos could just overwhelm our information ecosystem. It’s already hard to trust what we read online due to misinformation; if we can’t trust videos either, it further destabilizes the idea of a shared reality. People might retreat to only trusting what aligns with their biases (since seeing is no longer believing). Some scholars call this a potential “Infocalypse” (information apocalypse) – a scenario where any audio/visual evidence can be manufactured cheaply and convincingly, making the truth very hard to discern for the average person.
Fighting Deepfakes: Countermeasures are an active area of work on several fronts:
- Detection Research: Many researchers are working on deepfake detection algorithms (often using AI to fight AI). Early detectors looked for tell-tale artifacts: for example, older deepfakes sometimes didn’t blink normally, or had odd inconsistencies in lighting or edges of the face. Newer ones are better, so detectors now might analyze beyond visuals, like checking if the physics makes sense, or if the audio is exactly in sync, or training networks to spot subtle patterns invisible to humans. The U.S. Dept. of Defense and big tech companies have run contests to spur better detectors (e.g., Facebook’s Deepfake Detection Challenge in 2020). Hany Farid, a pioneer in digital forensics, is a key figure in developing methods to authenticate media. However, it’s a cat-and-mouse game: as detection improves, deepfake quality also improves, often by training against detectors. Some worry about an eventual parity where highly sophisticated fakes are practically impossible to reliably detect with automated means (especially if they’re short or low-quality clips, or if the perpetrators manually touch them up).
- Authenticity Infrastructure: An alternative approach is flipping the problem – instead of detecting fakes, verify reals. Initiatives like the Content Authenticity Initiative (led by Adobe and partners) aim to create standards for attaching metadata to videos/images at the point of recording, like a signature, so you can later verify if a piece of media is original and unedited. For example, a camera could cryptographically sign each frame it records. If a video lacks such a signature or the signature doesn’t match, you treat it with caution. This is like a provenance chain for digital content. It is challenging to implement broadly (it requires industry cooperation and doesn’t solve everything), but it helps with media from reputable sources like news organizations. A minimal sign-and-verify sketch appears after this list.
- Legal Approaches: Laws are being considered or passed. China implemented rules that if a video is AI-synthesized and could mislead, it must be clearly labeled as such (starting 2020). In the U.S., there’s no federal law yet specifically for deepfakes (aside from certain intellectual property or fraud angles), but a few states have acted. For example, Virginia and Texas outlawed deepfakes intended to sabotage candidates in elections (within a certain time frame of election day). California made it illegal to make pornographic deepfakes without consent (and allowed victims to sue). However, legislating is tricky – you have free speech issues and the difficulty of defining what counts as a prohibited deepfake. Still, lawmakers are aware that something must be done, especially as we approach times when deepfakes could be weaponized in politics or national security. On the flip side, there’s also a need to protect against misuse of the “liar’s dividend” – some have suggested that if someone falsely claims true evidence is a deepfake, that could be made illegal in certain contexts (because it’s a form of misinformation).
- Platforms & Policy: Social media companies have, belatedly, updated their policies. Facebook (Meta) and Twitter announced they will remove or label deepfakes that are designed to mislead (with exceptions for parody or satire). In practice, if a high-profile fake emerges, they now have teams to respond. In the Zelenskyy case, Facebook, YouTube, Twitter all took down the fake quickly. However, if a deepfake is spread in more covert ways (e.g., private groups, encrypted apps, fringe sites), platform policies won’t stop it. So it’s an incomplete solution.
- Education and Awareness: Just as people had to learn to be skeptical of unsolicited emails (to avoid phishing), society must adapt to deepfakes. News literacy programs now include segments on not immediately trusting videos and looking for corroboration. One hope was that deepfakes were so hyped in the news that the public would naturally become skeptical of “too sensational” videos. There’s some evidence people are a bit more cautious now, but also a danger: over-skepticism (the liar’s dividend effect). It’s a fine line between healthy skepticism and corrosive cynicism. Public campaigns (like “think before you share”) are trying to address that.
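As a concrete illustration of the “verify the reals” idea from the authenticity-infrastructure bullet above, the following sketch signs a toy media payload and verifies it later. It assumes the openssl R package and an invented frame payload; it sketches only the sign-then-verify principle, not the Content Authenticity Initiative’s actual standard.

```r
# A minimal sketch of the "verify the reals" idea, assuming the openssl R package.
# Illustrative only: real provenance standards involve much more than signing raw bytes.
library(openssl)

recorder_key <- rsa_keygen(2048)      # in practice held in secure hardware on the camera
recorder_pub <- recorder_key$pubkey

# Stand-in for recorded media: raw bytes of a toy "frame".
frame <- charToRaw("frame-0001: pixel data would go here")

# At capture time: sign a SHA-256 digest of the frame with the recorder's private key.
sig <- signature_create(frame, hash = sha256, key = recorder_key)

# Later: anyone can verify the frame against the recorder's public key.
signature_verify(frame, sig, hash = sha256, pubkey = recorder_pub)    # TRUE

# Tampering with even one byte (say, a swapped face) breaks verification.
tampered <- charToRaw("frame-0001: pixel data would go here (altered)")
tryCatch(signature_verify(tampered, sig, hash = sha256, pubkey = recorder_pub),
         error = function(e) FALSE)                                   # FALSE
```

The useful property is that verification fails the moment any byte changes, which is exactly what a provenance chain needs; the hard part, as noted above, is getting cameras, editing tools, and platforms to carry the signatures end to end.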
Deepfakes underscore a broader theme of this chapter: technology’s impact is not just about direct performance (like an ML model’s accuracy) but also about how it affects human trust and social systems. Chesney and Citron (2019) described deepfakes as “a looming challenge for privacy, democracy, and national security”. Indeed, they strike at the heart of how we establish truth. Democracies depend on a shared reality and credible information; deepfakes threaten to undermine that in a way that previous media manipulation (text or image-based) could not, because video evidence has traditionally been a gold standard for truth. If that gold standard falls, the consequences are hard to predict.
On a concluding note for deepfakes: the technology itself is fascinating and in some cases beneficial, but it exemplifies the need for AI ethics and policy to catch up with AI capabilities. It calls for a multi-stakeholder approach – technologists to improve detection and authentication, lawmakers to create deterrents and remedies, platforms to police misuse, and users to be vigilant. Deepfakes also make us realize the importance of media provenance and critical thinking skills in the digital age.
Each case study we’ve discussed (biased hiring algorithms, flawed grading algorithm, deceptive conversational AI, and deepfakes) highlights different failure modes or risks of machine learning and AI:
- They can encode and amplify historical biases (hiring, ads).
- They can conflict with our values of fairness and equal opportunity (exam grading).
- They can violate norms of transparency and consent (Duplex).
- They can erode trust in information and be used maliciously (deepfakes).
These are not purely technical problems; they are socio-technical. The solutions require more than just better code – they require ethical frameworks, interdisciplinary thinking, and often regulation or oversight.
To tie everything together: machine learning in action has taught us not only technical lessons (like the importance of validation and tuning) but also societal lessons (the importance of aligning technology with values). As ML and AI systems continue to proliferate in decision-making, the “lessons learned” must include both how to get the ML pipeline right and how to ensure the outcomes serve humanity. This dual awareness – technical excellence and ethical responsibility – will define the successful AI practitioners and policies of the coming years.
References
Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16(3), 199–231. https://doi.org/10.1214/ss/1009213726
British Standards Institution. (2016). BS 8611:2016 Robots and robotic devices—Guide to the ethical design and application of robots and robotic systems. BSI.
Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. Proceedings of Machine Learning Research, 81, 1–15.
Chesney, R., & Citron, D. K. (2019). Deepfakes and the new disinformation war: The coming age of post‑truth geopolitics. Foreign Affairs, 98(1), 147–155.
Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27. https://doi.org/10.1109/TIT.1967.1053964
Dastin, J. (2018, October 10). Amazon scraps secret AI recruiting tool that showed bias against women. Reuters. https://www.reuters.com
European Commission. (2021). Proposal for a regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) COM(2021) 206 final. https://eur‑lex.europa.eu
IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems. (2017). Ethically aligned design: A vision for prioritizing human well‑being with autonomous and intelligent systems (1st ed.). IEEE.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: With applications in R. Springer.
Kuhn, M., & Wickham, H. (2020). tidymodels: A collection of packages for modeling and machine learning using tidyverse principles (R package version 0.1.0). https://www.tidymodels.org
Lambrecht, A., & Tucker, C. (2019). Algorithmic bias? An empirical study into apparent gender‑based discrimination in the display of STEM career ads. Management Science, 65(7), 2966–2981. https://doi.org/10.1287/mnsc.2018.3093
Lomas, N. (2018, May 9). Duplex shows Google failing at ethical and creative AI design. TechCrunch. https://techcrunch.com
Office of Qualifications and Examinations Regulation (Ofqual). (2020). Awarding GCSE, AS, A level, advanced extension awards and extended project qualifications in summer 2020: Interim report. https://www.gov.uk/government/publications
Raghavan, M., Barocas, S., Kleinberg, J., & Levy, K. (2020). Mitigating bias in algorithmic hiring: Evaluating claims and practices. In Proceedings of the 2020 ACM Conference on Fairness, Accountability, and Transparency (pp. 469–481). ACM. https://doi.org/10.1145/3351095.3372828