Machine‑Learning Approaches to Classification
Where Part II focused on predicting how much, Part III tackles the equally important question of which one. Classification problems—credit‑default vs. repayment, spam vs. legitimate e‑mail, benign vs. malignant tumour—pervade business analytics, biomedical science, and social research. This part of the book shows how contemporary machine‑learning methods extend classical classification models, offering better predictive accuracy, calibrated probabilities, and tools for navigating tough issues such as class imbalance and fairness.
We open with the probabilistic foundations of classification, revisiting logistic regression as a baseline and introducing the evaluation metrics that guide model selection far more effectively than raw accuracy: the confusion‑matrix measures of precision, recall, and F‑score, together with ROC‑AUC and calibration curves. Building on that foundation, we develop discriminative and generative learners side by side: Naïve Bayes, linear and quadratic discriminant analysis, k‑nearest neighbours, support‑vector machines, decision trees, and ensemble methods such as random forests and gradient boosting. Each algorithm is presented with geometric intuition, loss‑function formalism, and hands‑on R/Python code for hyper‑parameter tuning via cross‑validation.
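As a foretaste of that workflow, the following minimal sketch (scikit‑learn on a synthetic data set, not one of the book's case studies) fits a logistic‑regression baseline, tunes its regularisation strength by cross‑validation, and reports the confusion matrix alongside precision, recall, F‑score, and ROC‑AUC:

```python
# Sketch: a logistic-regression baseline with cross-validated tuning and
# confusion-matrix metrics, fitted to a synthetic binary data set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, roc_auc_score)

X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Tune the regularisation strength C by 5-fold cross-validation.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
                    scoring="roc_auc", cv=5).fit(X_tr, y_tr)

proba = grid.predict_proba(X_te)[:, 1]      # predicted P(y = 1 | x)
pred = (proba >= 0.5).astype(int)           # hard labels at the default threshold

print("confusion matrix:\n", confusion_matrix(y_te, pred))
print("precision:", precision_score(y_te, pred))
print("recall:   ", recall_score(y_te, pred))
print("F-score:  ", f1_score(y_te, pred))
print("ROC-AUC:  ", roc_auc_score(y_te, proba))
```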
Midway through the part, we confront practical challenges that accompany real‑world classification tasks:
- High‑dimensional feature spaces (text, images, genomics) and the role of embedding techniques, feature hashing, and dimensionality reduction (feature hashing is sketched in code after this list).
- Imbalanced data—rare‑event detection, fraud analytics—and strategies such as cost‑sensitive learning, focal loss, SMOTE, and threshold moving (cost‑sensitive weighting and threshold moving are also sketched after this list).
- Multiclass and multilabel settings, where one‑vs‑rest decompositions give way to native multiclass learners and neural network architectures.
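A minimal sketch of feature hashing, using scikit‑learn's HashingVectorizer on a tiny illustrative corpus (the documents below are placeholders, not book data): raw tokens are hashed into a fixed‑width sparse feature space without storing an explicit vocabulary.

```python
# Sketch: feature hashing for high-dimensional text features.
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["free prize claim now", "quarterly report attached",
        "claim your free prize", "meeting moved to friday"]

# Hash tokens into a fixed 2**10-dimensional sparse space; no vocabulary is kept.
vec = HashingVectorizer(n_features=2**10, alternate_sign=False)
X = vec.transform(docs)
print("feature matrix shape:", X.shape, "non-zeros:", X.nnz)
```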
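And a minimal sketch of two responses to class imbalance, again with scikit‑learn on synthetic rare‑event data rather than the book's case studies: cost‑sensitive class weights while fitting, and threshold moving chosen on a validation split.

```python
# Sketch: class weighting plus threshold moving for a rare-event problem.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, f1_score

# Synthetic rare-event data: roughly 5% positives.
X, y = make_classification(n_samples=6000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X_tr, y_tr, stratify=y_tr,
                                              random_state=0)

# Cost-sensitive learning: up-weight the minority class during fitting.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_fit, y_fit)

# Threshold moving: keep the threshold that maximises F-score on validation data.
prec, rec, thr = precision_recall_curve(y_val, clf.predict_proba(X_val)[:, 1])
f1 = 2 * prec[:-1] * rec[:-1] / np.clip(prec[:-1] + rec[:-1], 1e-12, None)
t_star = thr[np.argmax(f1)]

# Compare the default and moved thresholds on the untouched test set.
p_te = clf.predict_proba(X_te)[:, 1]
print(f"F-score at 0.50: {f1_score(y_te, (p_te >= 0.5).astype(int)):.3f}")
print(f"F-score at {t_star:.2f}: {f1_score(y_te, (p_te >= t_star).astype(int)):.3f}")
```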
A dedicated chapter examines interpretable and fair classification. We integrate post‑hoc explanation tools (SHAP, LIME), monotonicity and sparsity constraints, and algorithmic fairness metrics—equalised odds, demographic parity—demonstrating how to balance predictive performance with transparency and social responsibility.
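The fairness metrics themselves reduce to simple group comparisons. The sketch below computes demographic‑parity and equalised‑odds gaps from hard predictions, true labels, and a sensitive attribute; all arrays here are illustrative random data, not outputs of any model in the book.

```python
# Sketch: demographic-parity and equalised-odds gaps from predictions.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)   # observed outcomes (illustrative)
y_pred = rng.integers(0, 2, size=1000)   # a model's hard predictions (illustrative)
group = rng.integers(0, 2, size=1000)    # sensitive attribute, coded 0 / 1

def tpr(true, pred):                     # true-positive rate within a group
    return pred[true == 1].mean()

def fpr(true, pred):                     # false-positive rate within a group
    return pred[true == 0].mean()

g0, g1 = (group == 0), (group == 1)

# Demographic parity: difference in selection rates between the two groups.
dp_gap = abs(y_pred[g0].mean() - y_pred[g1].mean())

# Equalised odds: the groups should match on both TPR and FPR.
eo_gap = max(abs(tpr(y_true[g0], y_pred[g0]) - tpr(y_true[g1], y_pred[g1])),
             abs(fpr(y_true[g0], y_pred[g0]) - fpr(y_true[g1], y_pred[g1])))

print(f"demographic-parity gap: {dp_gap:.3f}")
print(f"equalised-odds gap:     {eo_gap:.3f}")
```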
The part concludes by linking classification to subsequent decision‑making: probability calibration, expected‑cost frameworks, and utility‑based thresholds illustrate how model outputs translate into optimal actions under uncertainty—whether allocating medical resources or flagging risky transactions.
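To make the utility‑based threshold concrete, here is a short sketch under an assumed, purely hypothetical cost structure: with calibrated probabilities, minimising expected cost amounts to flagging a case whenever its probability exceeds c_FP / (c_FP + c_FN).

```python
# Sketch: turning calibrated probabilities into expected-cost decisions.
# Assumed (hypothetical) costs: a false negative costs 10 units, a false positive 1.
import numpy as np

c_fp, c_fn = 1.0, 10.0

# Flagging is optimal when its expected cost, c_fp * (1 - p), is below the
# expected cost of not flagging, c_fn * p, i.e. when p >= c_fp / (c_fp + c_fn).
threshold = c_fp / (c_fp + c_fn)

p = np.array([0.02, 0.05, 0.09, 0.20, 0.60])   # illustrative calibrated scores
decisions = p >= threshold
expected_costs = np.where(decisions, c_fp * (1 - p), c_fn * p)

print(f"optimal threshold: {threshold:.3f}")
for pi, d, ec in zip(p, decisions, expected_costs):
    print(f"p = {pi:.2f}  flag = {bool(d)}  expected cost = {ec:.2f}")
```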
As in earlier parts, each chapter couples concise theoretical exposition with executable notebooks, reproducible case studies, and checklists for the full modelling workflow, from data partitioning to model audit. By the end of Part III, you will possess a principled, practice‑ready toolkit for building, evaluating, and deploying classification systems across a wide spectrum of applied domains.