12  Unsupervised Learning in IB Research

“If you live 5 minutes away from Bill Gates, I bet you are rich.” – Mohammad F. Al-Eryani, Journal of International Business Studies (vol. 21, no. 3)

Machine Learning (ML) is a broad field within data science that focuses on developing algorithms that can learn patterns from data. ML algorithms are typically categorized into three main types:

  • Supervised learning: the model learns from labeled examples, i.e., data where the correct outputs are known.
  • Unsupervised learning: the model searches for structure in data without any outcome labels.
  • Reinforcement learning: the model learns by interacting with an environment and receiving rewards or penalties.

An intuitive way to think about these categories is by the presence or absence of feedback signals. In supervised learning we have a “teacher” signal (the correct labels), in unsupervised learning we do not, and in reinforcement learning the feedback is delayed and comes as rewards.

Another useful concept is from psychology: System 1 vs System 2 thinking (a term popularized by Daniel Kahneman). System 1 is fast, intuitive pattern recognition, whereas System 2 is slow, analytical reasoning. In some ways, certain machine learning models can mimic these: for example, a simple rule-based model might act like System 2 (explicit logic), whereas a complex deep neural network might function more like System 1 (making instant predictions after extensive training). This analogy highlights how machine learning can automate both quick intuition and careful analysis.

Supervised Learning: Regression vs Classification

In supervised learning, we further distinguish between regression and classification tasks:

  • Regression models predict a continuous numerical output. For example, predicting a stock price or a house’s value based on input features would be a regression problem.
  • Classification models predict a categorical output (class labels). For instance, determining whether an email is “spam” or “not spam,” or whether a tumor is “benign” or “malignant,” are classification tasks.

Both regression and classification involve learning a mapping \(f(\mathbf{x})\) from feature inputs \(\mathbf{x}\) to an output \(y\). The key difference lies in the nature of \(y\): numeric for regression vs. category labels for classification. This difference affects the choice of algorithms and evaluation metrics. For example, regression errors are often measured by mean squared error, whereas classification errors are measured by misclassification rate (the fraction of labels predicted incorrectly).
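To make these two error measures concrete, here is a minimal sketch (the numbers are toy values invented for illustration) computing a mean squared error and a misclassification rate with NumPy:

```python
import numpy as np

# Toy regression example: mean squared error
y_true_reg = np.array([3.0, 5.5, 2.0, 4.1])
y_pred_reg = np.array([2.8, 5.0, 2.3, 4.4])
mse = np.mean((y_true_reg - y_pred_reg) ** 2)      # average squared deviation

# Toy classification example: misclassification rate
y_true_cls = np.array(["spam", "ham", "spam", "ham"])
y_pred_cls = np.array(["spam", "spam", "spam", "ham"])
error_rate = np.mean(y_true_cls != y_pred_cls)     # fraction of labels predicted incorrectly

print(f"MSE: {mse:.3f}, misclassification rate: {error_rate:.2f}")
```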

Example: Suppose we have a dataset of beverages with features like color (measured in nm of light wavelength) and alcohol percentage, and the task is to predict whether the beverage is beer or wine. This is a classification problem (predicting category). If instead the task were to predict the exact alcohol percentage from other properties, that would be a regression problem. In the classification setting, we might observe data such as:

Color (nm)   Alcohol (%)   Beverage Type (Label)
610          5             Beer
599          13            Wine
693          14            Wine
430          5             Beer

Here “Color” and “Alcohol %” are input features \(X\), and “Beverage Type” is the target \(Y\). A classification model would learn a decision boundary in this 2D feature space to separate beers from wines. A regression model would not apply here because the output is not numeric. Generally, if \(Y\) is numeric (say, sugar content), we use regression; if \(Y\) is categorical, we use classification.

The Machine Learning Process

Regardless of regression or classification, building a supervised ML model usually follows a series of steps:

  1. Gathering data: Collect a dataset with input features and target labels.
  2. Pre-processing: Clean the data, handle missing values, encode categorical variables, and scale features if necessary (feature scaling is especially important for methods that rely on distance metrics, such as the k-Nearest Neighbors method discussed later in this chapter).
  3. Choosing a model: Select an appropriate algorithm or model family (e.g., linear regression, decision tree, neural network, etc., for regression/classification tasks).
  4. Training the model: Fit the model to the training data. This often involves optimizing model parameters to minimize a loss function (like least squares error for regression or a classification loss).
  5. Evaluation: Assess model performance on a test set (data not seen during training) to estimate how well the model generalizes. For classification, metrics could include accuracy, precision/recall, etc., while for regression one might look at RMSE or \(R^2\).
  6. Model tuning: Adjust hyperparameters and potentially use techniques like cross-validation (e.g., k-fold cross-validation) to find the best model configuration and avoid overfitting. This may involve repeating training and evaluation multiple times.
  7. Deployment and prediction: Use the trained model to make predictions on new, unseen data (out-of-sample prediction).

These steps can be visualized as a cycle where data is used to train a model, the model is tested, and insights from evaluation might lead us to gather more data or try a different model, and so on. It’s an iterative process of improvement.
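As a rough sketch of how steps 1-7 might look in code, the pipeline below uses scikit-learn on a synthetic dataset; the data, the choice of logistic regression, and the hyperparameter grid are illustrative assumptions rather than a recommended recipe:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Gather data (synthetic stand-in for a labeled dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 2.-3. Pre-process (scaling) and choose a model family
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 4.-6. Train and tune with k-fold cross-validation on the training split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
grid = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# 5./7. Evaluate on held-out data and predict out of sample
print("Test accuracy:", accuracy_score(y_test, grid.predict(X_test)))
```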

Key Principle – Bias-Variance Tradeoff: In supervised learning, a fundamental challenge is the trade-off between bias and variance. Bias is error from erroneous assumptions in the learning algorithm (oversimplified model), while variance is error from sensitivity to small fluctuations in the training set (overly complex model). As model complexity (flexibility) increases, bias tends to decrease but variance tends to increase. We seek a model that balances these to minimize test error – error on new data – which often has a U-shaped curve as model flexibility increases (high error for very simple models due to high bias, high error for very complex models due to high variance, and a sweet spot in between).
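The U-shaped test error curve can be reproduced with a small simulation. The sketch below assumes a noisy sine-curve signal fit with polynomials of increasing degree; training error typically keeps falling as flexibility increases, while test error falls and then rises again:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60)[:, None]
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.3, 60)          # noisy training data
x_test = rng.uniform(0, 1, 200)[:, None]
y_test = np.sin(2 * np.pi * x_test).ravel() + rng.normal(0, 0.3, 200)

for degree in [1, 3, 9, 15]:                                         # increasing flexibility
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x, y)
    train_err = mean_squared_error(y, model.predict(x))              # falls with flexibility
    test_err = mean_squared_error(y_test, model.predict(x_test))     # U-shaped in flexibility
    print(degree, round(train_err, 3), round(test_err, 3))
```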

Question: What is the difference between regression and classification models? Answer: Regression models predict quantitative continuous outputs (e.g., predicting a number) whereas classification models predict qualitative categorical outputs (assigning class labels). In other words, if the target variable is numeric (like predicting sales revenue), it’s a regression problem; if the target variable is a category (like predicting if a firm will default or not), it’s a classification problem.

Next, we will focus on unsupervised learning techniques, which are especially useful in exploratory data analysis and in situations where we don’t have outcome labels. In particular, we discuss Principal Component Analysis (PCA) and Factor Analysis for dimensionality reduction, and then transition to discuss a classification method (k-Nearest Neighbors) which, while technically supervised, provides an intuitive bridge in understanding pattern recognition.

12.1 Unsupervised Learning I: Principal Component Analysis and Factor Analysis

Unsupervised learning methods aim to summarize or reveal structure in data without reference to an explicit response variable. One important subset of unsupervised techniques is multivariate statistics, which deals with analyzing multiple variables together, focusing on the variation they share in common.

Definition – Multivariate Statistics: A collection of methods for datasets with many variables (high-dimensional data) where the goal is to understand relationships and common variation among variables, rather than predicting an outcome. In other words, multivariate analysis seeks to capture the structure in data by considering all variables simultaneously. This differs from supervised modeling (which focuses on predicting a specific \(Y\) from \(X\)) and also from clustering (which segments observations). Instead, techniques like PCA and factor analysis aim to reduce dimensionality or identify underlying factors that explain the correlations among variables.

Why reduce dimensionality? Modern datasets often have a very large number of variables. This poses challenges:

  • It is difficult for a human analyst to inspect and reason about each variable individually when there are dozens or hundreds of them – our minds can’t easily visualize high-dimensional data.
  • Many variables may be redundant or correlated, meaning they carry overlapping information. There may be underlying “themes” or factors that drive a lot of the covariance among those variables.

By using dimensionality reduction, we create a smaller set of derived variables that (hopefully) capture most of the important information in the full dataset. Two common approaches are Principal Component Analysis (PCA) and Factor Analysis (FA).

Both PCA and FA create new variables that are combinations of the original variables, but they have different goals and interpretations, which we will explore.

Principal Component Analysis (PCA)

Goal: PCA seeks to reduce the number of dimensions in a dataset while retaining as much variation (information) as possible. It does this by finding new variables, called principal components, which are linear combinations of the original variables. The principal components are ordered such that the first component explains the largest amount of variance in the data, the second component explains the second-largest amount of variance (subject to being uncorrelated with the first), and so on. In essence, PCA finds the “directions” in the data that have the most spread (variance).

How PCA works: Geometrically, imagine your data as a cloud of points in a high-dimensional space. PCA finds a new coordinate system for that space:

  • The first axis (first principal component) is the direction in which the data varies the most.
  • The second axis (second principal component) is the direction of next highest variance orthogonal to the first, and so forth.

Thus, PCA is effectively performing a rotation of the original axes to a new set of axes that align with the directions of greatest variability in the data. These new axes are orthogonal (uncorrelated) and ordered by the amount of variance they capture.

Mathematically, if \(\mathbf{X}\) is the standardized data matrix (each column a variable), PCA finds linear combinations of the columns. For example, the first principal component can be written as:

\(z_1 = u_{11} X_1 + u_{12} X_2 + \cdots + u_{1p} X_p,\)

where \(X_1, X_2, \dots, X_p\) are the original variables and \(u_{1j}\) are weights. The weights are chosen such that the variance of \(z_1\) is maximized (subject to \(u_{11}^2 + \cdots + u_{1p}^2 = 1\) to avoid trivial solutions). The resulting \(z_1\) (the first component) captures the maximum possible variance in a single dimension. Then we find \(z_2\) as another linear combination orthogonal to \(z_1\) that captures the next most variance, and so on.

In matrix terms, we solve an eigenvalue/eigenvector problem on the covariance (or correlation) matrix of \(\mathbf{X}\). The principal component directions (the weight vectors) are the eigenvectors, and the variances of the components are the corresponding eigenvalues. Computing PCA typically involves an eigen-decomposition of this matrix or, equivalently, a singular value decomposition of the data matrix.
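A minimal NumPy sketch of this eigen-decomposition view of PCA, using randomly generated data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # toy data: 100 observations, 5 variables
X_std = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize (mean 0, sd 1)

cov = np.cov(X_std, rowvar=False)                # covariance matrix of the standardized data
eigvals, eigvecs = np.linalg.eigh(cov)           # eigen-decomposition (ascending order)
order = np.argsort(eigvals)[::-1]                # sort components by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X_std @ eigvecs                         # principal component scores z_1, ..., z_p
explained = eigvals / eigvals.sum()              # fraction of total variance per component
print(np.round(explained, 3))
```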

Properties of PCA:

  • PCA is an orthogonal linear transformation: the components are uncorrelated with each other by construction.
  • The total variance in the data is equal to the sum of variances of all principal components (no information is lost if we keep all components). By choosing a subset of the top \(k\) components, we can often capture a large percentage of the total variance with far fewer dimensions than \(p\).
  • PCA is sensitive to scale: variables should be standardized (e.g., mean 0 and standard deviation 1) before PCA, otherwise a variable with larger numeric scale will dominate the variance.
  • PCA works on numeric continuous data. Categorical data must be encoded numerically (e.g., one-hot encoding) before applying PCA.
  • The principal components are unique up to sign (you can multiply a component by -1 and it’s still the same “direction”). They are ordered by variance (also known as explained variance or information content).

Interpreting PCA results: Because PCA finds combinations that maximize variance, the resulting components often have a mixed interpretation – each may involve many original variables. The first few components are the most informative. Often we look at the explained variance ratio for each component to decide how many components to keep. For example, the first component might explain 40% of variance, the second 20%, the third 10%, etc. We might decide to keep just the first two or three components if they together explain a large majority of the variance.

By projecting the data onto these few components, we can create 2D or 3D visualizations of high-dimensional data and possibly identify patterns such as clustering of observations. PCA is widely used for data exploration and as a preprocessing step to reduce dimensionality before applying other algorithms (especially for algorithms that struggle with too many features).

To summarize in practical terms: PCA reduces a large set of variables to a smaller set of components that still contain most of the information (variance). Instead of analyzing dozens of variables individually, an analyst can examine a handful of components.

Example – Netflix Movie Recommendations: The idea behind PCA is similar to how one might approach a recommendation system. In the famous Netflix Prize context, one could use a matrix of user movie ratings and apply a PCA-like decomposition (technically singular value decomposition) to find a few underlying “components” or latent factors (e.g., genre preference, actor preference) that explain how users rate movies. Each movie and user can be represented in terms of these latent factors, which drastically reduces complexity (from tens of thousands of movies to, say, 20 latent dimensions). This approach of finding latent features is at the core of recommendation algorithms (Netflix’s actual algorithm was a form of matrix factorization). PCA provides a foundation for understanding such latent factor models.
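To make the latent-factor idea concrete, here is a hedged toy sketch: a small invented ratings matrix is decomposed with an SVD and truncated to two latent dimensions, so every user and movie gets a low-dimensional representation (the ratings and the choice of rank 2 are assumptions for illustration, not Netflix's actual data or algorithm):

```python
import numpy as np

# Made-up user x movie ratings (rows: 4 users, columns: 5 movies)
R = np.array([[5, 4, 1, 1, 0],
              [4, 5, 2, 1, 1],
              [1, 1, 5, 4, 5],
              [2, 1, 4, 5, 4]], dtype=float)

U, s, Vt = np.linalg.svd(R - R.mean(), full_matrices=False)
k = 2                                    # keep 2 latent "taste" dimensions
user_factors = U[:, :k] * s[:k]          # each user as a point in latent space
movie_factors = Vt[:k, :].T              # each movie as a point in the same space
R_approx = user_factors @ movie_factors.T + R.mean()   # low-rank reconstruction of ratings
print(np.round(R_approx, 1))
```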

Variance explained by components: The variance of each principal component is given by its eigenvalue. If \(\lambda_1, \lambda_2, \dots, \lambda_p\) are the eigenvalues of the covariance matrix (sorted from largest to smallest), then \(\lambda_1 / \sum_{j=1}^p \lambda_j\) is the fraction of total variance explained by the first component, and so on. We often look at the scree plot, which is a bar plot of eigenvalues or variance explained by each component, to decide an appropriate number of components to retain (e.g., keep enough components to explain ~80-90% of variance). Components beyond that contribute little and mostly represent noise.

One can formally reconstruct the data from all \(p\) components without loss. But if we drop the less informative components (those with very low variance), we get a lower-dimensional approximation of the data. This can act as a noise-filtering mechanism too, since minor components often correspond to noise.
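A short scikit-learn sketch of this reconstruction idea (synthetic data assumed): as fewer leading components are kept, the cumulative explained variance shrinks and the reconstruction error grows.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))                                       # 2 true underlying signals
X = latent @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(200, 8))   # 8 noisy observed variables
X_std = StandardScaler().fit_transform(X)

for k in (1, 2, 4, 8):
    pca = PCA(n_components=k).fit(X_std)
    X_hat = pca.inverse_transform(pca.transform(X_std))   # reconstruction from k components
    print(k,
          round(pca.explained_variance_ratio_.sum(), 3),  # cumulative variance explained
          round(np.mean((X_std - X_hat) ** 2), 4))        # reconstruction error
```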

Factor Analysis (FA)

Goal: Like PCA, Factor Analysis reduces the dimensionality of data, but it has a different motivation. FA assumes that observed variables are influenced by a smaller number of unobserved latent variables called factors. The aim is to discover these latent factors that cause the observed correlations among variables, and to estimate both the common factors and the unique “specific” variance for each original variable. In contrast to PCA which is a purely mathematical decomposition, factor analysis is often considered a statistical model with a conceptual underpinning – we hypothesize that certain latent constructs exist.

In Exploratory Factor Analysis (EFA), we typically do not pre-specify the factors but let the data suggest how many there are and what they might represent. Each observed variable is assumed to be a linear combination of some common factors plus a specific factor (unique error term) for that variable.

Common Factor Model: For example, suppose student test scores in various subjects are our observed variables. We might hypothesize two latent factors: one representing “language ability” and another representing “technical/mathematical ability.” A student’s score in French and German would both load highly on the language factor, whereas scores in Math and Physics would load on the technical factor. However, each test also has specific variance (perhaps a student is generally good at languages but just happened to not do well in German specifically – that specific part isn’t explained by the overall language ability factor).

Mathematically, if \(X_1, X_2, X_3, X_4\) are four observed test scores, and we assume two common factors \(k_1\) (language) and \(k_2\) (technical), the model could be written as:

\[ \begin{aligned} X_1 &= c_{11} k_1 + c_{12} k_2 + d_1,\\ X_2 &= c_{21} k_1 + c_{22} k_2 + d_2,\\ X_3 &= c_{31} k_1 + c_{32} k_2 + d_3,\\ X_4 &= c_{41} k_1 + c_{42} k_2 + d_4~, \end{aligned} \]

where the \(c_{ij}\) are factor loadings (how strongly each factor influences each variable) and \(d_1, \dots, d_4\) are the specific factors (unique part of each \(X\) not explained by the common factors). Here \(k_1\) might be high for a student who is generally good at languages, and \(d_1\) might adjust \(X_1\) (say, German test) for the student’s particular affinity or lack thereof for German beyond general language ability.

A key difference in the factor analysis model is that it explicitly includes these unique factors \(d_i\) which account for variance in each variable that is not shared with others. PCA, by contrast, treats all variance as common variance to be accounted for by components (PCA does not separate out a “specific variance” term for each variable – effectively PCA assumes that any noise or specific variance will just end up in later components).

Estimation: In factor analysis, we generally estimate the loadings \(c_{ij}\) and sometimes the factor scores for each observation (if needed) by methods like maximum likelihood or principal axis factoring. The number of factors is typically chosen by looking at criteria like eigenvalues (Kaiser criterion), scree plot, or by theoretical considerations. One often uses techniques such as factor rotation to achieve a simpler, more interpretable structure of loadings.

Rotation: Rotation is a crucial concept in FA. Once the initial solution is obtained, we can rotate the factor axes (in the multidimensional space of factors) without changing the overall fit of the model (if the rotation is orthogonal, it preserves the independence of factors; if oblique, factors can become correlated). The purpose of rotation is to find a factor orientation where each variable loads strongly on only one factor (achieving simple structure). This makes interpretation easier – ideally each factor corresponds to a clear concept (e.g., a factor where only language tests have high loadings can be interpreted as “language ability”). PCA components, on the other hand, are uniquely defined by the maximal variance criterion and do not have such flexibility for rotation post-hoc (PCA gives one specific orthogonal basis ranked by variance).
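For a concrete sketch, recent versions of scikit-learn (0.24 and later) expose a varimax rotation option in FactorAnalysis; the simulated two-factor data below mirror the test-score example and are purely illustrative:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Simulate 4 test scores driven by 2 latent abilities plus specific noise
k = rng.normal(size=(300, 2))                      # latent factors: language, technical
loadings = np.array([[0.9, 0.1],                   # French loads on language
                     [0.8, 0.2],                   # German loads on language
                     [0.1, 0.9],                   # Math loads on technical
                     [0.2, 0.8]])                  # Physics loads on technical
X = k @ loadings.T + 0.4 * rng.normal(size=(300, 4))   # observed scores with specific noise

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
print(np.round(fa.components_.T, 2))    # estimated loadings c_ij (rows: variables, cols: factors)
print(np.round(fa.noise_variance_, 2))  # estimated specific (unique) variances
```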

Differences between PCA and Factor Analysis:

  • Variance accounted vs. modeling causes: PCA seeks to explain variance in the dataset with combinations of observed variables (components are constructed to capture maximal variance). Factor analysis seeks to explain covariances or correlations between variables by latent factors. Another way to say this: PCA is a descriptive technique, whereas factor analysis is a statistical model (with latent variables and error terms).
  • Specific variance: PCA treats all variance of variables as information to be explained by components. FA explicitly separates variance into common (shared by factors) and unique (specific to each variable) parts. This is why in factor analysis the diagonal of the covariance matrix is adjusted (communality vs uniqueness) whereas PCA uses the full variance on the diagonal. The inclusion of specific factors in FA means it generally requires an iterative solving process, often using specialized algorithms.
  • Interpretability: Because factor analysis allows rotation and seeks factors that “make sense” (often aligned with theoretical constructs), the resulting factors can be more interpretable in terms of real-world latent traits. PCA components might be harder to interpret since they are constrained to maximize variance – they can be arbitrary linear combos that don’t correspond to intuitive concepts, especially if the top variance directions are not aligned with distinct phenomena.
  • When to use which: If your goal is data reduction for prediction or visualization, and you just want to capture maximum information with fewer variables, PCA is typically appropriate (e.g., reducing dimensionality before feeding data to a machine learning model, or visualizing high-dimensional data). If your goal is to identify latent constructs that underlie your observed measures (as in psychology questionnaires, socioeconomic indicators, etc.), and you care about interpreting those latent factors, then factor analysis is more suitable. For instance, in marketing or IB research, factor analysis might help condense a long survey into a few factors like “Customer Satisfaction”, “Perceived Value”, etc., which are easier to reason about.
  • Mathematical relation: PCA and factor analysis are closely related – both involve decomposing a matrix – but they are not identical. In fact, one can show that if specific variances were all zero, factor analysis and PCA would yield the same factors/components. In practice, PCA is sometimes used as a quick approximation to factor analysis. However, factor analysis typically uses a covariance matrix with communalities (shared variance) on the diagonal instead of 1’s (for correlation matrix in PCA). This difference means PCA tends to give higher weight to variables with large total variance, whereas FA adjusts for unreliability or specificity in each variable.
  • Rotation in FA vs PCA: PCA results are unique (for a given dataset after standardization) – you cannot rotate PCA principal components arbitrarily without losing the variance-maximization property. Factor analysis solutions, by contrast, can be rotated to an equivalent solution that may be more interpretable. For example, two factors might be rotated so that each aligns with a subset of questionnaire items, yielding clearer meaning for each factor (one might be labeled “Economic Value” and another “Aesthetic Appeal”, etc., depending on which survey questions load on them).

Example – Survey Questionnaire: Imagine a marketing survey with 20 questions about a product, covering aspects like quality, price fairness, brand credibility, and design aesthetics. Running a factor analysis might reveal, say, four underlying factors:

  • Factor 1: Economic Value (questions about price, value for money, etc. load high here)
  • Factor 2: Functional Benefits (questions about product performance, usefulness)
  • Factor 3: Credibility (questions about brand trust, reputation)
  • Factor 4: Aesthetics (questions about design, look-and-feel)

Each question will also have some unique variance (people might randomly answer one question differently even if they have a consistent view on the factor). After extracting factors, we might apply a rotation (e.g., oblimin or varimax rotation) to achieve a cleaner pattern where each question strongly associates with one factor. The resulting factors correspond to conceptual groupings that a researcher can name and discuss. This is very useful in International Business (IB) research or other social sciences, where we often assume that certain latent traits (consumer perceptions, organizational capabilities, etc.) manifest through multiple observed indicators.

Summary: PCA is primarily a tool for dimensionality reduction – producing uncorrelated components that successively maximize variance. Factor Analysis is a tool for latent variable discovery – modeling data as arising from a few latent factors and specific noise components, often allowing more interpretability through rotations. Both reduce dimensionality, but their use-cases and interpretations differ.

Now that we’ve covered these techniques for reducing and interpreting high-dimensional data, we will turn our attention to a classic approach in pattern recognition – the k-Nearest Neighbors (KNN) method – which will segue into thinking about classification decision boundaries and model flexibility.

12.2 Unsupervised Learning II: k-Nearest Neighbors Classification

(Note: k-Nearest Neighbors is actually a supervised learning method when used for classification, since it requires labeled examples. However, it’s often discussed in an intuitive pattern recognition context and serves as a simple non-parametric method to illustrate key concepts in classification, such as decision boundaries and the bias-variance trade-off.)

Classification Basics Recap

In classification, we predict a class label \(Y\) from input features \(X\). As mentioned, the training error rate is the fraction of training examples where the predicted class \(\hat{y}_i\) differs from the true class \(y_i\). However, our primary concern is the test error rate – how often the model misclassifies new, unseen data. A fundamental theoretical result is that the optimal classification rule (in terms of lowest possible test error) is to predict the class with highest conditional probability given the predictors. This is known as the Bayes classifier.

The Bayes Classifier: For any feature vector \(x_0\), the Bayes classifier assigns \(x_0\) to the class that has the largest true conditional probability \(P(Y=j \mid X = x_0)\). In a two-class scenario, this is equivalent to predicting the class where the probability exceeds 0.5. The Bayes classifier is ideal but theoretical because in practice we usually do not know these true probabilities (they are properties of the unknown data-generating distribution). It’s called Bayes because it uses Bayesian decision theory to minimize error – essentially picking the most likely class for each \(x_0\). The minimum achievable error rate by any classifier is called the Bayes error rate, which occurs when we always choose the most likely label.

Think of a simple example: we have two classes, Orange and Blue, and for each point in the feature space we somehow know \(P(\text{Orange}|x)\) and \(P(\text{Blue}|x) = 1 - P(\text{Orange}|x)\). The Bayes decision boundary is the set of points in feature space where \(P(\text{Orange}|x) = P(\text{Blue}|x) = 0.5\). If you’re on one side of that boundary, Orange is more likely and Bayes classifier would predict Orange; on the other side, predict Blue. Figure 2.13 from ISL (James et al.) illustrates such a scenario: the Bayes boundary (purple dashed line) cleanly separates an orange region and a blue region where each class is more probable.

Since we can’t use the Bayes classifier in practice (because we don’t know those true probabilities), we try to approximate it with models learned from data. One conceptually simple method to do this is the k-Nearest Neighbors (KNN) classifier.

k-Nearest Neighbors (KNN) Classifier

Idea: To predict the class of a new observation \(x_0\), look at the “nearby” training data points in feature space – i.e., the most similar instances – and have them vote on the class. The intuition is that observations with similar predictor values are likely to have the same class label (this is a form of assuming the underlying probability function is locally smooth).

Algorithm: Given a positive integer \(K\):

  1. Compute the distance (typically Euclidean distance) from \(x_0\) to all training points.

  2. Identify the \(K\) closest training points (the neighborhood \(N_0\) of \(x_0\)).

  3. Estimate \(P(Y=j | X = x_0)\) as the fraction of those \(K\) neighbors that belong to class \(j\). In formula form, if the \(i\)-th neighbor has label \(y_i\), then for class \(j\):

    \[ \hat{P}(Y=j \mid X=x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)\,, \]

    where \(I(\cdot)\) is an indicator function.

  4. Finally, assign \(x_0\) the class with the largest estimated probability (equivalently, the majority class among the \(K\) neighbors).

This procedure makes no explicit assumptions about the overall shape of the decision boundary – it’s completely determined by the local data around the query point. Thus, KNN is a non-parametric method (no fixed number of parameters; complexity grows with number of training instances) and instance-based (it delays decision until query time, using the training instances themselves to compute answers).
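A from-scratch sketch of steps 1-4 (Euclidean distance, majority vote); in practice one would usually reach for a library implementation such as scikit-learn's KNeighborsClassifier, and note that ties between classes are broken arbitrarily here:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, k):
    """Classify x0 by majority vote among its k nearest training points."""
    dists = np.sqrt(((X_train - x0) ** 2).sum(axis=1))   # step 1: Euclidean distances to all points
    neighbors = np.argsort(dists)[:k]                     # step 2: indices of the K closest points
    votes = Counter(y_train[i] for i in neighbors)        # step 3: class frequencies within N_0
    return votes.most_common(1)[0][0]                     # step 4: majority class
```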

Example Illustration: Suppose we have a small 2D dataset for classification with two classes (Blue and Orange). If we set \(K=3\), and we want to classify a new point (marked as a black “X”), we find the 3 closest points. Imagine those 3 neighbors consist of 2 Blue points and 1 Orange point. KNN would predict the majority vote, which is Blue in this case. If we had chosen \(K=1\), we’d just take the single nearest neighbor’s class. If that single nearest neighbor happened to be Orange, we’d predict Orange. So, \(K=1\) is very sensitive to noise or local irregularities, whereas larger \(K\) provides a smoothing effect by averaging more neighbors.

To ground this, let’s consider a concrete mini-dataset (with three features \(X_1, X_2, X_3\) and a class label \(Y\)):

Observation   \(X_1\)   \(X_2\)   \(X_3\)   \(Y\)
1              0         3         0        Red
2              2         0         0        Red
3              0         1         3        Red
4              0         1         2        Green
5             -1         0         1        Green
6              1         1         1        Red

Now, classify \(x_0 = (0,0,0)\) (i.e., \(X_1=X_2=X_3=0\)):

  • Distances from \(x_0\) to each observation (Euclidean):

    \(d(x_0,1) = \sqrt{(0-0)^2+(0-3)^2+(0-0)^2} = 3.0\)
    \(d(x_0,2) = \sqrt{(0-2)^2+(0-0)^2+(0-0)^2} = 2.0\)
    \(d(x_0,3) = \sqrt{(0-0)^2+(0-1)^2+(0-3)^2} \approx 3.16\)
    \(d(x_0,4) = \sqrt{(0-0)^2+(0-1)^2+(0-2)^2} \approx 2.24\)
    \(d(x_0,5) = \sqrt{(0+1)^2+(0-0)^2+(0-1)^2} = \sqrt{1+0+1} \approx 1.41\)
    \(d(x_0,6) = \sqrt{(0-1)^2+(0-1)^2+(0-1)^2} = \sqrt{1+1+1} \approx 1.73\)

    So, the nearest neighbor is observation 5 (distance ~1.41, class Green), second nearest is observation 6 (dist ~1.73, class Red), third nearest is observation 2 (dist 2.0, class Red).

  • If \(K=1\): We take the single closest observation (obs 5, which is Green) and predict Green. The rationale is simply that the closest known point is Green, so our query is likely Green. This can, of course, be risky if that single point is an outlier or noise. (In our example, observation 5 being Green might indicate \(x_0\) should be Green, but we have little evidence.)

  • If \(K=3\): We take the three closest (obs 5: Green, obs 6: Red, obs 2: Red). Among these, Red appears twice vs once for Green, so the majority vote is Red. Thus, with \(K=3\), we’d predict \(x_0\) as Red. The reasoning: two of the three most similar instances were Red.
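The two predictions above can be reproduced with scikit-learn; the sketch below simply re-enters the six observations from the table:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0, 3, 0], [2, 0, 0], [0, 1, 3],
              [0, 1, 2], [-1, 0, 1], [1, 1, 1]])
y = np.array(["Red", "Red", "Red", "Green", "Green", "Red"])
x0 = np.array([[0, 0, 0]])

for k in (1, 3):
    pred = KNeighborsClassifier(n_neighbors=k).fit(X, y).predict(x0)
    print(f"K={k}: {pred[0]}")   # K=1 -> Green (obs 5); K=3 -> Red (majority of obs 5, 6, 2)
```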

This simple exercise shows how different choices of \(K\) can lead to different predictions. With \(K=1\), we got Green; with \(K=3\), we got Red. Which is better depends on the true underlying pattern and noise. Generally:

  • Smaller \(K\) means the classifier is more flexible (since it uses very local information). It can capture fine detail (potentially good if the decision boundary is very irregular or nonlinear), but it can also chase noise — leading to high variance and potential overfitting.
  • Larger \(K\) means the classifier is more restrictive/smooth (looking at a broader neighborhood). It will have higher bias (it might oversimplify the decision boundary), but lower variance (less sensitive to individual data points).

In fact, \(K=1\) is an extreme: it will perfectly classify all training points (0 training error) because each point is its own nearest neighbor, but it often performs poorly on test data (since it essentially memorizes the training set noise). \(K=N\) (where \(N\) is the size of the training set) is the other extreme: it assigns every query to the majority class of the whole training set, ignoring any nuance of \(X\) (high bias, low variance). So, we choose \(K\) somewhere in between. A common approach is to try several values and use cross-validation to pick the \(K\) that minimizes test error.
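A hedged sketch of that selection procedure (synthetic data assumed): cross-validated accuracy is computed over a grid of odd \(K\) values and the best-scoring one is reported.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# 5-fold cross-validated accuracy for odd K from 1 to 29
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 31, 2)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))   # K with the highest cross-validated accuracy
```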

Decision Boundary and Model Complexity: KNN’s decision boundary becomes more jagged for smaller \(K\), and more smooth for larger \(K\). If the true Bayes decision boundary is very nonlinear, a small \(K\) is needed to approximate it closely. If the true boundary is fairly linear, a too-small \(K\) will just overfit noise, and a larger \(K\) would suffice (or even perform better).

In our earlier two-class example with Orange and Blue regions, using KNN with a moderate \(K\) can yield a boundary that closely tracks the Bayes boundary. For instance, with \(K=10\) on a larger simulated dataset, the KNN boundary was almost as good as knowing the true Bayes rule. With \(K=1\) vs \(K=100\), we see the contrast: \(K=1\) produces a wildly wiggly boundary that tries to separate every little point (very low bias, very high variance), while \(K=100\) produces an almost linear boundary (high bias, low variance). Neither extreme is optimal.

Typically, the misclassification error on test data will decrease initially as we increase \(K\) from 1 (reducing variance and still capturing true structure) until an optimal point, and then start increasing if \(K\) becomes too large (introducing too much bias). This results in that characteristic U-shaped test error curve as a function of model flexibility, analogous to what we see in regression models. The training error, in contrast, always decreases (or stays the same) as model flexibility increases – e.g., going from \(K=5\) to \(K=3\) to \(K=1\) will never raise training accuracy because you can always fit training points more precisely with a more flexible model. But the goal is not to minimize training error – it’s to minimize test error, which requires the right balance of flexibility.

KNN in practice: Despite its simplicity, KNN can be quite powerful for certain tasks, especially when the decision boundary is irregular and training data is plentiful. However, KNN suffers in high dimensions due to the curse of dimensionality: when \(p\) (number of features) is large, points tend to all be far apart; the concept of “nearest” becomes less meaningful as every point is somewhat distant. Additionally, KNN can be computationally heavy for large datasets, because making a prediction involves computing distances to all training points. There are data structures (like KD-trees or ball trees) and approximation techniques to speed up neighbor searches.

Summary: KNN is a straightforward classification approach: find the closest training examples and let them vote. It’s an example of a memory-based learner (it effectively “stores” the training set and defers generalization until query time). Its simplicity makes it a good baseline but it highlights core concepts:

  • The idea of approximating the Bayes classifier by local probability estimation.
  • The relationship between model flexibility and overfitting: low \(K\) (high flexibility) can overfit, high \(K\) (low flexibility) can underfit.
  • How distance metrics and feature scaling matter (all features contribute to distance; one should normalize features to avoid one dominating due to scale).

Finally, it is worth noting that while we introduced KNN as a classification method, a similar approach can be used for regression (where you average the \(Y\) values of neighbors instead of taking a majority vote).
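As a final illustrative sketch (toy one-dimensional data assumed), the regression variant averages the neighbors' \(Y\) values instead of taking a vote:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.2, 4.1, 4.8])

# With K=2, the prediction at x=2.5 is the mean of the Y values of the two nearest points (x=2 and x=3)
pred = KNeighborsRegressor(n_neighbors=2).fit(X, y).predict([[2.5]])
print(pred)   # (1.9 + 3.2) / 2 = 2.55
```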

12.3 Conclusion

In this chapter, we explored key concepts of machine learning and unsupervised techniques in the context of IB research:

  • We differentiated supervised vs unsupervised learning and regression vs classification. Supervised learning uses labeled data to predict outcomes, whereas unsupervised learning draws insights from unlabeled data.
  • We learned about Principal Component Analysis (PCA) as a tool for dimensionality reduction that creates new orthogonal components capturing maximum variance. PCA helps simplify complex datasets while retaining most information.
  • We discussed Factor Analysis (FA), which seeks latent factors underlying observed variables and is widely used when interpretability of underlying constructs is important (common in social science research). We saw how FA introduces the concept of specific variance and allows rotation for interpretability.
  • We examined the k-Nearest Neighbors method as an intuitive classification approach. Through KNN, we highlighted the bias-variance trade-off and the idea of decision boundaries. KNN is simple but can approximate the theoretically optimal Bayes classifier if provided enough data and an appropriate choice of \(K\).

For researchers in International Business (or any applied field), these techniques are valuable. PCA and FA can reduce survey data or financial indicators into a few meaningful dimensions (e.g., indices of economic freedom, or cultural factors) which can then be used in further analysis. KNN and other classification tools can help in segmenting markets or predicting categories (like which companies are likely to expand abroad vs which are not, based on various features).

In practice, choosing the right method depends on the goal: use PCA when you need an efficient summary of data, use FA when you hypothesize hidden constructs, and use classification methods like KNN (or more advanced ones) when you have known categories to predict. Regardless of method, always be mindful of overfitting vs generalizing – a model should not just explain the past data but also reliably predict future or out-of-sample cases.

You’ve now survived your first deep dive into machine learning concepts – congratulations! This foundation will serve you well as you encounter more complex models and real-world data challenges. The next steps could include exploring advanced algorithms and also considering the ethical implications of ML in business contexts (such as fairness, transparency, and privacy). Keep experimenting with these techniques on datasets relevant to IB research, and you’ll gain both insights and confidence in applying machine learning to real problems.