11 Logistic Regression in IB Research (Part 2 of 2)

All I know is that I know nothing — Socrates

Logistic regression is used when the outcome (response) is categorical (often binary), rather than continuous as in linear regression. In a binary logistic model, we model the probability \(\pi\) of “success” (e.g. \(Y=1\)) using the logistic function. For a single predictor \(x\), the model is:

\[ \pi = P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}. \]

Unlike linear regression, logistic regression ensures predicted probabilities stay between 0 and 1, and it uses a logit (log-odds) link function. The coefficients \(\beta_i\) represent effects on the log-odds of \(Y=1\). A positive \(\beta_i\) means increasing that predictor raises the odds of success (higher probability), while a negative \(\beta_i\) means it lowers the odds. One can exponentiate coefficients to get odds ratios (OR): \(e^{\beta_i}\) is the multiplicative change in odds for a one-unit increase in the predictor. For example, an OR = 2 means the odds of \(Y=1\) double for each unit increase in that predictor, holding others constant.
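A small numerical sketch makes the odds-ratio interpretation concrete. The coefficients below are invented for illustration; the point is that the ratio of odds between \(x\) and \(x+1\) equals \(e^{\beta_1}\) no matter where on the \(x\) scale you start:

```python
import math

def logistic(eta):
    """Inverse-logit: map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients for a single-predictor model.
b0, b1 = -3.0, 0.7

# Probabilities of success at x = 2 and x = 3.
p2 = logistic(b0 + b1 * 2)
p3 = logistic(b0 + b1 * 3)

# The ratio of odds for a one-unit increase in x is exp(b1),
# regardless of the starting value of x.
odds2 = p2 / (1 - p2)
odds3 = p3 / (1 - p3)
ratio = odds3 / odds2   # equals math.exp(b1)
```

Note that while the odds ratio is constant across \(x\), the change in *probability* is not: the same OR moves the probability more in the middle of the curve than near 0 or 1.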

Logistic regression can be extended to multiple predictors and to multi-category outcomes. Key types include: binary logistic (two-category outcome), ordinal logistic (ordered categories), and multinomial logistic (nominal categories). The fitting uses maximum likelihood estimation rather than ordinary least squares, and model fit is often evaluated by pseudo-\(R^2\) measures (since a traditional \(R^2\) isn’t directly applicable). McFadden’s pseudo-\(R^2\), for instance, typically runs lower than a linear \(R^2\), with values between 0.2 and 0.4 indicating an excellent fit.
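McFadden’s pseudo-\(R^2\) is simply \(1 - \text{LL}_{\text{model}}/\text{LL}_{\text{null}}\), where the null log-likelihood comes from an intercept-only model. A minimal sketch, using invented log-likelihood values:

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo-R^2 = 1 - LL_model / LL_null, where LL_null is the
    log-likelihood of the intercept-only (null) model. Both LLs are negative,
    and LL_model >= LL_null, so the result lies in [0, 1)."""
    return 1.0 - ll_model / ll_null

# Hypothetical log-likelihoods for a fitted model and its null counterpart.
r2 = mcfadden_r2(ll_model=-480.2, ll_null=-699.1)
```

A model no better than the intercept-only fit gives \(R^2 = 0\); the closer the model log-likelihood gets to zero, the closer the measure gets to 1.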

Statistical inference in logistic regression parallels linear regression: we can test whether coefficients differ from zero (e.g. using Wald \(z\)-tests) to determine if predictors are significant, and we can construct confidence intervals for odds ratios or predicted probabilities (often using the delta method or simulation). Model comparisons use chi-square (likelihood-ratio) tests for nested models and AIC for non-nested models, analogous to F-tests and adjusted \(R^2\) in linear models.
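The Wald test and the odds-ratio confidence interval both work on the log-odds scale and then exponentiate. A short sketch with a hypothetical coefficient and standard error:

```python
import math

def wald_or_ci(beta, se, z=1.96):
    """Approximate 95% Wald CI for an odds ratio: form the interval on the
    log-odds scale (beta +/- z*se), then exponentiate the endpoints."""
    return math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical coefficient and standard error for one predictor.
beta, se = 0.78, 0.12
lo, hi = wald_or_ci(beta, se)
z_stat = beta / se   # Wald z-statistic; |z| > 1.96 implies p < 0.05 (two-sided)
```

Exponentiating after, not before, forming the interval is what makes the CI asymmetric around the odds ratio itself.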

12 Portugal’s Red Wines

To illustrate multiple logistic regression, consider a dataset of 1,599 Portuguese red wines. Each wine was physicochemically analyzed (variables like acidity, sugar, chlorides, sulfur dioxide, density, sulfates, alcohol, etc.), and expert tasters rated the wine’s quality from 0 (very bad) to 10 (excellent). For our example, the quality scores were simplified into a binary outcome “good” vs “not good” (e.g. wines rated ≥7 considered “good” = 1, else 0). We want to see if we can predict whether a wine is good from its chemical properties.

Using multiple predictors in logistic regression has pros and cons similar to linear regression. Including more predictors can improve accuracy and better utilize data (capturing more relationships), but it becomes harder to visualize and interpret, and fitting can be computationally intensive with large data.

We fit a multiple logistic regression model for Good Wine (Yes=1, No=0) using eight predictors: fixed acidity, volatile acidity, residual sugar, chlorides, total sulfur dioxide, density, sulphates, and alcohol. The estimated model (coefficients \(b_i\)) can be written as:

\[ \hat{\pi} = \frac{1}{1 + \exp\left(-(b_0 + b_1\cdot \text{fixed.acidity} + b_2\cdot \text{volatile.acidity} + \cdots + b_8\cdot \text{alcohol})\right)}. \]

Suppose the fitted coefficients (not all shown here) include \(b_{\text{alcohol}} \approx 0.78\) and \(b_{\text{volatile.acidity}} \approx -0.29\) (with intercept \(b_0\) large in magnitude to adjust for the scale of predictors). The positive coefficient for alcohol means higher alcohol content increases the odds of a good wine, holding other variables constant. In fact, \(\exp(0.78) \approx 2.18\), so for each 1% increase in alcohol, the odds of a wine being good are about 2.18 times greater (≈118% increase in odds). Conversely, volatile acidity has a negative coefficient; higher volatile acidity (which often imparts an unpleasant vinegar taste) decreases the odds of a good rating. Each unit increase in volatile acidity multiplies the odds of being good by \(\exp(-0.29) \approx 0.75\), i.e. a 25% reduction in odds for a one-unit rise in volatile acidity.

Other predictors follow suit: fixed acidity, residual sugar, sulphates, and alcohol had positive effects (higher values tend to improve quality odds), whereas volatile acidity, chlorides, total sulfur dioxide, and density had negative effects (higher values make a wine less likely to be rated good). The magnitude of coefficients on the log-odds scale is not straightforward to interpret directly, so we focus on their sign and on the odds ratios. It’s more interpretable to say “one unit increase in sulphates multiplies the odds of good wine by some factor” than to interpret the raw log-odds change. Often, we also examine which effects are statistically significant (e.g. \(p<0.05\)).

Predicted probabilities: We can use the model to predict \(\pi\) for specific wine profiles. For example, consider a wine with fixed acidity 6.2, volatile acidity 0.36, residual sugar 2.2, chlorides 0.095, total sulfur 42, density 0.9946, sulphates 0.57, and alcohol 11.7 (these are one wine’s values). Plugging into the model yields an estimated probability \(\hat{\pi} \approx 0.141\) (14.1%) that this wine is “good.” Indeed, in the data this wine was rated not good (0). We can also compute a 95% confidence interval for the true probability for wines with those features; suppose it comes out to (0.089, 0.195). We would say: for wines with that composition, we are 95% confident the probability of a good rating is between 8.9% and 19.5%.
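Mechanically, the prediction is a dot product of coefficients and feature values passed through the logistic function. The coefficient vector below is invented (only the alcohol ≈ 0.78 and volatile acidity ≈ −0.29 values are suggested in the text; the rest, and the intercept, are made up to illustrate the computation):

```python
import math

def predict_prob(intercept, coefs, x):
    """Predicted P(Y=1): logistic of the linear predictor b0 + sum(b_i * x_i)."""
    eta = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients, in the order: fixed.acidity, volatile.acidity,
# residual.sugar, chlorides, total.sulfur.dioxide, density, sulphates, alcohol.
coefs = [0.20, -0.29, 0.05, -4.0, -0.01, -50.0, 2.0, 0.78]
b0 = 36.0   # invented intercept, large to offset the scale of density

# The wine profile from the text.
wine = [6.2, 0.36, 2.2, 0.095, 42.0, 0.9946, 0.57, 11.7]
p = predict_prob(b0, coefs, wine)

# Raising alcohol (last entry) with everything else fixed raises the probability.
wine_hi = wine[:-1] + [wine[-1] + 1.0]
p_hi = predict_prob(b0, coefs, wine_hi)
```

With these invented coefficients the profile gets a low predicted probability, and the alcohol increase moves it up, mirroring the signs discussed above.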

Model performance: A pseudo-\(R^2\) (e.g. McFadden’s) was around 0.313 for this wine-quality model. This indicates the model explains a moderate portion of the deviance in wine quality. (By McFadden’s rule of thumb, 0.313 is a decent fit – recall 0.2–0.4 is considered excellent.)

Below we show boxplots of each predictor for bad vs good wines. These illustrate the direction of effects discussed above – for instance, good wines tend to have higher alcohol and sulphates, and lower volatile acidity, chlorides, sulfur, and density, on average:

Physicochemical predictors vs wine quality. Each subplot is a boxplot comparing “bad” (not good) vs “good” wines on a given variable. Good wines show higher median fixed acidity, sulphates, and alcohol; and lower median volatile acidity, chlorides, total sulfur dioxide (tsulf), and density, consistent with the logistic model coefficients.

In fitting the model, we would examine the significance of each coefficient. Suppose in this example all eight predictors were significant at \(\alpha=0.05\) when included together (this often happens when we have a large sample, but in practice some variables might drop out if not significant). Significant positive predictors (like alcohol, sulphates) contribute meaningfully to quality prediction, while significant negative ones (volatile acidity, density, etc.) detract from quality. Non-significant predictors could be removed for parsimony.

As an example of interpretation: for alcohol, \(b_{\text{alcohol}} \approx 0.78\) gave OR \(\approx 2.18\). We can say: For one unit (% vol) increase in alcohol content, the odds of the wine being rated good increase by a factor of ~2.18 (a 118% increase), holding all other variables constant. Similarly, if total sulfur dioxide had an estimated coefficient of about –0.005 (just as an illustration), then an extra 10 mg/L of sulfur dioxide would yield \(e^{-0.005 \cdot 10} \approx 0.951\), i.e. about a 5% drop in the odds of a good rating. Often, we focus on whether the effect is positive/negative and on its OR magnitude rather than the raw coefficient.

Finally, we can generate predicted probability curves. For instance, holding all other variables at their mean, we can see how the predicted chance of a good wine increases as alcohol goes up from, say, 8% to 15%. The model might predict probabilities from very low (~5% at 8% alcohol) up to quite high (~80% or more at 15% alcohol). Keeping other factors fixed at average, increasing alcohol yields a sharply rising probability of quality.

13 Comparing Logistic Models

After fitting a logistic model, we need to assess its adequacy and possibly compare it to alternative models. Residual analysis in logistic regression is trickier than in linear regression because \(Y\) is 0/1, so the raw residual \(e = y - \hat{p}\) takes on only two possible values for each \(\hat{p}\) (either \(1-\hat{p}\) or \(0-\hat{p}\)). A residuals vs fitted plot will show two discrete bands: one for points where \(y=1\) and one where \(y=0\) (above and below the zero line):

Residuals vs. fitted values for a logistic model (simulated loan default example). The pattern shows two streaks: if the actual outcome \(y=1\), the residual is \(1 - \hat{p}\) (points in the upper band); if \(y=0\), the residual is \(0 - \hat{p} = -\hat{p}\) (points in the lower band). The discrete nature stems from the binary response.

As seen above, the binary nature of \(Y\) leads to residuals that are not symmetrically distributed around 0 as in linear regression – they cluster in two lines. This makes some usual diagnostics (like checking homoscedasticity or normality of residuals) not directly applicable. Nonetheless, we can still check for lack of fit (e.g. via the Hosmer–Lemeshow test) or outliers (points with large deviance residuals).

If our goal is prediction, we might focus on classification accuracy, ROC curves, etc., rather than residuals. Logistic regression can be evaluated by how well it classifies outcomes at a chosen cutoff (like predicting 1 if \(\hat{p}>0.5\)).
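Classification at a cutoff is straightforward to sketch. The predicted probabilities and outcomes below are invented; the point is the mechanics of thresholding \(\hat{p}\) and scoring accuracy:

```python
def classify(probs, cutoff=0.5):
    """Predict 1 when the estimated probability exceeds the cutoff, else 0."""
    return [1 if p > cutoff else 0 for p in probs]

def accuracy(y_true, y_pred):
    """Fraction of observations classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical predicted probabilities and observed 0/1 outcomes.
probs = [0.9, 0.2, 0.7, 0.4, 0.6]
y     = [1,   0,   1,   1,   0]
acc = accuracy(y, classify(probs))
```

Varying the cutoff trades sensitivity against specificity, which is exactly what an ROC curve summarizes across all cutoffs.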

In practice, we may try multiple logistic models and select the best. Nested models (where one model is a subset of another’s predictors) can be compared by a likelihood-ratio chi-square test. This test computes \(G^2 = -2(\text{LL}_{\text{reduced}} - \text{LL}_{\text{full}})\), which under \(H_0\) (no improvement by added predictors) follows \(\chi^2\) with degrees of freedom equal to the number of additional predictors. A significant chi-square means the fuller model gives a significantly better fit. For example, say we previously modeled loan default using only FICO score. If we consider adding purpose of loan (a categorical variable, e.g. credit card, educational, etc.), we can do a chi-square test for the six added coefficients (since “purpose” has 7 categories, one baseline). If the chi-square statistic is large and \(p<0.0001\), we conclude loan purpose provides significant additional predictive power (improving the model beyond FICO alone).
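The likelihood-ratio test can be sketched without any statistics library, because the chi-square survival function has a closed form when the degrees of freedom are even (as here, with 6 added coefficients). The log-likelihoods below are invented; they are chosen so that \(G^2 = 50\):

```python
import math

def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square with an EVEN number of df (closed form):
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!"""
    half = x / 2.0
    total = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * total

def lr_test(ll_reduced, ll_full, df):
    """Likelihood-ratio statistic G^2 = -2(LL_reduced - LL_full) and its
    chi-square p-value on `df` degrees of freedom."""
    g2 = -2.0 * (ll_reduced - ll_full)
    return g2, chi2_sf_even_df(g2, df)

# Hypothetical log-likelihoods; loan purpose adds 6 coefficients
# (7 categories, one baseline), so the test has 6 degrees of freedom.
g2, p = lr_test(ll_reduced=-2100.0, ll_full=-2075.0, df=6)
```

With these numbers the p-value is far below 0.0001, so the fuller model would be judged a significant improvement.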

For non-nested models, we rely on information criteria like AIC. The model with the lower AIC is preferred (it balances fit and complexity). Automated procedures like backward stepwise selection can be applied for logistic regression just as in linear models. We start with a full model of all candidate predictors and iteratively remove the least useful predictor (the one whose removal yields the largest drop in AIC) until no further improvement occurs. The result is a parsimonious model that still predicts well.
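AIC itself is just a penalized log-likelihood, \(\text{AIC} = 2k - 2\,\text{LL}\), so a comparison is a one-liner. The fits below are hypothetical, chosen so the richer model’s log-likelihood gain outweighs its six extra parameters:

```python
def aic(ll, k):
    """Akaike information criterion: 2k - 2*LL, where k is the number of
    estimated parameters. Lower AIC is preferred."""
    return 2 * k - 2 * ll

# Hypothetical fits for the loan-default example.
aic_fico_only    = aic(ll=-2100.0, k=2)   # intercept + FICO
aic_with_purpose = aic(ll=-2075.0, k=8)   # + 6 purpose dummies
```

Here the model with purpose has the lower AIC, so it wins despite its added complexity; if the log-likelihood gain had been smaller than the penalty, the simpler model would be retained.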

Example (loan default): Imagine we have a dataset of loans with borrower info and whether they defaulted (not fully paid). A simple model used FICO score to predict default: higher FICO (better credit) should decrease default risk. Indeed, a logistic fit would show a downward sloping S-curve – low FICO borrowers have a high probability of default, which falls as FICO increases. We can then add loan purpose to see if it helps. The likelihood-ratio test might yield, say, \(\chi^2 = 50\) on 6 df (\(p \ll 0.0001\)), indicating purpose is significant. Thus, “credit card debt” vs “educational loan” vs “small business loan,” etc., have different baseline risks of default even controlling for FICO.

Visualization: The logistic curve and data can be plotted to see model fit:

Logistic fit for loan default vs FICO score. Gray “x” marks are individual loans (jittered vertically: 1 = defaulted, 0 = paid). The blue curve is the predicted probability of default from a logistic model using only FICO. It shows that borrowers with low FICO (around 550–600) have a high default probability (50–80%), while those with very high FICO (800+) have a very low default chance (<5%). The model captures the trend that default risk decreases as credit score improves.

As shown, the logistic model provides a smooth probability that aligns with the 0/1 outcomes. We also notice that actual data points are either at 0 or 1 on the y-axis (since either they defaulted or not). The model’s role is to give an estimated probability in between.

If we overlay another model’s predictions, or include the purpose variable (which shifts the curve up or down for different loan types), we could compare the curves or use metrics like AUC (area under ROC) to gauge improvement in classification performance.

14 Ordinal Logistic Regression

Sometimes the outcome variable is ordinal – it has a natural order but isn’t numerical per se. For example, college juniors might be asked: How likely are you to apply to graduate school? with responses: “unlikely” (0), “somewhat likely” (1), or “very likely” (2). These categories have an order (unlikely < somewhat < very likely), but the “distance” between them isn’t strictly equal. Treating this as numeric (0,1,2) and doing linear regression would be inappropriate. Instead, we use an ordinal logistic regression (also known as the proportional odds model).

In an ordinal logistic model, we essentially fit cumulative logit functions. For \(J\) ordered categories, the model estimates \(J-1\) intercept terms (thresholds) \(\alpha_1, \alpha_2, ..., \alpha_{J-1}\) and a common set of slopes \(\beta\) for predictors. For example, with the grad school likelihood data, \(J=3\) (“unlikely,” “somewhat,” “very”) so we have two cutpoints. The model might assume:

\[ \log\frac{P(Y > 0)}{P(Y \le 0)} = \alpha_1 + \beta_1 x_1 + \beta_2 x_2 + \cdots, \]

\[ \log\frac{P(Y > 1)}{P(Y \le 1)} = \alpha_2 + \beta_1 x_1 + \beta_2 x_2 + \cdots, \]

with the same \(\beta\) across equations (this is the proportional odds assumption). In our example, predictors might include parental education (whether parents attended college, coded 0/1 as pared), school type (public=1 vs private=0), and GPA.
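Given the cutpoints and the shared slopes, the probability of each category is a difference of adjacent cumulative probabilities. A minimal sketch under the parameterization above (the slope values loosely follow the text’s pared ≈ 1.05 and GPA ≈ 0.617; the cutpoints are invented):

```python
import math

def logistic(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def ordinal_probs(alphas, beta_x):
    """Category probabilities under the cumulative-logit model
    log[P(Y > j) / P(Y <= j)] = alpha_{j+1} + beta'x, with the same slopes
    at every cutpoint. `alphas` must be decreasing so the cumulative
    probabilities nest properly."""
    exceed = [logistic(a + beta_x) for a in alphas]   # P(Y>0), P(Y>1), ...
    probs = [1.0 - exceed[0]]
    probs += [exceed[j] - exceed[j + 1] for j in range(len(exceed) - 1)]
    probs.append(exceed[-1])
    return probs   # [P(Y=0), P(Y=1), ..., P(Y=J-1)]

# Hypothetical student: parents attended college (pared=1), GPA = 3.0.
eta = 1.05 * 1 + 0.617 * 3.0
p_unlikely, p_somewhat, p_very = ordinal_probs(alphas=[-0.5, -2.5], beta_x=eta)
```

The three probabilities always sum to 1, and a larger linear predictor shifts mass toward the higher categories, which is exactly what the positive pared and GPA coefficients encode.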

Fitting this model yields coefficients that we interpret via odds ratios similarly, but now it refers to odds of being in a higher category versus all lower categories combined. Suppose the results gave:

  • \(\beta_{\text{pared(attend)}} \approx 1.05\) (meaning if parents went to college, it increases the log-odds of answering in a more likely category). Exponentiating, \(\exp(1.05) ≈ 2.85\). This implies students whose parents attended college have about 2.85 times the odds of being in a higher likelihood category (somewhat/very likely vs unlikely) compared to those whose parents did not. Equivalently, those with college-educated parents have much lower odds of being in the “unlikely” group – parental education strongly and positively influences aspiration to grad school.

  • \(\beta_{\text{public school}} \approx -0.058\) (for public vs private undergrad). This gives OR \(= e^{-0.058} ≈ 0.94\). So attending a public college slightly reduces the odds of a higher likelihood category, by about 5.6% compared to a private college (OR ~0.94). In other words, private school students seem a tad more likely to consider grad school (perhaps due to differences in counseling, environment, etc.). With an OR this close to 1 the effect may well not be statistically significant, but assume for illustration that it is.

  • \(\beta_{\text{GPA}} \approx 0.617\) for each 1 point GPA increase. OR \(= e^{0.617} ≈ 1.85\). This means each one-point higher GPA multiplies the odds of being in a higher likelihood category by ~1.85 (an 85% increase). So stronger students academically are much more inclined towards grad school.

These interpretations can be a bit confusing due to the “cumulative” nature of odds. Another way to say it: For a one-unit GPA increase, the odds of being “very or somewhat likely” (vs “unlikely”) increase by 85%, and likewise the odds of “very likely” (vs “somewhat or unlikely”) increase by 85%, assuming proportional odds. Similarly, for parental education: those whose parents went to college have 2.85 times the odds of being “very likely” vs not, relative to those whose parents didn’t. The proportional odds model implies the OR is the same no matter where we draw the line between categories (here, that OR = 2.85 applies whether we compare “≥somewhat likely” vs “unlikely” or “≥very likely” vs others).

The ordinal model uses all the information about ordering and typically gives more power than a series of binary models. One can visualize the data by faceting or using boxplots of GPA by category and group, for instance:

(Imagine a faceted plot here: GPA by likelihood category, separated by parent education and school type. It would show, for example, that among those whose parents did not attend college (pared=0), the GPA of “very likely” students tends to be higher than those “unlikely” to go to grad school. Similarly, private school students might show higher propensity categories at a given GPA than public, albeit subtle.)

Overall, the ordinal logistic regression allowed us to make use of the ordered categories efficiently. We confirmed our intuition: having college-educated parents and a higher GPA both significantly push students up the scale of considering grad school (with sizable ORs ~2.85 and ~1.85), while public vs private school had a much smaller effect (OR ~0.94) that might not even be significant. The model’s proportional odds assumption should be checked (there are tests for whether one set of \(\beta\) is valid for all cutpoints). If that assumption fails, one might need a more complex model (like different slopes for different logits or an adjacent-categories model).

Reference interpretation: These results align with UCLA’s FAQ example – for parental education: “Parents did attend college: the odds of being more likely to apply (very or somewhat likely vs unlikely) are 2.85 times those of students whose parents did not”; for GPA: “each 1-unit GPA increase multiplies the odds of a higher category by 1.85 (an 85% increase).”

15 Multinomial Logistic Regression

Finally, consider outcomes that are categorical but not ordered – e.g. types of programs students choose: academic, vocational, or general (a common example from education studies). We can’t rank these choices, so we use a multinomial logistic regression. This is essentially like doing multiple binary logistic comparisons by choosing a baseline category.

For instance, using a dataset of 200 students with outcome prog (1 = general, 2 = academic, 3 = vocational program), predictors might be SES (socio-economic status: low, middle, high) and writing score (continuous). We treat (say) academic as the baseline outcome. The model will produce two sets of coefficients: one for predicting “general” vs “academic”, and one for “vocational” vs “academic”. Each set has its own intercept and slopes.

Suppose the fitted model yields (numbers illustrative):

  • For general vs academic: \(\beta_{\text{SES=high}} = -1.165\) (with low SES as baseline, say middle SES had \(\beta=-0.5\), high \(\beta=-1.165\)). Exponentiating, OR (RRR) for SES=high is ~0.312. This would mean students from high SES families are far less likely to be in a general program relative to academic (RRR ~0.31). In other words, high-SES kids favor the academic track (consistent with expectations). Conversely, low SES is associated with the general program: low-SES students have a higher relative risk of general vs academic, which is exactly what the RRR below 1 for high SES implies.

  • The coefficient for writing score (continuous) in the general vs academic equation might be \(\beta_{\text{write, gen}} = -0.058\). Then \(e^{-0.058} ≈ 0.94\). This means each point increase in writing test score multiplies the odds of being in general (vs academic) by 0.94 (i.e. a 6% decrease). Better writers lean towards the academic program.

  • For vocational vs academic: perhaps \(\beta_{\text{SES=high}} = -0.791\) (RRR ~0.454). So high SES kids are also less likely to be in vocational vs academic (though not as extremely as for general). The writing score coefficient might be different here; say it also came out as \(\beta_{\text{write, voc}} = -0.058\). If it were the same −0.058, that would suggest writing score favors academic over either other program to a similar degree (this is hypothetical; in real data the two coefficients typically differ).

Multinomial logit results are often presented as relative risk ratios (RRR) for each predictor and outcome category. An RRR is basically the exponentiated coefficient (analogous to an odds ratio, but in this context it’s the ratio of probabilities of outcome \(m\) vs baseline outcome). For example, \(RRR_{\text{SES=high, gen}} = 0.3126\) means the relative risk of choosing General (over Academic) for high SES students is about 0.31 times that of low SES students. Conversely, one could invert it: low SES have ~3.2 times higher risk of general vs academic compared to high SES. For writing score, \(RRR_{write, gen} = 0.9437\) means each point increase in writing score multiplies the relative risk of General vs Academic by 0.94 (i.e. reduces it by ~5.6%). So higher scores tilt students toward the academic program.
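Because the baseline-category model gives one linear predictor per non-baseline outcome (and implicitly 0 for the baseline), the category probabilities are a softmax over those predictors. The intercepts below are invented; the slopes loosely follow the illustrative numbers above:

```python
import math

def multinomial_probs(etas):
    """Baseline-category logit model: `etas` are the linear predictors for
    each non-baseline category (the baseline's predictor is implicitly 0).
    The softmax yields [P(baseline), P(category 1), P(category 2), ...]."""
    exps = [1.0] + [math.exp(e) for e in etas]   # exp(0) = 1 for the baseline
    total = sum(exps)
    return [v / total for v in exps]

# Hypothetical student: high SES, writing score 55 (academic = baseline).
# Intercepts 2.85 and 3.0 are invented; slopes echo the text's illustration.
high_ses, write = 1, 55
eta_gen = 2.85 - 1.165 * high_ses - 0.058 * write   # general vs academic
eta_voc = 3.00 - 0.791 * high_ses - 0.058 * write   # vocational vs academic
p_acad, p_gen, p_voc = multinomial_probs([eta_gen, eta_voc])
```

Note that \(e^{\eta_{\text{gen}}}\) recovers exactly the ratio \(P(\text{general})/P(\text{academic})\), which is why exponentiated coefficients are read as relative risk ratios against the baseline.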

These interpretations align with intuition: high-performing (high writing) and high-SES students favor the academic program, whereas lower SES or lower scoring students more often end up in general or vocational programs.

If SES had 3 levels, we’d have two dummy predictors (e.g. middle vs low, high vs low). A joint chi-square can test if SES overall has an effect (analogous to the chi-square in Stata output for the set of SES coefficients). In our hypothetical, it clearly does.

One must be careful not to interpret these like simple odds ratios of one outcome vs another for a given level – they are always relative to baseline category. If we wanted to know the ratio of probability of being in general vs vocational for a predictor, we’d need to combine the two equations or pick a different baseline. Typically, one baseline is enough because if you want (for completeness) all pairwise comparisons, you can derive them from the ones given.

Summary: Multinomial logistic regression generalizes binary logistic to \(K>2\) outcomes. It yields a set of \(K-1\) logit equations. The coefficients can be exponentiated to give RRRs, which are interpreted similarly to odds ratios: \(RRR>1\) means higher relative risk of that outcome vs baseline as predictor increases, \(RRR<1\) means lower risk.

In our example, the RRR for writing score was ~0.94 for general vs academic, and perhaps ~0.94 for vocational vs academic as well – indicating each point of writing score consistently lowers odds of non-academic programs. The RRR for high SES (vs low SES) might have been ~0.31 for general and ~0.45 for vocational, indicating a strong SES effect particularly on avoiding the general track. These numbers are just illustrative; real data analysis would give precise estimates and \(p\)-values (which we’d check to see which effects are significant).

16 Conclusion

Logistic regression is a powerful tool for categorical outcomes. In binary logistic regression, we model \(\log(\text{odds})\) as a linear function of predictors, which ensures predicted probabilities remain in \([0,1]\). We saw how to interpret coefficients in terms of direction (sign) and effect size via odds ratios, and how to make probability predictions for specific cases. We also introduced pseudo-\(R^2\) as a rough gauge of fit (e.g. McFadden’s \(R^2\)) – while not the same as variance explained, it helps compare models.

For model building and comparison, logistic regression parallels linear regression: we can perform significance tests (likelihood-ratio chi-square) for nested models and use criteria like AIC for model selection. We can include multiple predictors to improve predictions (at the cost of complexity) and use stepwise procedures to simplify models. Diagnostic plots of residuals vs fitted values show distinct patterns due to binary \(Y\), but overall model adequacy can be assessed with tests and predictive metrics.

We extended the binary logistic model to ordinal outcomes (using a proportional odds model) and to multinomial outcomes. Ordinal logistic regression allowed us to exploit the natural ordering of categories, yielding one set of coefficients that shifts the cumulative odds at different cutpoints. We interpreted those in terms of odds of being in higher vs lower outcome categories. Multinomial logistic regression gave us separate comparisons of each non-baseline category to a baseline, and we interpreted the results via relative risk ratios. Both ordinal and multinomial models are vital in International Business (IB) research and other fields when outcomes like customer satisfaction (low/medium/high), strategy choices, etc., are analyzed.

Logistic regression (simple or multiple) is analogous to linear regression in methodology (estimation, inference, prediction) but tailored to categorical dependent variables. It provides meaningful insights through coefficient signs and odds ratios, allows probability predictions for scenario analysis, and uses measures like pseudo-\(R^2\) to gauge goodness-of-fit. Comparing logistic models uses chi-square tests (for nested models) and AIC (for non-nested) similarly to how we use F-tests and adjusted \(R^2\) in linear models. With these extensions (ordinal, multinomial), logistic regression becomes a flexible framework capable of addressing a wide range of research questions in IB and beyond – from entry strategy choices under corruption conditions to consumer behavior segmentation.

Break-out Example: Uhlenbruck et al. (2006) studied how corruption impacts entry strategy in emerging economies. They used a multinomial logit model to analyze entry mode choices of telecom projects (e.g. joint venture vs wholly-owned) under different corruption levels. This is an example of logistic regression applied in IB research, where the outcome (entry strategy) had multiple categories influenced by country corruption level and firm factors. The insights from such models help inform managers of the odds or relative risks of choosing one strategy over another in corrupt environments, illustrating the practical value of logistic regression in organizational decision-making analysis.