10  Logistic Regression in IB Research (1/2)

The realm of regression analysis can extend beyond modeling continuous outcomes. In this chapter, we explore methods for handling a binary dependent variable (taking only two values, e.g. 0 or 1) and introduce logistic regression. To motivate this, consider a famous series of social psychology experiments by Stanley Milgram in the 1960s, which measured obedience to authority. Participants (as “teachers”) were instructed to deliver electric shocks to a “learner” (actually an actor) for wrong answers. Milgram unexpectedly found that a high proportion of subjects followed orders: 65% of participants administered the maximum shock level despite its severity. In statistical terms, each person’s behavior (refuse or obey) can be viewed as a trial with two possible outcomes. We call such a trial a Bernoulli random variable, conventionally labeling a “success” with 1 and a “failure” with 0. A Bernoulli variable has a single parameter p = P(success).

Logistic regression will allow us to model the probability of success (or other outcome of interest) as a function of predictor variables, addressing limitations of using linear regression for binary outcomes. Before delving into logistic regression, we review some key discrete distributions (Geometric, Binomial, Negative Binomial, Poisson) that often arise when modeling counts or binary events.

10.1 Distributions

Geometric Distribution: The geometric distribution models the waiting time until the first success in a sequence of independent Bernoulli trials. In our Milgram example, suppose we recruit people one by one until we find someone who refuses to administer the severe shock (a “success” in this context, with probability p = 0.35, since 35% refused). The probability that the first person is a success is simply 0.35. Finding the first success on the 3rd person means the first two were failures (each obeying with probability 0.65) and the 3rd was a success, so the probability is \(0.65^2 \times 0.35 \approx 0.15\). In general, if \(p\) is the success probability, the probability of seeing the first success on the \(n\)th trial is:

\[ P(\text{first success on } n\text{th trial}) = (1-p)^{\,n-1}\,p, \]

where \((1-p)\) is the probability of failure. The geometric distribution thus has a memoryless quality due to the independence and identical probability assumptions. Its expected value (mean number of trials to get one success) is \(1/p\), and standard deviation is \(\sqrt{\frac{1-p}{p^2}}\).
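
As a quick numerical check (the chapter specifies no software, so Python with scipy is assumed here), a minimal sketch reproducing the 3rd-person calculation and the mean and standard deviation for \(p = 0.35\):

```python
# Geometric distribution check for the Milgram example (p = 0.35 refusal rate).
from scipy.stats import geom

p = 0.35

# P(first success on the 3rd trial) = (1 - p)^2 * p
print(geom.pmf(3, p))   # ~0.148, matching (0.65^2)(0.35)

# Expected number of trials until the first success, and its standard deviation
print(geom.mean(p))     # 1/p ~ 2.86 people
print(geom.std(p))      # sqrt((1-p)/p^2) ~ 2.30
```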

Binomial Distribution: While the geometric counts trials until one success, the binomial distribution deals with a fixed number of trials \(n\) and models the number of successes \(k\) in those trials. For example, if we select 4 individuals for the Milgram experiment at random, the binomial distribution can give the probability that exactly \(k=1\) of them refuses to shock. The general formula is:

\(P(\text{exactly } k \text{ successes in } n \text{ trials}) = \binom{n}{k} p^k (1-p)^{\,n-k},\)

where \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\) is the number of distinct scenarios (combinations) with \(k\) successes in \(n\) trials. In our example, \(\binom{4}{1}=4\) scenarios yield one refusal, and each scenario (one success and three failures) has probability \(0.35^1 0.65^3 \approx 0.0961\), summing to \(4 \times 0.0961 = 0.3844\) (approximately 38.44%). The binomial distribution requires four conditions (often called a “binomial experiment”): (1) fixed number of trials \(n\); (2) independent trials; (3) each trial is success/failure; (4) constant probability of success \(p\) for each trial. Under these conditions, the binomial is appropriate for modeling counts of successes. Its mean and variance are \(E[K]=np\) and \(\mathrm{Var}(K)=np(1-p)\).
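
The four-person calculation can be reproduced with the same assumed scipy toolchain:

```python
# Binomial check: probability that exactly 1 of 4 randomly chosen
# participants refuses, with p = 0.35 per person.
from scipy.stats import binom

n, p = 4, 0.35
print(binom.pmf(1, n, p))   # C(4,1) * 0.35 * 0.65^3 ~ 0.384
print(binom.mean(n, p))     # np = 1.4 expected refusals
print(binom.var(n, p))      # np(1-p) = 0.91
```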

Negative Binomial Distribution: The negative binomial distribution describes the probability of observing the \(k\)th success on the \(n\)th trial of a sequence of Bernoulli trials. In contrast to the binomial (which fixes \(n\) and counts successes), the negative binomial fixes the number of successes and asks how long (how many trials \(n\)) it takes to achieve that many. The last trial must be a success in this formulation, and the same first three conditions (independent trials, success/failure outcomes, constant \(p\)) apply as for the binomial. The probability that the \(k\)th success occurs on trial \(n\) is given by:

\(P(k\text{th success on } n\text{th trial}) = \binom{n-1}{\,k-1\,}\, p^k (1-p)^{\,n-k},\)

for \(n = k, k+1, k+2, \dots\). Here \(\binom{n-1}{k-1}\) accounts for the ways in which the \(k-1\) earlier successes can occur in the first \(n-1\) trials, so that the \(n\)th trial is the \(k\)th success. When successes are rare (\(p\) small), the negative binomial describes how long one may have to wait to accumulate a given number of successes. Notably, if \(k=1\), the negative binomial reduces to the geometric distribution (waiting for the first success). A practical example: if the psychologist Dr. Smith repeats Milgram’s experiment and stops only when 10 people have refused (10 successes), the negative binomial can answer questions like “What is the probability the 10th refusal occurs on the 30th person tested?” (with \(p=0.35\) for refusal). Using the formula: \(P(\text{10th success on 30th trial}) = \binom{29}{9}(0.35)^{10}(0.65)^{20} \approx 0.050\).
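
A short sketch (again assuming scipy) verifies this value both from the formula directly and from scipy's parameterization, which counts failures rather than trials:

```python
# Negative binomial check: probability that the 10th refusal (success)
# occurs exactly on the 30th participant, with p = 0.35.
from math import comb
from scipy.stats import nbinom

p, k, n = 0.35, 10, 30

# Direct formula: C(n-1, k-1) * p^k * (1-p)^(n-k)
print(comb(n - 1, k - 1) * p**k * (1 - p)**(n - k))   # ~0.050

# scipy parameterizes by the number of failures (n - k) before the k-th success
print(nbinom.pmf(n - k, k, p))                        # same value, ~0.050
```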

The difference between binomial and negative binomial is worth emphasizing: in a binomial model we typically have a predetermined number of trials and count successes, whereas in a negative binomial scenario we have a predetermined number of successes to observe and count how many trials are needed (with the final trial being a success). Thus, the negative binomial is useful for “waiting time” problems involving a fixed target count of successes.

Poisson Distribution: The Poisson distribution is often used as a model for the number of rare events occurring in a fixed interval of time or space, under conditions of independence. It is characterized by a rate parameter \(\lambda\), which represents both the expected number of occurrences in the interval and the variance of the distribution. If events happen independently and on average \(\lambda\) events occur per unit time (or space), then the probability of observing exactly \(k\) events in one unit is:

\(P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots\)

The Poisson mean is \(E[X]=\lambda\) and the standard deviation is \(\sqrt{\lambda}\). This distribution emerges as an approximation to the binomial when \(n\) is large and \(p\) is small (such that \(np = \lambda\) is moderate), which is why it fits scenarios of infrequent events in large populations. For example, if in a rural area the power grid fails on average \(\lambda=2\) times per week, the Poisson model can estimate the chance of exactly 1 outage in a given week: \(P(X=1) = \frac{2^1 e^{-2}}{1!} \approx 0.27\). Generally, the Poisson is appropriate when events occur singly, independently, and with a roughly constant average rate. (If observed variance significantly exceeds the mean, other models like negative binomial might be more appropriate due to over-dispersion.)
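
The outage example can be checked the same way (scipy assumed):

```python
# Poisson check: lambda = 2 outages per week on average.
from scipy.stats import poisson

lam = 2
print(poisson.pmf(1, lam))   # P(exactly 1 outage) = 2 * e^-2 ~ 0.271
print(poisson.mean(lam))     # E[X] = lambda = 2
print(poisson.std(lam))      # sqrt(lambda) ~ 1.41
```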

10.2 Motivating Example: Loan Defaults

Consider a peer-to-peer lending context such as LendingClub, which connects borrowers with individual lenders. We have data on 9,578 three-year loans issued via LendingClub between May 2007 and February 2010. Each loan record includes whether the loan was not fully paid (we’ll call this a default for simplicity) and attributes like the borrower’s FICO credit score at loan application. The dependent variable \(Y\) of interest is default, coded as 1 if the borrower failed to pay back the loan in full (default or “charge off”), and 0 if the loan was paid back. This is a binary outcome. We want to understand and predict the probability of default using the borrower’s FICO score (and potentially other factors).

If we attempt to visualize the relationship between FICO score and default outcome, a simple scatter plot is not very informative because \(Y\) takes only 0/1 values. Instead, one might consider the linear probability model (i.e., run a linear regression of \(Y\) on FICO) as a naive first approach. However, linear regression has strong assumptions that are violated here: it treats \(Y\) as continuous and normally distributed, which is inappropriate for binary data. Indeed, fitting an OLS line to this data yields a prediction equation like \(\hat{Y} = 1.187 - 0.001445 \times \text{FICO}\). This suggests that each one-point increase in FICO linearly lowers the default probability by about 0.00145. While the coefficient appears statistically significant, the model is problematic conceptually. For a high FICO score of 825, the linear model would predict \(\hat{Y} \approx -0.005\), i.e. a negative probability of default, which is nonsensical. Likewise, for very low FICO, it could predict a probability above 1. The fundamental issue is that linear regression does not constrain predictions to the [0,1] interval, so it can produce impossible probability estimates. In other words, nothing in the linear model guarantees the outputs are valid probabilities. This also undermines the model assumptions (errors would be heteroskedastic and non-normal when \(Y\) is 0/1). Thus, using a linear model for a binary outcome is generally not suitable, especially for extrapolation, because it can yield unbounded or meaningless predictions and violate statistical assumptions.
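
To make the boundary problem concrete, the sketch below simply evaluates the reported OLS line at a few scores; the coefficients are the ones quoted above, and the point is only that nothing constrains the output to [0, 1]:

```python
# Evaluate the linear probability model Y_hat = 1.187 - 0.001445 * FICO.
import numpy as np

fico = np.array([500, 650, 700, 825])
y_hat = 1.187 - 0.001445 * fico
for f, y in zip(fico, y_hat):
    print(f"FICO {f}: predicted 'probability' = {y:.3f}")
# FICO 825 already gives a negative value (~ -0.005); sufficiently extreme
# inputs would push the line below 0 or above 1, which is not a probability.
```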

If the only tool you have is a hammer, you are tempted to treat everything as a nail. In our case, linear regression is the hammer – but clearly an inappropriate one for this binary nail. We need a different tool specifically designed for binary response data. This leads us to logistic regression, which addresses the shortcomings by modeling probability in a way that inherently respects the 0–1 bounds.

10.3 The Basics of Logistic Regression

Logistic regression is part of the family of generalized linear models (GLMs), tailored for binary dependent variables. Instead of modeling \(E[Y]\) as a linear combination of predictors (which led to unbounded probabilities), logistic regression models the log-odds of the event \(Y=1\) as a linear function. Equivalently, it models the probability \(P(Y=1)\) using the logistic (sigmoid) function to keep the result between 0 and 1. In the simplest case with one predictor \(x\), the logistic regression model is:

\(\pi(x) = P(Y=1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}.\)

Here \(\beta_0\) and \(\beta_1\) are parameters to be estimated from data. This can also be written as \(\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}\), which makes it clear \(\pi(x)\) is positive and, since the numerator can never exceed the denominator, \(\pi(x) < 1\). The logistic sigmoid function \(S(t) = 1/(1+e^{-t})\) traces an S-shaped curve ranging from 0 to 1. No matter what linear combination \(\beta_0+\beta_1 x\) we plug in (which could be any real number from \(-\infty\) to \(\infty\)), the logistic function will produce a value between 0 and 1. This guarantees predicted probabilities are valid. The curve is steepest at the midpoint (where \(\pi=0.5\)) and flattens out as it approaches 0 or 1 at the extremes, reflecting diminishing changes in probability for very low or very high \(x\) values. This S-shape often aligns better with real-world binary outcome behavior than a straight line does.
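
A few evaluations of the sigmoid make the bounding behavior concrete (Python assumed, as before):

```python
# The logistic (sigmoid) function maps any real number into (0, 1).
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

for t in [-100, -5, 0, 5, 100]:
    print(t, sigmoid(t))
# Outputs climb from ~0 to ~1, passing through 0.5 at t = 0, so a linear
# predictor beta0 + beta1*x of any size still yields a valid probability.
```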

To estimate the coefficients \((\beta_0, \beta_1)\) from data, logistic regression uses maximum likelihood estimation (MLE) rather than ordinary least squares. We choose the parameters that maximize the likelihood of the observed binary outcomes. Unlike linear regression, there is no closed-form solution for the best \(\beta\) values; they are typically found via iterative algorithms (such as Newton-Raphson). However, the principle is straightforward: we find the \(\beta_0, \beta_1\) that make the observed defaults and non-defaults most probable under the model. This approach to fitting is standard for logistic regression. (Indeed, the logistic model’s parameters are “most commonly estimated by maximum-likelihood estimation”.) Modern statistical software will do this optimization behind the scenes, yielding coefficient estimates often denoted \(b_0, b_1\).
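
In practice the fitting is a single call to a statistics package. The sketch below shows one way this might look with statsmodels; the file name and column names (`loans.csv`, `not_fully_paid`, `fico`) are placeholders, not the actual dataset's:

```python
# Fit the logistic regression of default on FICO by maximum likelihood.
# Assumes a CSV with a 0/1 column `not_fully_paid` and a numeric column
# `fico` (placeholder names for the LendingClub data).
import pandas as pd
import statsmodels.api as sm

loans = pd.read_csv("loans.csv")      # hypothetical file name

X = sm.add_constant(loans["fico"])    # adds the intercept column
y = loans["not_fully_paid"]

model = sm.Logit(y, X)
result = model.fit()                  # iterative MLE (Newton-type algorithm)
print(result.summary())               # b0, b1, standard errors, z, p-values
```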

In our LendingClub loan example, if we run a logistic regression of default (Y=1 for default) on FICO score, we might obtain something like:

\(\hat{\pi}(x) = \frac{1}{1 + \exp[-(6.72 - 0.012\,x)]},\)

where \(x\) is the FICO score. Here \(b_0 \approx 6.72\) and \(b_1 \approx -0.012\). These estimates imply that higher FICO scores are associated with lower probability of default (as expected, since \(b_1\) is negative). We can use this model to predict default probabilities. For example, a person with FICO 600:

\(\hat{\pi}(600) = \frac{1}{1 + \exp[-(6.72 - 0.012\times 600)]} \approx 0.38,\)

i.e., about a 38% predicted chance of default. For someone with an excellent FICO of 850:

\(\hat{\pi}(850) = \frac{1}{1 + \exp[-(6.72 - 0.012\times 850)]} \approx 0.03,\)

only around a 3% chance of default. These predictions make sense (they are in [0,1] and reflect the intuition that default probability decreases with FICO). The logistic model’s non-linearity captures the idea that going from a very low score to a moderate score yields a big drop in risk, whereas going from an already high score to an even higher score yields only a small further drop in risk – consistent with the S-shaped probability curve.
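
These predictions can be reproduced directly from the quoted coefficients (small differences from the figures above reflect rounding):

```python
# Predicted default probabilities from pi_hat(x) = 1 / (1 + exp(-(6.72 - 0.012*x))).
import numpy as np

b0, b1 = 6.72, -0.012                 # estimates quoted in the text

def p_default(fico):
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * fico)))

for score in [600, 700, 850]:
    print(score, round(p_default(score), 3))
# ~0.38 at 600, ~0.16 at 700, ~0.03 at 850 -- all inside [0, 1]
```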

It’s worth noting that in logistic regression, we often interpret coefficients in terms of odds ratios rather than raw probability changes. For instance, \(b_1=-0.012\) can be exponentiated: \(e^{-0.012} \approx 0.988\), meaning each 1-point increase in FICO multiplies the odds of default by about 0.988 (a 1.2% decrease in odds). But for now, we will focus on assessing the significance and fit of the model rather than detailed interpretation of coefficients.

10.4 Inference and Goodness of Fit

After fitting a logistic regression, we can perform hypothesis tests and compute confidence intervals for the parameters, similar to linear regression (though using \(z\)-tests or likelihood-ratio tests instead of \(t\)-tests). The primary question for inference on a predictor is usually: Does this predictor have a statistically significant relationship with the outcome? In our simple model, we test \(H_0: \beta_1 = 0\) (no effect of FICO) vs \(H_1: \beta_1 \neq 0\). The output from software will often provide a coefficient estimate \(b_1\) and its standard error, along with a Wald \(Z\) statistic and a p-value. For our loans example, \(b_1\) was about \(-0.0119\) with a standard error \(\approx 0.000823\). This yields a large \(|Z|\) and a p-value far below 0.05. Thus we reject \(H_0\) and conclude that FICO score is a significant predictor of default probability (here, a higher FICO significantly reduces the chance of default). In practical terms, the logistic model has provided evidence that credit score is associated with loan repayment odds.

We can also construct a confidence interval for \(\beta_1\). Using an approximate normal sampling distribution for \(b_1\), a 95% CI is \(b_1 \pm 1.96 \cdot SE(b_1)\). In our case, \(-0.0119 \pm 1.96(0.000823)\) gives roughly \((-0.0135, -0.0103)\). This interval does not include 0, reinforcing that the true effect is likely negative (and not zero). While the exact value of \(\beta_1\) on the log-odds scale is a bit abstract, the CI at least tells us the direction (negative) and that it’s statistically distinguishable from no effect.
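
Both the Wald test and the interval follow directly from the reported estimate and standard error; a minimal sketch (scipy assumed for the normal tail probability):

```python
# Wald z-test and approximate 95% CI for beta_1, using the reported values.
from scipy.stats import norm

b1, se = -0.0119, 0.000823

z = b1 / se                              # Wald statistic
p_value = 2 * norm.sf(abs(z))            # two-sided p-value
ci = (b1 - 1.96 * se, b1 + 1.96 * se)    # approximate 95% CI

print(z, p_value)                        # |z| ~ 14.5; p-value far below 0.05
print(ci)                                # roughly (-0.0135, -0.0103)
```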

Sometimes we care directly about confidence intervals for the predicted probability \(\pi(x)\). Logistic models are nonlinear, so the standard error for a predicted probability at a given \(x\) is obtained via the delta method or by software directly. For example, at FICO 700, our model predicted \(\hat{\pi}\approx0.169\). The 95% confidence interval for the true default probability at FICO 700 can be computed (often via software) as, say, (0.161, 0.177). We would report: for a borrower with a 700 FICO score, the model estimates about a 16.9% default probability, with a 95% CI of approximately 16.1% to 17.7%. This reflects uncertainty in the estimate of the curve.
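
One common way to obtain such an interval (an alternative to the delta method) is to build it on the logit scale and transform the endpoints through the sigmoid. The sketch below assumes the fitted `result` object from the earlier statsmodels sketch:

```python
# Approximate 95% CI for pi(x) at FICO 700, via the logit scale.
# Assumes `result` is the fitted Logit model from the earlier sketch,
# so cov_params() returns the 2x2 covariance matrix of (b0, b1).
import numpy as np

x = np.array([1.0, 700.0])          # intercept term and the FICO value
b = result.params.values            # [b0, b1]
V = result.cov_params().values      # estimated covariance matrix

eta = x @ b                         # linear predictor b0 + b1*700
se_eta = np.sqrt(x @ V @ x)         # standard error of the linear predictor

lo, hi = eta - 1.96 * se_eta, eta + 1.96 * se_eta
sigmoid = lambda t: 1 / (1 + np.exp(-t))
print(sigmoid(eta), (sigmoid(lo), sigmoid(hi)))   # point estimate and CI
```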

Goodness of fit: Unlike linear regression, logistic regression doesn’t have a straightforward \(R^2\) to summarize fit. However, there are pseudo-\(R^2\) measures. One common metric is McFadden’s \(R^2\), defined as:

\(R^2_{\text{McFadden}} = 1 - \frac{\ln L_{\text{model}}}{\ln L_{\text{null}}},\)

where \(\ln L_{\text{model}}\) is the log-likelihood of the fitted model and \(\ln L_{\text{null}}\) is the log-likelihood of a null model with only an intercept. This quantity gives a sense of how much better the model is at explaining the outcome compared to having no predictors at all. McFadden’s \(R^2\) is analogous to an \(R^2\) in that it ranges from 0 to 1 (the closer to 1, the better), but in practice its values are typically much lower than OLS \(R^2\) values. For logistic models, a McFadden \(R^2\) between 0.2 and 0.4 is considered an excellent fit. Values around 0.1 or 0.2 are more common for decent models, and a value as high as 0.5 would be extraordinarily good. In our loan default example, using only FICO score, we obtained \(R^2_{\text{McFadden}} \approx 0.027\) (about 2.7%). This is very low, indicating that FICO alone explains only a tiny improvement in log-likelihood over the intercept-only model. This isn’t surprising since many other factors influence loan default beyond credit score. In general, one should not be alarmed by low pseudo-\(R^2\) values in logistic regression – they simply operate on a different scale than the variance-explained metric of linear models. They are best used to compare relative fits of models rather than as an absolute measure of “variance explained”.
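
Continuing with the earlier statsmodels sketch, the pseudo-\(R^2\) is a one-line calculation from the two log-likelihoods (statsmodels also reports it directly as `prsquared`):

```python
# McFadden's pseudo-R^2 = 1 - lnL_model / lnL_null.
# Assumes `result` is the fitted Logit result from the earlier sketch;
# llf and llnull are its log-likelihoods for the fitted and intercept-only models.
mcfadden = 1 - result.llf / result.llnull
print(mcfadden)                 # ~0.027 for the FICO-only model described above

print(result.prsquared)         # statsmodels' built-in McFadden pseudo-R^2
```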

Lastly, it’s important to evaluate the model’s predictive accuracy and check for lack of fit. Common techniques include examining the classification table at a chosen probability cutoff (e.g., how many defaults vs non-defaults are correctly predicted if we predict default when \(\hat{\pi}>0.5\)), computing metrics like precision, recall, F1-score, or the area under the ROC curve (AUC). Residual analysis in logistic regression can involve looking at deviance residuals or leverage, but those are more advanced topics. For our purposes, recognizing that logistic regression allows sensible probability predictions and significance testing is the key takeaway.
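
A minimal evaluation sketch, again continuing from the earlier fit (`result`, `X`, `y`) and assuming scikit-learn for the metrics:

```python
# Classification table at a 0.5 cutoff, plus the area under the ROC curve.
from sklearn.metrics import confusion_matrix, roc_auc_score

p_hat = result.predict(X)              # predicted default probabilities
y_pred = (p_hat > 0.5).astype(int)     # classify as default if pi_hat > 0.5

print(confusion_matrix(y, y_pred))     # counts of correct and incorrect calls
print(roc_auc_score(y, p_hat))         # AUC uses the probabilities directly
```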

10.5 Conclusion

In this chapter, we introduced logistic regression as a solution for modeling binary outcome data, using the example of loan defaults and credit scores. We reviewed several discrete distributions (geometric, binomial, negative binomial, Poisson) that set the stage for understanding binary and count data scenarios. Logistic regression was derived as a generalized linear model using the logit link function, which transforms a linear predictor into a probability via the logistic (S-shaped) curve. This ensures predictions remain in \([0,1]\) and often better captures real-world probabilities than a linear model would. We saw how coefficients can be interpreted (especially through odds ratios) and how to assess their significance (e.g., testing \(\beta_1=0\)) using standard errors and p-values from maximum likelihood estimation. We also discussed measures of fit like McFadden’s pseudo-\(R^2\), noting that they are generally lower in value and interpretation differs from the familiar \(R^2\) of linear regression.

In summary, logistic regression expands our toolkit beyond linear regression, allowing us to tackle yes/no outcomes in a principled way. It is widely used in fields from finance (credit risk modeling) to healthcare (disease outcome predictions) to marketing (purchase vs no-purchase modeling), because it provides a clear probabilistic interpretation and handles binary data appropriately. In the next session, we will extend these ideas to multiclass outcomes (multinomial logistic regression) and ordered categories, continuing our exploration of regression techniques in International Business research and beyond.

10.6 References

  1. Biguru Blog – S. Kundu, “A Brief Introduction to Statistics – Part 2 – Probability and Distributions,” The Business Intelligence Blog, Dec 2014. (Bernoulli, Binomial, Negative Binomial, Poisson distributions)

  2. Wikipedia – “Milgram experiment,” Wikipedia, The Free Encyclopedia, last modified May 2023. (Summary of Milgram’s obedience experiment results)

  3. Amete DataScience Portfolio – “Random Forest Project (LendingClub loans dataset),” 2018. (Description of LendingClub 2007–2010 loan data and variables)

  4. Pedace, Roberto – “3 Main Linear Probability Model (LPM) Problems,” Econometrics For Dummies, Wiley, 2016. (Limitations of using linear regression for binary outcomes; unbounded probabilities issue)

  5. Masters in Data Science – “What Is Logistic Regression?” Online MS in Data Science, University of Wisconsin. (Logistic function as an S-shaped curve mapping real inputs to [0,1])

  6. Wikipedia – “Logistic regression,” Wikipedia, last modified Nov 2023. (Parameters of logistic regression are commonly estimated by maximum likelihood)

  7. NumberAnalytics Blog – Sarah Lee, “A Comprehensive Guide to McFadden’s R-squared in Logistic Regression,” Mar 13, 2025. (Definition of McFadden’s pseudo-\(R^2\) and typical value ranges in logistic models)

  8. CliffsNotes – “Math 2700 Project #2 – Probability Distributions,” Vancouver Community College (student paper on probability). (Poisson distribution mean = λ and standard deviation = √λ)

  9. OpenIntro Statistics – Lecture Notes: “Negative Binomial Distribution,” (Conditions for a negative binomial scenario: independent trials, success/failure, constant p, last trial a success)