9  Panel Data Analysis in IB Research and Other Advanced Considerations

“True education is a kind of never ending story — a matter of continual beginnings, of habitual fresh starts, of persistent newness.” – J. R. R. Tolkien

In this chapter, we focus on quantifying relationships between variables using advanced regression techniques relevant for international business (IB) research. International business data often involve observations of multiple firms or countries across time, which calls for specialized methods. We will specifically cover panel data analysis (also known as longitudinal or time-series cross-section data) and additional modeling considerations such as non-linear effects, categorical predictors, interaction terms, and model comparison. By the end, you should understand how to build and interpret regression models for panel data and know how to improve model fit and validity with various techniques.

Key goals include: (1) learning how to estimate and choose between fixed-effects vs. random-effects models in panel data, (2) extending linear regression with quadratic terms (to capture non-linearity), categorical independent variables (using dummy encoding), and interaction terms, and (3) comparing multiple models to select the best-fitting or most parsimonious model while avoiding pitfalls like overfitting.

9.1 Panel Data Analysis

Panel data (also called longitudinal data or cross-sectional time-series data) refers to datasets where multiple entities (e.g. persons, firms, countries) are observed at several time periods. This structure allows us to control for both cross-sectional differences and time-series changes simultaneously. Panel data models can account for unobserved heterogeneity across entities and over time, helping to avoid biased results when estimating the impact of variables. In other words, by following the same entities through time, we can separate out the effects of time-invariant characteristics from the variables of interest, yielding more robust insights than analyzing pure cross-section or pure time-series data alone.

For example, in IB research, panel data might consist of annual observations on a set of companies across several countries. Using panel regression, we could control for unobserved country-specific factors (like culture or legal environment) and firm-specific traits that don’t change over time, while still assessing the influence of our main predictors on an outcome (such as firm performance). This gives a richer analysis than a single snapshot in time.

Foundations of Panel Data: Fixed vs. Random Effects

Two major modeling approaches for panel data are fixed effects (FE) and random effects (RE) models. The core difference lies in how they treat the unobserved, time-invariant characteristics of the entities:

  • Fixed-effects models (FE) include an intercept term for each entity (or equivalently, they subtract each entity’s mean) to control for all time-invariant differences between entities. Essentially, FE uses each entity as its own benchmark. By doing so, fixed effects “remove the effect of those time-invariant characteristics” on the outcome and estimate the relationship between predictors and outcome within each entity over time. The key assumption of FE is that any time-invariant trait (e.g. a country’s culture, a firm’s industry, an individual’s gender) may correlate with the independent variables. By using FE, we control for that potential omitted bias, because each entity’s unique attributes (that don’t change over time) are effectively absorbed by its individual intercept. In other words, FE allows consistent estimation even if the entity-specific constants are correlated with the regressors. The downside is that FE cannot estimate the effects of any variable that doesn’t change over time (since those get differenced out or absorbed by the intercept).

  • Random-effects models (RE), on the other hand, assume that the unobserved entity-specific effect is a random variable uncorrelated with the regressors. Instead of giving each entity its own intercept, RE treats those intercepts as coming from a distribution (with a mean and variance). This approach is more efficient (lower standard errors) if its assumption holds, because it uses both within-entity and between-entity variation in the data. An RE model can include time-invariant covariates (since those are not eliminated as they are in FE). However, if the unobserved factors are in fact correlated with the independent variables, the RE estimates will be biased. In practical terms, RE is appropriate only if you believe that the unobservable differences between entities (the part captured by the random intercepts) are unrelated to your predictors. For example, in a panel of countries, an RE model might assume that country-specific effects (say, cultural attitudes) are random noise uncorrelated with the predictors in your model – a strong assumption.

Mathematically, a simple panel data model can be written as:

\[ Y_{it} = \alpha_i + \beta X_{it} + u_{it}, \]

where \(Y_{it}\) is the outcome for entity i at time t, \(X_{it}\) is a vector of predictors, and \(\alpha_i\) represents the entity’s individual effect. In a fixed-effects model, \(\alpha_i\) is a fixed (non-random) parameter to be estimated for each entity. In a random-effects model, we assume \(\alpha_i = \alpha + \zeta_i\) where \(\zeta_i\) is a random disturbance specific to entity i with mean zero, and uncorrelated with \(X_{it}\). The composite error term in RE is then \(\epsilon_{it} = \zeta_i + u_{it}\).

Assumptions recap: Fixed-effects models allow each entity to have its own intercept (\(\alpha_i\)), which captures all time-invariant factors for that entity. We do not require \(\alpha_i\) to be uncorrelated with the regressors – in fact, FE is robust even if there is correlation (this is why FE helps control for omitted variable bias due to unobserved constants). In contrast, random-effects models assume the entity-specific effect is random noise uncorrelated with the regressors. Under RE, time-invariant variables can be included as explanatory variables, but if the no-correlation assumption is false, RE estimates become inconsistent (biased). Thus, the choice between FE and RE often hinges on whether you suspect omitted factors (captured by \(\alpha_i\)) are correlated with your independent variables. When in doubt, analysts often prefer FE for a more reliable (though potentially less efficient) estimation.

Within vs. between variation: Another way to understand FE vs RE is to consider within-entity and between-entity information. FE models use only within-entity variation over time – effectively comparing each entity with itself across different years. Any between-entity differences (e.g., one firm consistently being larger than another due to time-invariant reasons) are swept out. RE models use a mixture of within and between variation; they try to explain differences across entities as well, under the assumption that those differences are random. This is why RE can estimate effects of variables that don’t change over time (using the cross-sectional variation), whereas FE cannot. But if those cross-sectional differences violate the RE assumption (correlation with predictors), then RE will give misleading results.

Example: Imagine we examine the effect of R&D expenditure on firm productivity using a panel of companies. Some companies (like tech firms) might have inherently higher productivity due to unobserved culture. A fixed-effects model will control for each firm’s inherent productivity level by giving each firm its own intercept; it will estimate the impact of changes in R&D within each firm. A random-effects model would assume those inherent differences are random and uncorrelated with R&D spending; it would use both each firm’s deviation from its own mean and differences between firms to estimate the R&D effect. If, realistically, firms with certain cultures both spend more on R&D and have higher productivity (i.e., the firm effect correlates with R&D), FE would absorb that firm effect into the firm-specific intercepts, whereas RE would attribute some of it incorrectly to R&D, biasing the estimate.
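To make the within-transformation behind fixed effects concrete, here is a minimal base-R sketch, assuming a hypothetical data frame firms with columns firm, rnd, and productivity (these names are illustrative, not from a dataset used in this chapter). Demeaning each variable by firm and running OLS on the demeaned values reproduces the fixed-effects slope; the standard errors differ slightly because the degrees of freedom are not adjusted for the absorbed intercepts.

# Within (demeaning) transformation by hand, for intuition only
firms$rnd_within  <- firms$rnd - ave(firms$rnd, firms$firm)                      # deviation from firm mean
firms$prod_within <- firms$productivity - ave(firms$productivity, firms$firm)

fe_by_hand <- lm(prod_within ~ rnd_within, data = firms)
coef(fe_by_hand)["rnd_within"]   # same slope a within (FE) estimator would report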

Step 1: Estimating a Panel Model (TSCS Regression)

To illustrate panel data modeling, imagine we have data on several countries observed over multiple years – a time-series cross-sectional (TSCS) dataset. We might want to model an outcome like government debt (grossdebt as % of GDP) as a function of predictors like business freedom, financial freedom, and investment freedom indices (busfreedom, finfreedom, invfreedom). These indices are measured annually for each country. Our data would have a structure such as country-year observations (e.g., USA-2010, USA-2011, …, Canada-2010, Canada-2011, etc.).

Using R’s plm package (which is designed for panel linear models), we can estimate both RE and FE models. First, we need to specify the panel structure by indicating the index (country and year):

library(plm)

mydata <- readr::read_csv("panel_data.csv")   # country-year observations

# Random-effects model (index identifies the entity and time dimensions):
model_re <- plm(grossdebt ~ busfreedom + finfreedom + invfreedom, 
                data = mydata, index = c("country","year"), model = "random")
summary(model_re)

This code fits a random-effects model. The argument model = "random" tells plm to use the Swamy-Arora transformation (a standard RE estimator). The model effectively assumes a common intercept (overall constant) plus a random deviation for each country. It uses both the variations within each country over time and between countries. After running summary(model_re), we would examine the coefficients and their significance. We might see output like:

  • Coefficient on busfreedom: say -0.5 (a one-unit increase in the Business Freedom index is associated with a 0.5 percentage-point decrease in government debt).
  • Coefficient on finfreedom: say -1.2 (higher Financial Freedom is associated with lower debt).
  • Coefficient on invfreedom: say 0.3 (higher Investment Freedom is associated with slightly higher debt).
  • An intercept term (the baseline level of debt when all indices are zero; since zero may lie outside the realistic range of those indices, the intercept is not always substantively important).

Next, we can estimate a fixed-effects model for comparison:

model_fe <- plm(grossdebt ~ busfreedom + finfreedom + invfreedom, 
                data = mydata, index = c("country","year"), model = "within")
summary(model_fe)

Using model = "within" in plm fits a fixed-effects (within) model. This effectively demeans the data by country: for each country, it subtracts the country’s mean from each variable, so that we analyze deviations from each country’s own average. In practice, plm will report the coefficients for busfreedom, finfreedom, invfreedom, but not an intercept (because intercepts are absorbed by the fixed effects of each country). The interpretation of coefficients in the FE model is “holding constant all time-invariant characteristics of each country, the effect of a one-unit increase in X on Y is …”.

Continuing our hypothetical output, summary(model_fe) might show:

  • Coefficient on busfreedom: -0.8 (perhaps slightly different from the RE estimate if between-country differences were affecting the RE result).
  • finfreedom: -1.0.
  • invfreedom: 0.1 (these are just illustrative numbers).

It might also report the fixed effects (one for each country). Often, we don’t list all of these intercepts in a report when there are many entities; it’s enough to know they were included.
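If you do want to inspect the estimated country intercepts, plm can return them directly. A brief sketch, continuing with the model_fe object from above:

fixef(model_fe)                  # one estimated intercept per country
head(sort(fixef(model_fe)), 3)   # e.g. the three countries with the lowest baseline debt levels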

Step 2: Fixed Effects vs. Random Effects Decision

After fitting both models, how do we decide which one to use – FE or RE? A classical approach is to perform a Hausman test. The Hausman test statistically checks whether the unique errors (the entity-specific effects) are correlated with the regressors. It essentially compares the coefficient estimates from FE and RE:

  • Null hypothesis (H₀): The preferred model is random effects (RE). In other words, differences across entities are not correlated with the independent variables. Under H₀, both FE and RE are consistent, but RE is efficient (more precise). If H₀ is true, FE and RE should yield similar coefficient estimates.
  • Alternative (H₁): The RE assumption is violated – meaning the entity-specific effects are correlated with the regressors. In this case, RE estimates are biased/inconsistent, and FE is the safer choice. Under H₁, FE and RE estimates will significantly differ.

We perform the test in R by doing:

# Hausman test: compares the FE and RE coefficient estimates
phtest(model_fe, model_re)

This yields a Chi-square statistic and a p-value. The decision rule is:

  • If the Hausman test p-value is > 0.05 (insignificant), we fail to reject H₀. This suggests that the RE model is acceptable – we don’t find evidence of problematic correlation. In this case, we may prefer the random-effects model for interpretation and inference, since it’s more efficient and can include time-invariant variables.
  • If the p-value is < 0.05 (significant), we reject H₀ in favor of H₁. This implies the RE assumption is likely false – the unique entity effects are correlated with predictors – so the RE model would be biased. Therefore, we should use the fixed-effects model. Essentially, a significant Hausman test indicates that FE provides consistent estimates while RE does not, so FE is the safer (and correct) choice.

In our example, suppose the Hausman test returns a Chi-square = 7.6 (with 3 degrees of freedom, one per regressor) and p-value = 0.055. That p-value is slightly above 0.05. We would fail to reject the null, leaning toward the RE model (as long as this makes sense theoretically). We might conclude that differences across countries in this dataset are not strongly biasing the results, and thus the more efficient RE estimates are preferable. On the other hand, if the p-value had been, say, 0.01, we would conclude that the FE model is necessary to get unbiased results.

It’s important to note that the Hausman test is sometimes sensitive to technical issues (it can give undefined results if the variance of the difference is negative, etc.), and in practice many analysts default to FE especially in observational social science data where correlation is expected. Also, if a key predictor is time-invariant (e.g., a country’s political system type in a short panel), a FE model cannot estimate its effect (because it’s collinear with the country fixed effects), so researchers might use RE for that reason while acknowledging its assumption. There are more advanced methods (like Mundlak’s approach or correlated random effects models) to handle this, but those are beyond our scope.

(Additional advanced considerations in panel data include handling unbalanced panels – where entities are observed for differing numbers of time periods – and including time fixed effects or even two-way FE models (both entity and time effects) if needed. In our discussion we focus on one-way FE for entities. If your data have systematic time trends or shocks common to all entities, adding time dummies (year fixed effects) can control for those. The panel could also be part of a multi-level or hierarchical structure, but that ventures into mixed models, which we will not cover here.)
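As a sketch of the time-effects point above, year effects can be added on top of the country fixed effects either as explicit year dummies or via plm’s two-way specification; pFtest() can then check whether the added time effects are jointly significant. This is illustrative only, continuing the grossdebt example (for unbalanced panels the two specifications can differ slightly):

# Year effects as dummies, on top of the country fixed effects:
model_fe_time <- plm(grossdebt ~ busfreedom + finfreedom + invfreedom + factor(year),
                     data = mydata, index = c("country","year"), model = "within")

pFtest(model_fe_time, model_fe)   # are the year effects jointly significant?

# Closely related two-way specification (country and year effects both absorbed):
model_fe2 <- plm(grossdebt ~ busfreedom + finfreedom + invfreedom,
                 data = mydata, index = c("country","year"),
                 model = "within", effect = "twoways")
summary(model_fe2)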

9.2 Additional Modeling Considerations

Beyond basic linear models with continuous predictors, we often need to address situations where relationships are non-linear, where predictors are categorical, or where the effect of one variable depends on another. This section covers three such extensions: quadratic effects (to model non-linearity), dummy variables for categorical data, and interaction terms for conditional relationships. We will also mention a brief note on model accuracy diagnostics which applies to all regression models.

1. Quadratic Effects (Non-Linearity)

Not all relationships between a predictor X and outcome Y are strictly linear (straight-line). Sometimes the effect of X on Y might increase or decrease in magnitude as X changes – suggesting curvature. A common way to capture simple non-linearity is by adding a quadratic term (X²) into the regression model. This allows the slope to change with the level of X.

Example: Consider a dataset of real estate sales where we want to predict house price (Y) from the living area in square feet (X). A linear model assumes each additional square foot adds a constant amount to the price, regardless of the house’s size. But perhaps large houses command a higher price per square foot than small houses (maybe due to luxury features in very large homes), indicating a nonlinear effect. We can model this by including \(X^2\):

  • Linear model: \(Price = \beta_0 + \beta_1 \times \text{LivingArea} + \varepsilon.\)
  • Quadratic model: \(Price = \beta_0 + \beta_1 \times \text{LivingArea} + \beta_2 \times (\text{LivingArea})^2 + \varepsilon.\)

The quadratic model adds curvature: if \(\beta_2 > 0\), the curve bends upward (convex), implying an accelerating increase in price for larger homes (each additional square foot is worth more when the house is already big). If \(\beta_2 < 0\), the curve bends downward (concave), implying diminishing returns: price still rises with size at first, but the slope shrinks as size grows, so extra square footage adds less and less to price (the curve can even peak and turn down if the turning point falls within the observed range of sizes, indicating an optimal size beyond which price falls). If \(\beta_2 = 0\), it reduces to the simple linear case.

In practice, to see if a quadratic improves the model, we include the squared term and check its significance. Using R, we could do:

model_linear <- lm(price ~ living_area, data = houses)
model_quad   <- lm(price ~ living_area + I(living_area^2), data = houses)
summary(model_quad)

Here I(living_area^2) creates the square of living area as a term. We would look at the coefficient for the squared term:

  • If \(\beta_2\) is significant (p < 0.05), we conclude the quadratic term provides a better fit, confirming a non-linear relationship. The sign tells us the direction of curvature (positive for convex, negative for concave).
  • If \(\beta_2\) is not significant (p > 0.05), the quadratic may not be needed, and a simpler linear term is sufficient.

Interpretation: In a model \(Y = \beta_0 + \beta_1 X + \beta_2 X^2\), the effect of X on Y is not constant; the instantaneous slope (derivative) at a given X is \(\frac{dY}{dX} = \beta_1 + 2\beta_2 X\). This means the effect of a one-unit increase in X depends on the current value of X. For example, if \(\beta_1 > 0\) and \(\beta_2 < 0\), Y increases with X at first (for small X) but the slope decreases as X grows, potentially hitting zero at some turning point \(X^* = -\beta_1/(2\beta_2)\) (the peak) and then turning negative. Conversely, \(\beta_2 > 0\) means an accelerating increase – the slope gets larger as X increases.
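A small sketch of this slope calculation in R, using the model_quad object fitted above (the coefficient names follow R’s default labeling for a formula containing I(living_area^2)):

b <- coef(model_quad)
slope_at <- function(area) b["living_area"] + 2 * b["I(living_area^2)"] * area
slope_at(c(1000, 2000, 3000))   # estimated change in price per extra sq.ft at each size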

In our house price scenario, suppose the quadratic model yields:

\[ Price = -50,000 + 150 \times \text{Area} + 0.02 \times \text{Area}^2. \]

Here \(\beta_1 = 150\) and \(\beta_2 = 0.02 > 0\), indicating a convex curve. At 1000 sq.ft, the slope is \(150 + 2(0.02)(1000) = 150 + 40 = 190\), so around 1000 sq.ft each extra square foot adds about $190 to the price. At 3000 sq.ft, the slope is \(150 + 2(0.02)(3000) = 150 + 120 = 270\), or about $270 per extra sq.ft. So larger homes are adding more value per sq.ft than smaller homes – an accelerating effect. If instead \(\beta_2\) were negative, we’d find the slope gets smaller as area increases, reflecting diminishing returns (e.g., very large houses might add less value per extra sq.ft, perhaps due to a limited buyer pool or redundant space).

As a rule of thumb, always visualize or check residual plots when using a linear model. If you see a curved pattern in the residuals (e.g., residuals starting high, going low in middle, high again – a U-shape), that’s a sign a linear term is missing curvature. A quadratic (or another nonlinear transformation) might be warranted. Including polynomial terms like \(X^2\) – or even \(X^3, X^4\) for more complex curves – can substantially improve fit, but beware of overfitting with high-order polynomials. In many cases, a quadratic (second-order) term is enough to capture basic curvature in relationships.
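Two quick checks along these lines, assuming the model_linear and model_quad objects from above: a residuals-vs-fitted plot to look for leftover curvature, and a nested-model comparison (a test we return to in Section 9.3):

plot(model_linear, which = 1)    # residuals vs fitted; a U-shaped pattern suggests missing curvature
anova(model_linear, model_quad)  # does adding the squared term significantly improve fit?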

2. Categorical Independent Variables and Dummy Coding

Many regression models include categorical predictors – for example, industry sectors, regions, or yes/no attributes (like whether a firm has a diversity policy). Since regression requires numerical inputs, we represent categories through dummy variables (a.k.a. indicator variables). A dummy variable is a binary (0/1) variable indicating the presence of a category.

How to create dummy variables: For a categorical variable with k possible categories, we create k-1 dummy variables. Each dummy represents one category (coded 1 if an observation is in that category, 0 if not). We leave out one category as the reference group to avoid redundancy. If we included k dummies for k categories along with an intercept, we’d fall into the dummy variable trap, which is the perfect multicollinearity that arises because the dummies would sum to a vector of all 1’s (completely collinear with the intercept). By omitting one category, we ensure the dummies are linearly independent.

Example – Gender: Suppose we have employee data with a categorical variable for Gender (Male, Female). We create a dummy Female that equals 1 for females and 0 for males. (Equivalently, we could create Male, but typically one category is enough since Male = 1 - Female in this binary case; we choose one reference, say Male as reference and Female as dummy). Now consider a model:

\[ \text{Salary} = \beta_0 + \beta_1 \times \text{Female} + \varepsilon. \]

Here, \(\beta_0\) is the intercept, which will represent the mean salary for the reference group (males, since Female=0 for males). \(\beta_1\) is the coefficient for the Female dummy; it represents the difference in salary between females and males. Specifically, it’s how much being female (Female=1) shifts the salary relative to males (Female=0). If \(\beta_1\) comes out negative, it indicates females earn \(|\beta_1|\) less than males on average (controlling for no other variables in this simple model). If positive, it means females earn more on average. For instance, if \(\beta_0 = \$100k\) and \(\beta_1 = -\$5k\) with p < 0.05, we interpret that as: the average male salary is $100k, and females earn $5k less than males on average, a statistically significant gap.

In regression output, one group is implicitly the baseline (when all dummies = 0). In our example, Male is the baseline; the intercept \(\beta_0\) is the predicted salary for a male. The coefficient \(\beta_1\) for Female tells us how much to add or subtract if the person is female.

Avoiding the dummy variable trap: Always use only k-1 dummies for k categories. For example, if Education Level has four categories (High School, Bachelor’s, Master’s, PhD), and we choose High School as the reference, we include dummies for Bachelor, Master, PhD. Including all four would cause perfect multicollinearity (because the four dummies would sum to 1 for each observation, duplicating the intercept). If you accidentally include all categories’ dummies and an intercept, most software will automatically drop one to fix the collinearity or will throw an error.

Interpreting coefficients with dummy variables: Each dummy’s coefficient is the estimated difference between that category and the reference category, holding other variables constant. For example, say we model salary by education:

\[ \text{Salary} = \beta_0 + \beta_1(\text{Bachelor}) + \beta_2(\text{Master}) + \beta_3(\text{PhD}) + \ldots \]

(with High School as omitted reference). Here:

  • \(\beta_0\) = expected salary for the reference group (High School grads, and also assuming any other numeric covariates are zero if present).
  • \(\beta_1\) = difference in salary for Bachelor’s vs High School. If \(\beta_1 = \$10k\), Bachelors earn $10k more than High School grads, on average.
  • \(\beta_2\) = difference for Master’s vs High School.
  • \(\beta_3\) = difference for PhD vs High School.

If \(\beta_3 = \$15k\), that means PhD holders earn $15k more than High School grads, ceteris paribus. We could also compare PhD to Bachelor by looking at \(\beta_3 - \beta_1\), but that difference isn’t directly in the output – you’d have to calculate it and possibly test it separately if needed. The regression only gives comparisons to the reference.

One must be careful interpreting the intercept in the presence of dummies. \(\beta_0\) is the expected outcome for an observation in the reference category (and at zero for all other covariates if any). Sometimes that scenario may be hypothetical or not of primary interest (e.g., a High School grad with zero years of experience if experience was another variable). So, focus on the differences indicated by dummy coefficients, as they are usually more meaningful.

Software note: Most statistical software will automatically handle categorical variables. For instance, in R, if education is a factor with levels {HighSchool, Bachelor, Master, PhD}, running lm(salary ~ education) will create three dummy variables internally (if HighSchool is reference by default, which R usually picks alphabetically unless set otherwise). The output will show coefficients for Bachelor, Master, PhD (each relative to HighSchool). Always check which category was treated as the baseline (it’s often listed or obvious from context) so you interpret correctly. In Python’s statsmodels or sklearn, you often have to create dummies manually or use one-hot encoding with a drop of one category. Excel’s regression tool also requires manual coding of dummies.
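A hedged sketch of the R workflow just described, using a hypothetical employees data frame with salary and education columns (the names are illustrative):

employees$education <- factor(employees$education,
                              levels = c("HighSchool", "Bachelor", "Master", "PhD"))
employees$education <- relevel(employees$education, ref = "HighSchool")  # set the reference group explicitly

m_edu <- lm(salary ~ education, data = employees)
summary(m_edu)   # coefficients for Bachelor, Master, PhD, each relative to HighSchool

head(model.matrix(~ education, data = employees))   # view the 0/1 dummies R builds internally

Re-running the model after releveling to, say, Bachelor would give the PhD-vs-Bachelor comparison directly, which is one way to obtain the contrast mentioned earlier.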

In summary, dummy variables let us include qualitative factors in regression. They shift the intercept for different groups. By comparing dummy coefficients, we can test for differences between categories. For example, a significant positive coefficient on PhD might indicate a PhD yields a salary premium over a high school education.

3. Interaction Terms (Effect Modification)

An interaction effect occurs when the impact of one independent variable on the dependent variable depends on the level of another independent variable. In other words, it’s a situation of “it depends.” We include an interaction term in a regression model as the product of two variables (X₁ * X₂) to allow their effects to be interdependent.

When to consider interactions: If you have two factors and you suspect that the effect of one differs based on the value of the other, you should include an interaction. Common examples:

  • The effect of experience on salary might depend on education level (perhaps additional years of experience boost pay more for those with higher education).
  • In marketing, the effect of advertising spend on sales might depend on whether a competitor is also advertising heavily.
  • In our context, maybe the effect of age on bonus might differ by gender (perhaps age is associated with higher bonuses for men but not for women, or vice versa).

Ignoring an important interaction can lead to misleading conclusions. If an interaction is present but not modeled, one might conclude “no effect of X₁” when in fact X₁ has an effect for certain values of X₂ but not others.

Example – Gender and Age on Bonus: Let’s revisit the example of employee bonus pay (Y) with predictors age (a continuous variable) and gender (a dummy: Female=1, Male=0). Suppose a preliminary additive model (no interaction) is:

\[ \text{Bonus} = \beta_0 + \beta_1 \text{(Age)} + \beta_2 \text{(Female)} + \varepsilon. \]

Here, \(\beta_1\) would be the change in bonus for each additional year of age (assumed the same for men and women), and \(\beta_2\) would be the gender difference in bonus (assumed the same at all ages). This model says, for example, if \(\beta_1 = -50\), each year of age reduces the bonus by $50 (maybe reflecting that younger employees get bigger performance bonuses). If \(\beta_2 = 700\), then females, on average, get a $700 higher bonus than males of the same age; in this additive model that gap is forced to be identical at every age, which is exactly the restriction an interaction term relaxes.

However, it might be that the gender gap varies with age – an interaction effect. Perhaps younger women get significantly lower bonuses than young men (negative gap), but among older employees the gap closes or reverses. To test this, we include an interaction between Age and Female:

\[ \text{Bonus} = \beta_0 + \beta_1 \text{(Age)} + \beta_2 \text{(Female)} + \beta_3 \text{(Age} \times \text{Female)} + \varepsilon. \]

In this model:

  • \(\beta_3\) is the key interaction term coefficient. It tells us how the slope of age differs for females compared to males.
  • For males (Female = 0), the equation reduces to \(\text{Bonus} = \beta_0 + \beta_1 \text{Age}\). So for men, \(\beta_1\) is the change in bonus per year of age.
  • For females (Female = 1), plug in Female=1: \(\text{Bonus} = \beta_0 + \beta_2 + (\beta_1 + \beta_3)\text{Age}.\) The intercept for women is \(\beta_0 + \beta_2\) (bonus at age 0), and the slope with respect to age is \(\beta_1 + \beta_3\). So \(\beta_3\) represents the difference in the age slope between women and men.

In our hypothetical results, say we get: \(\beta_1 = -50\) (age effect for men), \(\beta_2 = +700\) (female effect when age=0), and \(\beta_3 = -15\). The equations would be:

  • Men: \(\hat{Bonus}_{men} = \beta_0 - 50 \times \text{Age}.\)
  • Women: \(\hat{Bonus}_{women} = (\beta_0 + 700) + (-50 - 15) \times \text{Age}.\) Simplifying, women’s age slope is \(-65\) (\(-50 + \beta_3\)). This means each additional year of age is associated with a $50 decrease in men’s bonuses, but a $65 decrease in women’s bonuses. The negative \(\beta_3 = -15\) indicates that the decline with age is steeper for women by $15 per year. If we test \(\beta_3\) and it is significant (p < 0.05), we conclude there is a statistically significant interaction: the gender gap in bonuses changes with age. If \(\beta_3\) is not significant (say p = 0.1), we might conclude there’s no strong evidence of interaction, and a simpler model without the interaction might suffice.
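In R, the interaction model can be fit with the * operator, which expands to the two main effects plus their product. A minimal sketch, assuming a hypothetical employees data frame with bonus, age, and a 0/1 dummy female (1 = female, 0 = male, as in the text):

m_int <- lm(bonus ~ age * female, data = employees)   # equivalent to age + female + age:female
summary(m_int)

b <- coef(m_int)
b["age"]                    # age slope for men (female = 0)
b["age"] + b["age:female"]  # age slope for women (female = 1)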

Interpretation: When an interaction is present, you cannot interpret main effects in isolation. In the example above, \(\beta_2 = 700\) was the “female” coefficient, but that is the gender difference only at age 0 (because at age=0, the predicted bonus difference = 700). Age 0 is outside the realistic range here (no employees aged 0!), so \(\beta_2\) alone isn’t meaningful by itself. The combined effect matters: the gender gap at age 30 differs from the gap at age 20, and the interaction term governs how it changes. An interaction plot or calculation is needed. Often, one might say: “At age 30, the predicted bonus for men is … and for women is …; the gap at 30 is …; at age 50, the gap is …” and so on, to illustrate the interaction.

A useful way to understand interactions is to create an interaction plot. For a categorical-by-continuous interaction like this, you would plot bonus against age, drawing separate lines for men and women. If the lines are non-parallel (different slopes), that’s the interaction. In our case, both lines slope downward (bonuses drop with age for both), but the female line declines faster. At younger ages, maybe women have higher bonuses (if at age=20, female line starts above male line due to that +700 intercept difference, but then declines faster). At some age, the lines might cross (so beyond that age, men might have higher bonuses than women). This “it depends on age” nature is exactly what the interaction captures.
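A simple way to draw such a plot from the fitted interaction model (continuing the m_int sketch above) is to predict over a grid of ages for each gender and plot the two lines:

ages   <- 20:60
newdat <- rbind(data.frame(age = ages, female = 0),
                data.frame(age = ages, female = 1))
newdat$pred <- predict(m_int, newdata = newdat)

plot(pred ~ age, data = subset(newdat, female == 0), type = "l",
     ylim = range(newdat$pred), xlab = "Age", ylab = "Predicted bonus")
lines(pred ~ age, data = subset(newdat, female == 1), lty = 2)
legend("topright", legend = c("Men", "Women"), lty = 1:2)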

In general, significant interaction = “the effect of X₁ on Y depends on X₂.” Always describe it as such. For instance, using the estimates above, one might report: “We find a significant interaction between gender and age on bonus (p=0.03). The negative coefficient on the interaction term indicates that as employees age, the gender gap in bonuses shifts in favor of men. At age 30, the model predicts women earn about $250 more bonus than men, but by age 50, women earn about $50 less than men, holding other factors constant.” This kind of interpretation conveys the conditional relationship clearly.

It’s also worth noting interactions can exist between two continuous variables or two categorical variables as well:

  • Continuous × Continuous: e.g., the effect of advertising on sales might depend on price level. If you have an interaction between advertising spend and price, it means the slope of advertising changes at different prices.
  • Categorical × Categorical: e.g., the effect of a training program (yes/no) on productivity might depend on gender (male/female). That would be a two-way ANOVA style interaction.

Regardless of type, if an interaction is significant, focus on interpreting the combined effects, often using graphs or simple-slope analysis (effect of one variable at specific values of the other). Do not interpret main effects as if they were universal, because they are not – they are averages that mask the conditionality.

4. Model Accuracy and Validity

(This section applies to any regression model, whether simple or complex.) After building a model, especially with multiple predictors or special terms, it’s crucial to assess its accuracy and validity. Accuracy refers to how well the model fits the observed data, and validity refers to whether the model’s assumptions hold and if the model can generalize to new data. Here are key metrics and checks:

  • R-squared (R²): This is the proportion of variance in the dependent variable explained by the model. An R² of 0.80 means 80% of the variability in Y is accounted for by the X’s in the model. While higher R² indicates a better fit to the sample data, be cautious: adding more variables will always increase or at least not decrease R² (even if the added variables are irrelevant). Therefore, when comparing models with different numbers of predictors, use Adjusted R², which penalizes model complexity. Adjusted R² only increases if the new variable improves the model more than would be expected by chance. It can actually decrease if you add a variable that has little real contribution. In summary, R² is useful for description, but by itself it doesn’t tell you if the model is good or if it will predict new cases well.

  • Residual Standard Error (RSE): This is essentially the standard deviation of the residuals (the errors). It’s an estimate of \(\sigma\), the standard deviation of the true error term. It tells you, roughly, the average distance between the data points and the model’s predictions. For example, an RSE of 2.5 (in the units of Y) means the typical prediction error is about 2.5 (units of Y). Lower RSE indicates the model fits the data more closely. One should consider it relative to the scale of Y: an RSE of 2.5 might be great if Y ranges from 0 to 100, but terrible if Y ranges from 0 to 5. RSE can be used to construct confidence intervals for predictions, etc.

  • F-statistic (overall model test): This tests whether the model with all predictors provides a better fit than a model with no predictors (just an intercept). The null hypothesis is that all \(\beta\) coefficients (except the intercept) are zero. A model with at least one useful predictor will have a large F-statistic and a correspondingly small p-value (<< 0.05). A significant F-test means that, collectively, the predictors are associated with Y. If the F-test is not significant (p > 0.05), it means the model as a whole isn’t statistically better than a naive mean-only model, which implies none of the predictors have detectable effects (this could happen if sample size is small or effects are truly absent).

  • Diagnostic plots and tests: Always examine residual plots to check assumptions:

    • Plot residuals vs. fitted values: look for any systematic pattern. Any clear curve or structure suggests model mis-specification (e.g., missing a nonlinear term or an interaction). Ideally, residuals should be randomly scattered around 0.
    • Check for heteroskedasticity: if residuals’ spread grows or shrinks with fitted values, the constant variance assumption is violated. Formal tests include Breusch-Pagan or White’s test, but a visual inspection can suffice as a warning. If heteroskedasticity is present, you might use robust standard errors or transform the response.
    • Check for outliers or high-leverage points: Outliers are points with large residuals (model fits them poorly). High leverage points are those with extreme predictor values. Points that are both high leverage and outliers can unduly influence the model (check Cook’s distance or influence plots). If a single or few observations have disproportionate influence, assess if they are data errors or truly unusual cases – you might refit without them to see how results change, or use robust regression methods.
    • Normality of residuals: For large samples, this matters less (thanks to the Central Limit Theorem, the inference is okay even if residuals are slightly non-normal). But for small samples or if you need prediction intervals, you might check a Q-Q plot of residuals to see if they follow a straight line (which they should if normally distributed). Severe deviations (like very heavy tails or skewness) might suggest a transformation of Y or using a different error distribution model.
  • Multicollinearity check: If you have multiple predictors, check if some are highly correlated with each other. High multicollinearity (say, correlation > 0.8 or Variance Inflation Factor (VIF) > 10 for a predictor) can make coefficient estimates unstable (high standard errors, coefficients sensitive to small data changes). If present, you might remove or combine some variables, or just be cautious in interpretation (coefficients might not be well-identified). Note that adding polynomial terms or interactions inherently introduces correlation with the original terms (X and X^2 are correlated, for instance), so it’s expected to see higher VIFs, but that multicollinearity is “by design” and usually not problematic for prediction – it just means individual coefficients might be harder to interpret.

  • Out-of-sample validation: A model may fit the training data well, but the real test is how it performs on unseen data. If you have enough data, it’s a good idea to set aside a test set (or use cross-validation) to evaluate predictive performance. This guards against overfitting – when a model is too complex and starts modeling random noise in the training data as if it were a real pattern. Overfitting leads to poor generalization. We discuss this more in the next section, but as a diagnostic, you can compare metrics like R² or RMSE (Root Mean Square Error) on a training set vs a validation set. If performance drops a lot on the validation set, the model might be overfit.
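A minimal sketch of this train/validation comparison, reusing the hypothetical houses data from earlier (the split proportion and seed are arbitrary):

set.seed(123)
n         <- nrow(houses)
train_idx <- sample(n, size = round(0.8 * n))
train     <- houses[train_idx, ]
valid     <- houses[-train_idx, ]

fit  <- lm(price ~ living_area + I(living_area^2), data = train)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse(train$price, predict(fit))                   # training RMSE
rmse(valid$price, predict(fit, newdata = valid))  # validation RMSE; a much larger value hints at overfitting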

Overfitting caution: Overfitting occurs when a model memorizes noise in the training data instead of learning the true underlying relationship. Such a model will have very low error on the training data, but high error on new data. It’s like fitting a curve that goes through every training point exactly – it may wiggle through random fluctuations that won’t repeat. Overfitting is more likely if:

  • The model is overly complex relative to the amount of data (e.g., too many predictors or too high-degree polynomial with limited observations).
  • We’ve done a lot of data dredging – trying many models and picking the best by training metrics alone.
  • The data contain some outliers or noise that the model is contorting to fit.

To avoid overfitting, one strategy is to favor simpler models unless the complexity is clearly justified by significantly better fit and theory. Use techniques like cross-validation to estimate how the model will do on new data. Regularization methods (like LASSO or Ridge regression) add penalties for complexity and can also help prevent overfitting by shrinking coefficients, though those are advanced topics. In our context, if we add polynomial terms or interactions, we increase complexity, so we should be careful. We might start with simpler models and only add complexity if it significantly improves adjusted R² or validation-set performance.

Another practical tip: if you have far more predictors than observations, overfitting is almost guaranteed; you’d need to use dimension reduction or regularization in such cases, rather than ordinary regression.

In summary, always check that your model is not just an artifact of peculiarities in your dataset. A model that is both accurate on training data and valid in generalization is the goal.

9.3 Comparing Models

Often in research, we try multiple models – adding or removing predictors, trying different functional forms – and we need to determine which model is “best” or most appropriate. This section covers methods for comparing models and selecting a parsimonious model that still explains the data well. Different techniques apply depending on whether models are nested or non-nested:

1. Nested Model Comparisons (Partial F-test)

Two models are nested if one is a special case of the other. In other words, the smaller (restricted) model can be obtained from the larger (full) model by imposing some coefficients to be zero. For example:

  • Model A: \(Price = \beta_0 + \beta_1 \text{Bedrooms}\).
  • Model B: \(Price = \beta_0 + \beta_1 \text{Bedrooms} + \beta_2 \text{Bathrooms} + \beta_3 \text{LivingArea}.\)

Here Model A is nested within B (Model B adds two extra variables). We want to test if those extra variables significantly improve the fit.

The appropriate test is a partial F-test (also known as an F-test for nested models). The hypotheses:

  • H₀: The additional parameters (β₂, β₃ in our example) are equal to zero. (The simpler Model A is sufficient; the added variables have no effect.)
  • H₁: At least one of the additional parameters is non-zero (the full Model B fits better).

We calculate the F-statistic by comparing the Residual Sum of Squares (RSS) of the two models and their degrees of freedom. One formula is:

\[ F = \frac{(\text{RSS}_{\text{restricted}} - \text{RSS}_{\text{full}})/(df_{\text{restricted}} - df_{\text{full}})}{\text{RSS}_{\text{full}}/df_{\text{full}}}, \]

where df refers to the residual degrees of freedom (roughly, n minus number of parameters). Intuitively, the numerator is the drop in RSS when going from the smaller to the bigger model (scaled by how many extra parameters were used), and the denominator is the RSS per degree of freedom in the full model (an estimate of noise variance). If the extra variables explain a lot of residual variance, RSS_full will be much smaller than RSS_restricted, making F large. If they explain very little, RSS_full will be only a little smaller, making F close to 1.

In practice, statistical software can do this easily. In R, we use the anova() function:

modelA <- lm(Price ~ Bedrooms, data = houses)                            # restricted model
modelB <- lm(Price ~ Bedrooms + Bathrooms + LivingArea, data = houses)   # full model
anova(modelA, modelB)   # partial F-test on the added predictors

The output (ANOVA table) might look like:

  Res.Df   RSS Df Sum of Sq    F  Pr(>F)    
1     97 12000                              
2     95  9000  2      3000 15.8 0.00001 ***

Interpretation: Model A had residual degrees of freedom 97 and RSS = 12000. Model B has df = 95 and RSS = 9000. The difference in RSS is 3000 over 2 extra parameters, yielding F = 15.8 and p = 0.00001. This very low p-value indicates that adding Bathrooms and LivingArea significantly improved the model’s fit. We reject H₀ and conclude Model B is superior. In words, there is extremely strong evidence that at least one of Bathrooms or LivingArea contributes to explaining house price beyond Bedrooms alone.
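As a check, the F statistic can be reproduced directly from the numbers in the table using the formula above:

((12000 - 9000) / 2) / (9000 / 95)   # = 1500 / 94.7, approximately 15.8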

The F-test for nested models is essentially asking: “Does the more complex model reduce the unexplained variance enough to justify its additional complexity?” If yes (significant F), go with the complex model; if not, stick with the simpler model.

Note: The F-test requires that Model A is truly a subset of Model B and that both are fitted to the same dataset. You cannot compare models with different dependent variables, or where one has observations that the other doesn’t (due to missing data or intentional sample differences). Also, the test assumes the models are estimated via least squares and that the larger model’s assumptions hold (e.g., if you added a term that breaks linearity assumption, the test might not be valid in a strict sense). But in linear regression contexts, it’s standard.

Interpreting coefficient changes: It’s common to see coefficients shift when new variables are added. For instance, in our housing example, the coefficient on Bedrooms in Model A might have been high (because Bedrooms was partly proxying for house size). In Model B, once LivingArea is included, the Bedrooms coefficient might drop or even change sign because now it’s being interpreted as “holding living area constant, adding a bedroom (which might mean smaller other rooms or fewer common spaces)”. This doesn’t mean the original model was “wrong” – it just had an omitted variable, so the Bedrooms effect was conflated with overall size effect. The larger model provides a more nuanced story: one bedroom in a fixed-size house might actually reduce value (if it means cramping the layout), whereas in Model A it appeared positive because more bedrooms usually meant a bigger house overall. This underscores that context and theory should guide model building: statistical tests tell you if extra variables improve fit, but you as the analyst must decide if the more complex model makes sense and answers the research question better.

2. Non-Nested Model Comparisons (AIC, BIC)

Sometimes we have models that are not nested. For example, Model C might use a quadratic term instead of a linear term, or a different combination of variables altogether. You cannot use an F-test here because one model isn’t a restricted version of the other – they are alternative formulations. In such cases, we rely on information criteria like Akaike’s Information Criterion (AIC) or Bayesian Information Criterion (BIC) to compare models.

These criteria provide a score balancing model fit and complexity:

  • AIC is defined as \(\text{AIC} = -2\ln(L) + 2p\), where \(L\) is the likelihood of the model (a measure of fit, higher is better fit, so \(-2\ln L\) is like a penalty for bad fit) and \(p\) is the number of parameters (which is a penalty for complexity). The model with the lowest AIC is considered best. AIC is grounded in information theory; roughly speaking, it estimates the relative information loss when using a model to represent the true data-generating process. It is also related to out-of-sample prediction: the model with the lowest AIC is expected to perform best on new data on average. One important thing: AIC values by themselves are not meaningful, only differences in AIC between models matter. A rule of thumb: if Model1 has AIC 100 and Model2 has AIC 104, Model1 has substantially more support (ΔAIC = 4); if ΔAIC is 0–2, models are about equally good; 4–7 indicates moderate evidence against the higher AIC model; >10 indicates essentially no support for the higher AIC model.

  • BIC (Schwarz Criterion) is similar: \(\text{BIC} = -2\ln(L) + (\ln n) p\), where \(n\) is sample size. BIC imposes a larger penalty for complexity when n is large, because \(\ln(n)\) > 2 for n > 7. So for typical moderate or large samples, BIC heavily favors simpler models compared to AIC. The model with lowest BIC is preferred. BIC has a more direct Bayesian interpretation (it approximates the log of the Bayes factor for model comparison). In practice, one often sees BIC select a smaller model than AIC does.

Information criteria allow comparison of any models estimated on the same dataset, even if they are not nested. For example:

  • Model B (linear) vs Model C (quadratic) for our house price example. These are nested (linear is special case of quadratic with β₂=0), so we could use F-test. But imagine another scenario:
  • Model X: uses predictors A, B, C.
  • Model Y: uses predictors A, B, D.

Neither model’s predictor set is a subset of the other (C vs D differ), so they are non-nested. We can’t do an F-test. We can compare AIC/BIC though. Suppose:

  • Model X: AIC = 200, BIC = 215.
  • Model Y: AIC = 198, BIC = 210.

Model Y has a slightly lower AIC (198 vs 200) and BIC (210 vs 215), so Model Y would be preferred by both criteria. The lower AIC suggests Model Y is expected to predict a bit better on new data; the lower BIC suggests that, even with the stronger complexity penalty, Model Y comes out ahead. That said, a ΔAIC of only 2 is borderline by the rule of thumb above: the two models have roughly comparable support, so we would treat Model Y as the marginally better candidate rather than a decisive winner.
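Obtaining these criteria in R is straightforward. A hedged sketch with hypothetical models built from predictors A, B, C, D in a data frame dat (all names illustrative):

modelX <- lm(outcome ~ A + B + C, data = dat)
modelY <- lm(outcome ~ A + B + D, data = dat)

AIC(modelX, modelY)   # lower AIC is preferred
BIC(modelX, modelY)   # lower BIC is preferred (heavier complexity penalty)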

One caution: AIC and BIC are relative measures; a single model’s AIC is not interpretable in isolation. Also, they don’t tell you if a model is “good” in an absolute sense, just which is better among the candidates. It’s possible all models have poor fit but one is just less poor than the others.

In summary, use AIC/BIC when choosing among different model specifications, especially for prediction. If the goal is pure prediction accuracy, AIC is often favored (it’s essentially equivalent to leave-one-out cross-validation for large samples). If the goal is more explanatory and you want to be cautious about overfitting, BIC’s stricter penalty might steer you to a simpler model.

3. Automatic Model Selection and Stepwise Regression

When you have many potential independent variables and you’re not sure which ones to include, a common (but debatable) approach is stepwise regression. This is a greedy algorithm that adds or removes predictors iteratively based on some criterion (often AIC, BIC, or p-values).

  • Forward selection: start with no predictors, then add the predictor that most improves the model (e.g., lowest AIC or highest F-statistic) one by one, stopping when none of the remaining predictors improve the model significantly.
  • Backward elimination: start with all candidate predictors, then remove the least useful predictor one by one (highest p-value or smallest drop in AIC when removed), until removing any further hurts the model.

There is also a hybrid stepwise (both) which adds and removes as needed (this is the default in some software).

Example: Suppose we have 10 potential predictors for house prices (location, size, age of house, number of bedrooms, number of bathrooms, has garage or not, lot size, condition rating, etc.). We can use stepwise selection to find a simpler model if not all are needed. In R, one could do:

# Full model with all candidate predictors (X and Y stand in for two further predictors)
full_model <- lm(price ~ location + size + age + bedrooms + bathrooms + garage + lot + condition + X + Y, data = houses)
step_model <- step(full_model, direction = "backward", k = 2)  # k = 2 gives the AIC penalty (the default)
summary(step_model)

This will start at the full model and try dropping each variable to see the effect on AIC. It will remove the one that yields the largest improvement in AIC (decrease in AIC) if any, then repeat from that new model, and so on. The result might be a model with, say, location, size, and bathrooms remaining. The algorithm would have determined that other variables did not provide enough additional explanatory power (relative to the AIC penalty) to be worth keeping.

Pros: Stepwise can be useful for variable selection when dealing with very many predictors, providing a quick way to narrow down candidates. It’s computationally cheap (though it doesn’t guarantee finding the absolute best subset if there are many predictors; it finds a local optimum). It’s also easy to use and understand.

Cons: There are serious concerns with stepwise:

  • It ignores the fact that we are doing multiple hypothesis tests. Each step is effectively trying different model configurations. The reported p-values in the final model don’t reflect that we “shopped around” for the best model. Thus, they are overly optimistic (the true chance of seeing such an extreme t-statistic under null is higher than reported because we implicitly tried many models).
  • It can yield different final models depending on which criterion is used or minor fluctuations in data. It’s not very stable, especially if predictors are correlated – several models might have similar performance, and stepwise will just pick one path.
  • It might miss the optimal combination if a predictor that is individually weak becomes important in combination with others (forward selection might never include it because by itself it didn’t look good).
  • It tends to overfit if not properly tuned, because it will keep adding predictors until the criterion no longer improves. Using AIC is somewhat better in that it explicitly penalizes complexity (so it stops when overfitting would start to hurt AIC).

Because of these issues, some statisticians are very critical of stepwise selection. They argue it can inflate Type I errors (finding false positives) and produce R² that are biased high. The model uncertainty is not accounted for.

However, in practice, stepwise can be a quick tool for exploratory analysis. If used, it should be coupled with validation. For example, one might use stepwise on a training set and then test the chosen model on a validation set to see if it generalizes.

Inference after selection: If you do a stepwise procedure and then present the final model’s coefficients and p-values as if you had hypothesized that model from the start, you’re technically cheating a bit. The correct inference would require adjustment for the selection process (which is complicated). A simpler approach: treat the final model’s results as descriptive, and perhaps validate key coefficients with a fresh sample if possible. Do not place too much trust in the exact p-values post-selection – they’re likely overly optimistic because of the multiple comparisons that occurred implicitly.

Alternative approaches: Modern methods like LASSO (Least Absolute Shrinkage and Selection Operator) are often preferred for variable selection. LASSO does continuous shrinking of coefficients and can set some exactly to zero, achieving a selection effect, but it’s less variable than stepwise and has built-in penalties to reduce overfitting. It’s beyond our scope, but worth noting. Also, sometimes all-subsets regression (trying every combination of predictors) with criteria like adjusted R², AIC or BIC is feasible for moderate p (like < 15 predictors) – this guarantees finding the best combination by brute force, but it’s computationally expensive for large p.

In summary, stepwise can be used as a heuristic to find a reasonably good model, especially when dealing with many predictors. Just be transparent that this was a data-driven model selection, and treat the results with some caution. If possible, report that “the model was selected using stepwise AIC” and perhaps do a robustness check that the main conclusions hold if a couple of alternative models are used.

4. Model Validation

Once a model is chosen (whether by theory or selection procedures), it’s wise to perform model validation to ensure it will perform well and is not overfit:

  • Train/Test split: If you have a decent-sized dataset, one straightforward approach is to split it into a training set (e.g., 70-80% of data) and a test set (20-30%). Build the model on the training data, then use it to predict the outcomes for the test data. Compute error metrics like Mean Squared Error (MSE) or Mean Absolute Error (MAE) on the test set. Compare with the training error. If the test error is only slightly higher, the model generalizes well. If it’s much higher, you likely overfit the training data. In classification contexts, one might look at accuracy or ROC curves similarly.

  • Cross-Validation (CV): When data is limited, k-fold cross-validation is very useful. For example, 5-fold CV means: split the data into 5 roughly equal parts, use 4 parts to train and 1 part to test, and repeat this 5 times, rotating the test part. Average the test errors. This gives a more stable estimate of model performance on new data without wasting too much data on a single holdout. You can use cross-validation to compare models (choose the one with the lowest CV error) or to estimate how well your final model will likely do in practice (a minimal base-R sketch appears after this list).

  • Assumption checks: We mentioned diagnostic plots earlier – those are also part of validation. If assumptions like homoskedasticity or independence are violated, you might need to use robust methods or adjust the model (e.g., adding a variable you missed, transforming Y, etc.). In panel data, for instance, one often clusters standard errors by entity if there is autocorrelation within entities over time.

  • Stability checks: Sometimes drop one observation (or one entity in panel data) and refit the model – see if any coefficient changes drastically. That could reveal if your results hinge on one data point. Similarly, if you have many predictors, check for multicollinearity issues as mentioned, because unstable coefficients under collinearity mean the model may not predict well for new combinations of X.

  • Domain knowledge: Finally, validate the model against theoretical expectations or known results. Does the sign of each effect make sense? If something is counter-intuitive, is there a plausible explanation or could it be a spurious result? This kind of sanity check can save you from presenting a seemingly strong empirical result that is actually an artifact of data quirks.
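As referenced in the cross-validation item above, here is a minimal 5-fold cross-validation sketch in base R (no extra packages assumed), again using the hypothetical houses data:

set.seed(42)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(houses)))

cv_rmse <- sapply(1:k, function(i) {
  fit  <- lm(price ~ living_area + I(living_area^2), data = houses[folds != i, ])
  pred <- predict(fit, newdata = houses[folds == i, ])
  sqrt(mean((houses$price[folds == i] - pred)^2))
})
mean(cv_rmse)   # average out-of-fold RMSE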

The goal of validation is to ensure your model is not only fitting the current data but would be reliable in a new sample from the same population. In academic research, you often don’t get fresh data to test on (besides maybe out-of-sample time periods), but you can still do internal validation like cross-validation. In business or applied contexts, model validation is critical – for example, a predictive model might be tested in a pilot phase before full deployment.

By validating, you protect yourself against being misled by coincidences in the data and increase confidence that the patterns you’ve found are real and replicable.

9.4 Conclusion

In this chapter, we extended basic regression analysis to more complex scenarios common in IB research (and many other fields). We learned how to handle panel data by choosing between fixed-effects and random-effects models to control for unobserved heterogeneity across entities (like countries or firms) over time. We discussed how the Hausman test helps decide between FE and RE, and why controlling for entity-specific constants can be crucial in avoiding bias. We then explored adding quadratic terms to capture non-linear relationships, seeing that a significant quadratic indicates a curve in the data rather than a straight line. We showed how to include dummy variables to represent categorical predictors, why one must avoid the dummy trap by using k-1 dummies, and how to interpret dummy coefficients as differences from a reference group. We introduced interaction terms, emphasizing that they allow “it depends” effects and that a significant interaction means main effects cannot be interpreted in isolation – the combined effect must be understood.

Finally, we discussed model selection and validation. For comparing models, we saw that nested models can be evaluated with an F-test (looking at whether additional predictors significantly reduce residual variance). For non-nested models or when focusing on prediction, information criteria like AIC and BIC provide a convenient metric to balance fit and complexity, with lower values indicating a better trade-off. We touched on stepwise selection procedures as a way to search through many predictors, while cautioning about their pitfalls in terms of inference and overfitting. We stressed the importance of checking model assumptions and validating the model on new data or via cross-validation to ensure it generalizes – guarding against the ever-looming risk of overfitting.

By mastering these tools and considerations, you are better equipped to specify regression models that align with your research questions and data properties. In international business research, data often come in panel form (country-year panels, firm-year panels) and involve many potential influences on outcomes – making these techniques especially relevant. A well-fitted model that respects the data structure and avoids overfitting will yield insights that are both reliable and substantively meaningful. Remember that statistical modeling is as much an art as a science: the best model is not just the one with the highest R² or lowest AIC, but the one that makes sense, answers your question, and can stand up to scrutiny and new data. With the foundation from this chapter, you can approach empirical analyses that involve complex data structures and multiple variables with confidence and rigor.

References

  1. Panel Data and Fixed/Random Effects: Fixed and random effects of panel data analysis – UK Essays (2015). Explanation of FE vs. RE assumptions and the rationale for the Hausman test.
  2. Dummy Variables in Regression: StatTrek Tutorial – Dummy Variables in Regression. Defines dummy (indicator) variables and warns against the “dummy variable trap” of using redundant dummies. Also explains interpretation of dummy coefficients vs. a reference group.
  3. Interaction Effects: Jim Frost (2020) – Understanding Interaction Effects in Statistics (Statistics By Jim). Explains that a significant interaction means you cannot interpret main effects alone and uses the phrase “it depends” to describe interactions. Also discusses plotting interactions to visualize non-parallel lines.
  4. Overfitting Definition: Montesinos-López et al. (2022). Multivariate Statistical Machine Learning Methods for Genomic Prediction, Chapter 4. Describes overfitting as when a model learns the training data (including noise) so well that it performs poorly on unseen data, and contrasts it with underfitting.
  5. Model Selection and AIC: Cross Validated (StackExchange) answer by user meh (2018) on AIC vs R². States that AIC estimates out-of-sample prediction error – the model with the lowest AIC is expected to predict best on new data (unlike R² which just measures fit to training data).
  6. F-test for Nested Models: Duke University tutorial (archived). Using the F-test to Compare Two Models. Provides the formula for the partial F-test and an example of comparing a simple vs. full model. Illustrated by an R-bloggers example where adding group dummy variables significantly reduced RSS.
  7. Quadratic Term in Regression: George Choueiry (2022). Why Add & How to Interpret a Quadratic Term in Regression (Quantifying Health blog). Explains that adding \(X^2\) can model curvature and that the slope at a given X is \(\beta_1 + 2\beta_2 X\). Suggests checking quadratic term’s p-value to confirm non-linearity.
  8. Stepwise Selection Caution: Lecture notes by Frank Harrell (Vanderbilt University) & J. F. Knight (JHU Biostat) on model selection. Emphasize that p-values after stepwise should not be taken at face value due to extensive multiple testing. Model selection can inflate the significance of remaining predictors, so results must be validated on new data for credibility.