13  Structural Equation Modeling (SEM)

Chapter Overview: This chapter provides a comprehensive introduction to Structural Equation Modeling (SEM) using R. We begin with the theoretical foundations of SEM, including its historical origins in path analysis and factor analysis, and the unification of these ideas into modern SEM. We then detail the components of SEM – path analysis models (relationships among observed variables) and measurement models (latent variable models such as confirmatory factor analysis) – and show how they combine into full SEMs. Practical examples using the lavaan package in R are interwoven to illustrate model specification, estimation, and evaluation, including code for fitting models and producing path diagrams. Key issues of model identification, parameter estimation techniques, and assessing model fit are addressed in depth. We discuss how SEM can contribute to causal inference and model-based reasoning in the social sciences, emphasizing the assumptions required for causal interpretations. Throughout, we highlight best practices for transparency and reproducibility in SEM analysis (e.g., using R Markdown to integrate code and narrative) and we conclude by considering the advantages and limitations of SEM in applied research. Inline citations are provided in APA 7th edition style, and a full reference list is included at the end of the chapter.

13.1 Introduction and Theoretical Overview

Structural Equation Modeling (SEM) is a broad family of multivariate statistical techniques that merges aspects of factor analysis and systems of regression equations. It allows researchers to specify and test complex models involving multiple relationships between observed and latent variables, often depicted in the form of path diagrams. SEM has become a cornerstone of advanced data analysis in social and behavioral sciences, as well as in other fields like education, business, and epidemiology (Bollen, 1989; Kline, 2016). By a standard definition, SEM is “a class of methodologies that seeks to represent hypotheses about the means, variances, and covariances of observed data in terms of a smaller number of ‘structural’ parameters defined by a hypothesized underlying conceptual or theoretical model” (Kaplan, 2001). In simpler terms, an SEM expresses theoretical hypotheses (often causal relationships) among variables, encodes them into a statistical model, and then tests how well that model fits the observed data covariance structure.

SEM encompasses a variety of specific modeling techniques – including path analysis, confirmatory factor analysis (CFA), structural regression models, and more – under one general framework. A great strength of SEM is its ability to integrate measurement models (relationships between latent constructs and their observed indicators) with structural models (relationships among latent and/or observed variables) into a single cohesive analysis. This unity allows analysts to account for measurement error explicitly and to test complex theoretical systems in a confirmatory manner (Bollen, 1989). As Bentler (1980) famously remarked, SEM held “the greatest promise for furthering psychological science”, highlighting the early optimism that SEM would enable more rigorous tests of theoretical models in psychology and related fields. Over the decades, SEM has indeed become one of the most popular and powerful analytical approaches in the quantitative social sciences (Bentler, 1990; Kaplan, 2000).

Historical Origins: SEM as we know it evolved from two primary lines of research. The first is path analysis, introduced by geneticist Sewall Wright in the 1920s (Wright, 1934). Wright developed path coefficients to partition correlation into hypothesized direct and indirect causal effects among observed variables. Path analysis provided a way to encode causal assumptions in diagrams and equations, laying the groundwork for later SEM. The second origin is factor analysis, developed in psychology (Spearman, Thurstone) to model latent traits or factors that give rise to observed test scores. By the mid-20th century, researchers sought to combine these ideas – to model latent variables and allow relationships (regressions) among them. This synthesis was achieved in the 1970s by Jöreskog, Keesling, and Wiley, who independently formulated the modern linear structural equation model (the “Jöreskog–Keesling–Wiley” model) (Jöreskog, 1973; Keesling, 1972; Wiley, 1973). Jöreskog’s work, in particular, led to the development of LISREL, the first widely used SEM software. The general SEM as outlined by Jöreskog (1973) consists of two parts: (a) a structural model linking latent variables to each other via a system of simultaneous equations, and (b) a measurement model linking latent variables to observed indicator variables through confirmatory factor analysis. This framework unified path analysis and factor analysis, and it remains the basis for most SEM today.

Uses of SEM: In practice, SEM is used to test theoretical models in a variety of domains. Social scientists use SEM to evaluate models of attitudes, behaviors, or social processes that involve latent constructs (e.g., intelligence, socio-economic status, satisfaction) measured by multiple observed indicators. SEM is also employed for longitudinal data (to analyze change over time via latent growth models), multilevel data (combining SEM with hierarchical models), and for testing measurement invariance across groups (multi-group SEM). The flexibility of SEM allows specifying complex hypotheses, such as mediation (indirect effects), reciprocal relationships (with appropriate caution for identification), and latent interactions or nonlinear effects (with advanced techniques). Crucially, SEM encourages researchers to think carefully about causal structure and to represent their theories explicitly in a model – a practice that enhances model-based reasoning in the social sciences. While SEM models are often applied to observational (non-experimental) data, the approach is fundamentally rooted in causal modeling tradition: the arrows in an SEM diagram typically reflect hypothesized causal directions (Pearl, 2000; Bollen & Pearl, 2013). Indeed, one can view a structural equation model as a set of causal assumptions encoded in both equations and a path diagram. This does not by itself prove causation – a point we will revisit – but SEM provides a formal language to articulate causal hypotheses and to test whether data are consistent with those hypotheses.

SEM and Causality: A recurring theme in SEM is its connection (and tension) with causal inference. Historically, SEM emerged from attempts to infer causality from correlations under strong theoretical assumptions (Wright’s path analysis was explicitly a causal approach). Today, with the advent of modern causal inference frameworks (such as directed acyclic graphs and the do-calculus; Pearl, 2000), there has been renewed clarity on what SEM can and cannot do for causal claims. Bollen and Pearl (2013) note that SEM can yield estimates interpretable as causal effects if one invokes substantive assumptions that the model is correctly specified (no omitted confounders, correct directionality, etc.). In other words, SEM is a tool for causal modeling – it allows researchers to encode causal hypotheses and see if the data support them – but it does not prove causality solely from model fit. As Pearl (2012) emphasizes, “causal effects in observational studies can only be substantiated from a combination of data and untested theoretical assumptions, not from the data alone”. We will later discuss how to responsibly interpret SEM results in light of this principle, and how SEM complements other approaches to causal inference by enabling complex model-based reasoning (for example, modeling mediating pathways and latent confounders in a single framework).

Chapter Roadmap: The remainder of this chapter is organized as follows. In §13.2, we introduce the building blocks of SEM, starting with path analysis (SEM with observed variables only) and then measurement models for latent variables, culminating in full structural equation models that integrate both. We demonstrate each with R examples using the lavaan package (Rosseel, 2012). In §13.3, we delve into issues of model identification – a critical prerequisite for SEM – and explain how to ensure a model is identified (able to yield a unique solution). In §13.4, we discuss estimation methods for SEM (primarily maximum likelihood) and the assumptions they require, as well as alternatives for non-normal or categorical data. In §13.5, we cover how to assess model fit, including the chi-square goodness-of-fit test and popular fit indices (CFI, TLI, RMSEA, SRMR), with guidance on interpreting these metrics (Hu & Bentler, 1999; Kline, 2016). In §13.6, we address the interpretation of results: understanding parameter estimates (regression weights, factor loadings, variances), distinguishing direct and indirect effects, and interpreting latent variable scores and correlations. We also illustrate how to visualize SEM results (e.g., using path diagrams via the semPlot package; Epskamp, 2015) for better communication of model findings. §13.7 discusses how SEM contributes to causal inference and model-based reasoning, highlighting what conclusions can (and cannot) be drawn and emphasizing the importance of theoretical grounding. In §13.8, we outline best practices for transparency and reproducibility in SEM analyses – for instance, reporting complete model specification, sharing data and code, and using open-source tools (like R and R Markdown) to document the analysis pipeline. Finally, §13.9 summarizes the advantages of SEM (such as handling latent constructs and testing complex theories) and its limitations (such as potential identification problems, sensitivity to model misspecification, and the need for large samples), providing a balanced perspective for applied researchers.

Throughout this chapter, all examples will use R code in an R Markdown style. We assume the reader has R and the relevant packages (notably lavaan) installed. The data and models chosen for illustration reflect typical social science research scenarios, including a classic dataset on industrialization and democracy (Bollen, 1989) and a famous educational testing dataset (Holzinger & Swineford, 1939). By the end of the chapter, readers should understand both the theory behind SEM and the practical steps to implement SEM analyses in R, interpret the output, evaluate model fit, and report results in a transparent manner.

13.2 Components of SEM: Path Analysis, Measurement Models, and Full SEM

SEM models are built from two fundamental component types: (1) structural (path) models that relate variables to each other (often representing causal hypotheses among either observed or latent variables), and (2) measurement models that relate latent variables to their observed indicators. We introduce each component in turn, then discuss how they combine into a full SEM. In this section, we also demonstrate each type of model with R code using lavaan (a leading R package for SEM; Rosseel, 2012).

Path Analysis (Structural Model with Observed Variables)

Path analysis is the simplest special case of SEM, involving only observed (measured) variables. There are no latent factors in a pure path analysis model; instead, the model consists of a set of regression equations potentially with correlated residuals. Path analysis can be seen as an extension of multiple regression to accommodate multiple dependent variables and chains of influence (mediation). It enables modeling of direct and indirect effects among observed variables according to a theorized causal structure.

Example and Notation: Consider three observed variables X, M, and Y, where a researcher hypothesizes that X influences Y both directly and indirectly through M (a mediation model). This can be depicted in a path diagram with arrows \(X \to M\) and \(M \to Y\) (forming the indirect path \(X \to M \to Y\)), as well as a direct arrow \(X \to Y\). In SEM notation, we could write two structural equations:

  • \(M = a \cdot X + \zeta_M\),
  • \(Y = b \cdot M + c' \cdot X + \zeta_Y\),

where \(a, b, c'\) are path coefficients (regression weights) to be estimated, and \(\zeta_M, \zeta_Y\) are error terms (disturbances) for the equations. The indirect effect of X on Y through M is the product \(a \times b\), and the total effect is \(c' + a \times b\). Path analysis allows estimation of these effects simultaneously in one system.

In lavaan syntax, we can specify this mediation model as follows. We first simulate a simple dataset to illustrate the analysis:

# Simulate a simple mediation dataset
set.seed(123)
X <- rnorm(100)
M <- 0.5*X + rnorm(100, sd=1)
Y <- 0.7*M + 0.2*X + rnorm(100, sd=1)
simData <- data.frame(X = X, M = M, Y = Y)

# Specify the mediation model in lavaan syntax
med_model <- '
  # Structural (path) model
  M ~ a*X       # path from X to M with coefficient labeled a
  Y ~ b*M + c*X # paths from M to Y (b) and X to Y (c)
  
  # Define indirect and total effects using := 
  indirect := a*b
  total := c + a*b
'

In the model specification above, M ~ a*X denotes a regression of M on X, labeling that coefficient as a. Y ~ b*M + c*X denotes Y regressed on M and X, labeling those coefficients b and c respectively. We then use lavaan’s ability to define new parameters: the line indirect := a*b creates a new parameter named “indirect” equal to the product of a and b (i.e., the indirect effect \(X \to M \to Y\)), and total := c + a*b defines the total effect. These defined parameters will be computed after model fitting.

Next, we fit the model and request a summary:

library(lavaan)
fit_med <- sem(med_model, data = simData)
summary(fit_med, standardized = TRUE, ci = TRUE)

Let us interpret the (typical) output of such an analysis. The summary will include the estimates for paths a, b, and c (often called \(c'\) in mediation literature for the direct effect) along with their standard errors, z-values, and p-values. The output also lists the defined parameters “indirect” and “total” with their estimates. For example, we might see something like:

  • Estimate of a (X → M) around 0.47 (significant, p < .001),
  • Estimate of b (M → Y) around 0.79 (significant, p < .001),
  • Estimate of c (X → Y direct) around 0.04 (n.s., p ~ .73),
  • Indirect effect a*b ≈ 0.47*0.79 = 0.37 (significant, p < .001),
  • Total effect c + a*b ≈ 0.41 (significant, p < .01).

(These values are illustrative; actual estimates depend on the simulated random data. In the toy data generated above, the true indirect effect is \(0.5 \times 0.7 = 0.35\) and the true direct effect is 0.2.) The key point is that lavaan computes the indirect effect and its standard error (using the delta method or optionally bootstrapping), allowing us to formally test the mediation hypothesis. In our simulation, the indirect effect was significant while the direct effect was near zero, consistent with partial or full mediation.

Model Fit in Path Analysis: In a simple mediation with three variables and three paths as above, the model is just-identified (df = 0) because we have as many free parameters as there are non-redundant variances and covariances in the data. In our lavaan output, we see “Degrees of freedom: 0” and a chi-square test statistic of 0. A just-identified model will always fit the data perfectly (chi-square = 0), which means we cannot test overall fit – this model reproduces the covariance matrix exactly by definition. This is common in small path models; it is not problematic per se, but it means we rely on theory and the significance of paths rather than overall fit to evaluate the model. If we had an over-identified path model (df > 0), we would examine fit indices (we discuss these in §13.5).

Even when degrees of freedom are zero, path analysis is useful to estimate effects and their confidence intervals. However, one should be cautious not to “overfit” by including every possible path; models that are just-identified or saturated offer no test of whether the structure is correct. Often, researchers will omit certain direct paths (due to theory) making the model over-identified so that overall fit can be tested. For instance, a pure mediation hypothesis would omit the direct path X → Y, in which case our model above would have df = 1 (one fewer parameter than sample moments) and we could test if adding that direct path significantly improves fit.
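
As a brief sketch (reusing the simulated data and fit_med from above, with a new model name introduced here for illustration), the pure-mediation alternative can be fit and compared against the just-identified model via a likelihood-ratio (chi-square difference) test using lavaan's anova() method:

# Pure mediation model: omit the direct X -> Y path (over-identified, df = 1)
med_model_nodirect <- '
  M ~ a*X
  Y ~ b*M
  indirect := a*b
'
fit_med_nodirect <- sem(med_model_nodirect, data = simData)

# Chi-square difference test: restricted (no direct path) vs. full model
anova(fit_med_nodirect, fit_med)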

Interpreting Path Coefficients: In path analysis, each single-headed arrow corresponds to a regression coefficient. These can be interpreted much like regression betas: an unstandardized coefficient indicates the expected change in the outcome per unit change in the predictor, holding other predictors constant. Lavaan by default reports unstandardized estimates; by using standardized=TRUE in summary(), we get standardized coefficients (usually labeled Std.all), which are often easier to interpret in SEM context (since variables may be on different scales). In a standardized solution, a path coefficient is the expected number of standard deviation changes in the outcome for a 1 SD change in the predictor. In our example, the standardized indirect effect of X on Y is \((a \times b)_{\text{std}}\), which lavaan also computes. The significance of an indirect effect can be tested by its z or by bootstrapping (many prefer bootstrapped confidence intervals for indirect effects to account for non-normal sampling distribution of a product).
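
A minimal sketch of the bootstrap approach in lavaan (refitting the same med_model; the number of bootstrap draws below is arbitrary and chosen only for illustration):

# Refit the mediation model with bootstrapped standard errors
fit_med_boot <- sem(med_model, data = simData, se = "bootstrap", bootstrap = 500)

# Percentile bootstrap confidence intervals for all parameters,
# including the defined 'indirect' and 'total' effects
parameterEstimates(fit_med_boot, ci = TRUE, boot.ci.type = "perc")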

Covariances and Model Errors: Path models can include not only direct paths but also covariance links (represented by double-headed arrows in diagrams) between independent variables or error terms. For example, if X and M were exogenous covariates, one could allow X ~~ M (a covariance) in the model syntax. In mediation models, it is often assumed that the exogenous predictor X is uncorrelated with the error terms of M and Y (this is required for unbiased estimation of causal effects in observational mediation analysis). If one had multiple exogenous variables, allowing their covariance (i.e., specifying them as correlated) is typical. Lavaan’s syntax uses ~~ for (co)variances; e.g., X ~~ M would freely estimate the covariance between X and M. By default, lavaan automatically includes variances of each variable and covariances among exogenous variables, so the user often does not need to explicitly specify those unless certain covariances are fixed to zero or constrained.

In sum, path analysis is a straightforward SEM that focuses on observed variables. It is a confirmatory approach: one must specify which paths (directed arrows) are present or absent according to theory. The major benefit over running separate regressions is that path analysis estimates the whole system simultaneously, yielding correct standard errors and allowing latent (unobserved) error covariances to be specified. A major conceptual benefit is that all relationships are tested in concert, so one can assess the coherence of a complex hypothesis (e.g., multiple mediators or feedback loops, though non-recursive loops require special caution and techniques for identification).

Good Practices: When using path analysis, ensure the model is identified (see §13.3). Typically, recursive (acyclic) path models with proper handling of exogenous covariates are identified. Non-recursive models (with feedback loops or correlated errors between endogenous variables) may not be identified without additional constraints (for instance, instrumental-variable-type assumptions). It is also essential to ground the model in theory – because with only observed variables, equivalent models (with different causal directions) can often fit the data equally well (MacCallum, Wegener, Uchino, & Fabrigar, 1993). For example, the correlations among X, M, and Y could be explained by a model where X causes M and Y (our mediation hypothesis) or by a model where M causes X and Y, unless temporal or experimental evidence favors one direction. Thus, while path analysis can quantify a postulated causal chain, the interpretation that “X causes Y” via M is only as valid as the assumptions (no omitted confounders, correct directionality) on which it rests. Researchers should not rely solely on model fit to establish causality – instead, use subject-matter knowledge, study design, and perhaps additional tests (such as tests of alternative models) to build confidence in the causal interpretation (Bollen & Pearl, 2013).

Measurement Models: Confirmatory Factor Analysis (CFA)

The second key component of SEM is the measurement model, which relates latent variables (constructs that are not observed directly) to measured indicator variables. When we specify a measurement model without a structural component, we are essentially doing Confirmatory Factor Analysis (CFA) – testing hypotheses about how measured variables load on latent factors.

Latent Variables: Latent variables (often depicted as circles or ovals in diagrams) represent theoretical constructs such as intelligence, socioeconomic status, depression, etc., that cannot be measured directly but manifest through multiple observed indicators (test scores, survey items, etc.). By modeling latent variables, SEM allows us to account for measurement error explicitly: the idea is that each observed indicator is an imperfect reflection of the underlying construct, and CFA separates the true score variance (common factor) from the error variance (unique factors).

CFA Example: A classic dataset in SEM literature is the Holzinger and Swineford (1939) data, which contains scores of students on various mental ability tests. It has been widely used to illustrate CFA (Jöreskog, 1969; Harman, 1976; and in the lavaan documentation). The typical CFA model for this dataset posits 3 latent factors: a Visual factor measured by tests x1, x2, x3; a Textual factor measured by x4, x5, x6; and a Speed factor measured by x7, x8, x9. Each test loads on one factor (and not on others) in the confirmatory model, and the factors are allowed to correlate (since abilities may be correlated).

In lavaan syntax, we specify a CFA model using the =~ operator, which means “is manifested by.” For our example, the model could be written as:

HS_model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'
fit_cfa <- cfa(HS_model, data = HolzingerSwineford1939)  # built-in dataset in lavaan
summary(fit_cfa, standardized = TRUE, fit.measures = TRUE)

Here, visual =~ x1 + x2 + x3 indicates a latent factor visual that loads on indicators x1, x2, x3. Lavaan by default fixes the first loading of each factor to 1 (to set the scale of the latent variable) unless otherwise specified, and estimates the remaining loadings freely. Alternatively, one can fix the latent variance to 1 and estimate all loadings – both approaches establish a measurement scale for the latent variable (this is necessary for identification; see §13.3). The other two lines define the textual and speed factors similarly. By default, lavaan allows these latent factors to correlate (it assumes all exogenous latent variables are correlated). If we wanted them uncorrelated, we would specify visual ~~ 0*textual, etc., but theory expects ability factors to be intercorrelated, so we leave those covariances free.
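
As a small illustration of the alternative scaling convention, the same CFA can be fit with the factor variances fixed to 1 (std.lv = TRUE); overall fit is identical, only the parameterization changes:

# Same CFA, but identify the factors by fixing their variances to 1
# (all loadings are then freely estimated)
fit_cfa_std <- cfa(HS_model, data = HolzingerSwineford1939, std.lv = TRUE)
summary(fit_cfa_std, standardized = TRUE)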

Interpreting CFA Output: The summary of the CFA model will list, under “Latent Variables,” the estimated factor loadings for each indicator on its posited factor. For example, we might see something like:

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  visual =~                                                           
    x1                1.000                              0.770    0.869
    x2                0.554    0.100    5.54    <.001    0.426    0.748
    x3                0.728    0.120    6.07    <.001    0.561    0.779
  textual =~
    x4                1.000                              0.862    0.917
    x5                1.183    0.092   12.85    <.001    1.019    0.964
    x6                0.921    0.080   11.52    <.001    0.793    0.902
  speed =~
    x7                1.000                              0.452    0.651
    x8                1.183    0.201    5.88    <.001    0.534    0.742
    x9                1.001    0.180    5.57    <.001    0.451    0.583

(Note: These numbers are illustrative, not exact; they resemble outputs from similar models.) Each loading has an unstandardized estimate and a standardized value (Std.all). For instance, for the visual factor, x1 is a reference indicator (loading set to 1.0 by default, so no Std.Err or z for it) – its standardized loading is 0.869, meaning visual explains about \((0.869)^2 ≈ 75%\) of the variance of x1. The second indicator x2 has an estimated loading ~0.554, meaning in raw scale x2 is about 0.554 times as sensitive to visual as x1 is (since x1 was the unit loading). Its standardized loading is 0.748, implying about 56% variance explained (since \(0.748^2≈0.56\)). All loadings shown are significant, indicating that each test is indeed related to its intended factor.

The output also lists Variances of latent variables and residual variances of observed variables (labeled with a dot, e.g., .x1 for the residual of x1). For identification, each factor had either its variance or its first loading fixed; under lavaan’s default (marker-variable) approach, the first loading is fixed to 1 and the latent variances are freely estimated. Standardized latent variances are not reported separately in Std.all because, in the fully standardized solution, all observed and latent variables are rescaled to have variance 1.

We will also see Covariances among the factors (since by default factors are free to correlate). For example, there may be entries like:

Covariances:
             Estimate  Std.Err  z-value  P(>|z|)   Std.all
  visual ~~                                            
    textual    0.408    0.073    5.59   <.001      0.708
  visual ~~                                            
    speed      0.262    0.071    3.69   <.001      0.567
  textual ~~                                           
    speed      0.243    0.054    4.50   <.001      0.655

This indicates, for instance, the correlation between the latent visual and textual factors is around 0.708 (Std.all). All three factors are positively correlated, which makes sense (students who do well on visual tests also tend to do well on verbal/textual tests, etc.). Significant factor correlations suggest overlapping constructs, whereas non-significance (or small magnitude) might suggest factors are distinct. Researchers sometimes impose that certain factors be uncorrelated (orthogonal) if theory dictates; this can be tested by comparing model fit with and without those covariance constraints.
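
A quick sketch of such a comparison: refit the CFA with the factor covariances fixed to zero (lavaan's orthogonal = TRUE option) and compare the two models with a chi-square difference test:

# CFA with uncorrelated (orthogonal) factors
fit_cfa_orth <- cfa(HS_model, data = HolzingerSwineford1939, orthogonal = TRUE)

# Nested model comparison: does freeing the factor covariances improve fit?
anova(fit_cfa_orth, fit_cfa)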

Model Fit for CFA: Unlike the simple path analysis earlier, CFA models are often over-identified and thus yield meaningful goodness-of-fit statistics. In our 3-factor CFA with 9 indicators, the number of observed variances/covariances is \(9(10)/2 = 45\). The model estimates 2 free loadings per factor (the first is fixed) = 6, plus 3 latent variances, 3 latent covariances, and 9 residual variances (one per indicator), for a total of 21 free parameters. Degrees of freedom = 45 – 21 = 24. The chi-square test of model fit (likelihood ratio test) appears in the lavaan output. Suppose it is \(\chi^2(24) = 36.5, p = .05\) – a borderline fit (p ≈ .05 means the model-implied covariance matrix differs from the sample covariance matrix with marginal significance). We would then look at approximate fit indices: with fit.measures = TRUE, lavaan reports indices such as CFI, TLI, RMSEA, and SRMR. For a good-fitting model, we typically expect CFI and TLI near or above 0.95, RMSEA around or below 0.06, and SRMR below 0.08 (Hu & Bentler, 1999). If our hypothetical output gives CFI = 0.97, TLI = 0.95, RMSEA = 0.045, SRMR = 0.040, we conclude the model fit is good. (We discuss these indices further in §13.5.) If fit were poor, we would consider whether the model omits cross-loadings or residual correlations – modification indices can inform this, but modifications should be theory-driven to preserve the integrity of the confirmatory approach.
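
In lavaan, these fit statistics can also be extracted directly with fitMeasures(); a minimal example for the CFA fit:

# Extract the key fit statistics for the 3-factor CFA
fitMeasures(fit_cfa, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))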

Implications of CFA Results: In a measurement model, significant factor loadings support the validity of the proposed latent construct: each indicator is indeed measuring the underlying factor to some extent. High loadings (close to 1 standardized) indicate that most variance in that item is explained by the factor (little error), whereas moderate loadings indicate substantial unique variance or possibly that the item might have some unrelated content. Researchers use CFA to assess construct validity – does the data support that these items measure the intended separate constructs? If some loadings are very low or not significant, or if modification indices suggest an item would load better on a different factor, the researcher might reconsider the measurement model (perhaps the item is not pure or there is a secondary factor, etc.). Additionally, factor correlations inform discriminant validity (if factors are extremely highly correlated, one might question if they are truly distinct constructs or just one factor).

Identifying the Measurement Model: A CFA model must be identified. The general rule for factor models is that each latent factor’s scale must be set (fix one loading to 1 or fix the factor variance to 1). In addition, a standalone single-factor model needs at least three indicators: with three indicators it is just-identified, and with four or more it is over-identified and testable. With only two indicators, a standalone factor model is under-identified (three data moments but four free parameters) unless further constraints are imposed (e.g., equal loadings) or the factor is embedded in a larger model in which it covaries with other variables. With one indicator per factor, the latent variance and the residual variance cannot be separated without strong assumptions (effectively fixing the indicator’s reliability). In our example, each factor had three indicators – each factor’s measurement submodel is just-identified on its own, but the full three-factor model with factor covariances is over-identified (df = 24).

Mediation among latents: As an aside, one can specify a purely latent mediation model – e.g., X → M → Y where X, M, and Y are latent constructs, each measured by multiple indicators. Such a model combines CFA with path analysis; it is an example of the full SEM discussed next.

Using R for CFA – additional options: The cfa() function in lavaan is a wrapper around sem() that sets defaults appropriate for pure measurement models. We can request modification indices with modindices(fit_cfa) to see where model misfit might be reduced (e.g., an item cross-loading). However, adding parameters based on modification indices must be theoretically justified; otherwise one risks capitalizing on chance (MacCallum, 1986). We emphasize transparency: if modifications are made, they should be reported and, ideally, validated on a fresh sample. Another useful function is reliability() from the semTools package, which computes composite reliability (e.g., coefficient alpha or omega) for the factors based on the CFA results.
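
A short sketch of both diagnostics (the semTools package is assumed to be installed for the reliability computation):

# Largest modification indices (potential cross-loadings or residual covariances)
mi <- modindices(fit_cfa)
head(mi[order(mi$mi, decreasing = TRUE), ], 10)

# Composite reliability (e.g., omega) for each factor, based on the CFA solution
library(semTools)
reliability(fit_cfa)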

CFA as a stepping stone: Often, one conducts a CFA to establish a satisfactory measurement model before proceeding to a structural model that relates those latent variables. Good practice in SEM is the two-step approach (Anderson & Gerbing, 1988): first, verify the measurement model (CFA) has acceptable fit and sensible loadings; second, then test the structural relationships among the latent constructs (while usually retaining the measurement model specifications). We will follow this approach conceptually in the next section when examining a full SEM example.

Full Structural Equation Models (SEM with Latent Variables)

A full SEM merges one or more measurement models (CFA components) with a structural path model among the latent variables (and possibly some observed covariates). This is the most powerful form of SEM, as it allows one to test hypotheses about relationships between constructs while accounting for measurement unreliability.

Illustrative Example – Political Democracy model: We will use a classic example from Bollen (1989) concerning the effect of industrialization on political democracy, measured at two time points. This dataset, PoliticalDemocracy, is included in the lavaan package. The model (depicted by Bollen and often reproduced in SEM textbooks) posits three latent variables:

  • ind60: Industrialization in 1960, measured by three observed indicators (x1, x2, x3 which might be things like GNP per capita, percentage labor force in industry, etc.).
  • dem60: Democracy in 1960, measured by four indicators (y1–y4, perhaps ratings of freedoms, etc.).
  • dem65: Democracy in 1965, measured by four indicators (y5–y8, similar measures at a later time).

The structural part of the model says:

  • dem60 is regressed on ind60 (industrialization → democracy within 1960).
  • dem65 is regressed on both ind60 and dem60 (industrialization in 1960 and prior democracy both contribute to later democracy in 1965).

Additionally, based on theory, a few error covariances between certain democracy indicators at different time points are included (because the same survey items measured at different times may have correlated measurement errors). For example, y1 (democracy indicator in 1960) is allowed to covary with y5 (the parallel indicator in 1965); similarly y2’s error with y4 and y6, etc., as specified by Bollen’s model.

We specify this full SEM in lavaan syntax as shown (this corresponds to the model given in the lavaan documentation example):

model_poldem <- '
  # Measurement model
  ind60 =~ x1 + x2 + x3
  dem60 =~ y1 + y2 + y3 + y4
  dem65 =~ y5 + y6 + y7 + y8

  # Regressions (structural model)
  dem60 ~ ind60       # dem60 regressed on ind60
  dem65 ~ ind60 + dem60   # dem65 regressed on both ind60 and dem60

  # Residual covariances (correlated measurement errors)
  y1 ~~ y5
  y2 ~~ y4 + y6
  y3 ~~ y7
  y4 ~~ y8
  y6 ~~ y8
'
fit_poldem <- sem(model_poldem, data = PoliticalDemocracy)
summary(fit_poldem, standardized = TRUE, fit.measures = TRUE)

Let’s interpret the specification:

  • Under “Measurement model”, we define three latent factors with their respective indicators.

  • Under “Regressions”, we define the structural relations among latents (dem60 ~ ind60 and dem65 ~ ind60 + dem60). These correspond to two regression equations:

    • dem60 = β * ind60 + ζ (one predictor),
    • dem65 = γ1 * ind60 + γ2 * dem60 + ζ’ (two predictors).
  • Under “Residual covariances”, we allow certain pairs of y errors to covary. For example, y1 ~~ y5 means the error terms for y1 and y5 (which measure similar aspects of democracy at different times) are allowed to correlate. This adds parameters to the model to improve fit if those indicators share time-invariant characteristics not captured by the latent.

Interpreting Output of Full SEM: The lavaan output will be divided into sections. The measurement part will look similar to a CFA output for each factor (with loadings, etc.), and the structural part will show the regressions among latents. For instance, one might see:

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  ind60 =~                                                           
    x1                1.000                               0.67    0.920
    x2                2.180    0.139   15.742   <.001     1.46    0.973
    x3                1.819    0.152   11.967   <.001     1.22    0.872
  dem60 =~
    y1                1.000                               2.223   0.850
    y2                1.257    0.182    6.889   <.001     2.794   0.717
    y3                1.058    0.151    6.987   <.001     2.351   0.722
    y4                1.265    0.145    8.722   <.001     2.812   0.846
  dem65 =~
    y5                1.000                               2.103   0.808
    y6                1.186    0.169    7.024   <.001     2.493   0.746
    y7                1.280    0.160    8.002   <.001     2.691   0.824
    y8                1.266    0.158    8.007   <.001     2.662   0.828

(These estimates are in line with results reported in Rosseel, 2012.) These loadings show, for example, that for the ind60 factor, x2 has an unstandardized loading of 2.180, indicating that x2 is on a different scale; standardized, all three x’s have high loadings (0.87–0.97), meaning they are very strong indicators of industrialization (x2 likely has smaller variance and thus a larger raw loading). For dem60 and dem65, the loadings are also all significant and reasonably high (Std.all roughly 0.72–0.85), indicating that the democracy indicators are adequate measures of the latent democracy variable at each time point.

Next in output, we see the Regressions:

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  dem60 ~                                                               
    ind60            1.483    0.399    3.715   <.001     0.447    0.447
  dem65 ~
    ind60            0.576    0.219    2.629   0.009     0.171    0.173
    dem60            0.926    0.251    3.690   <.001     0.506    0.519

Interpreting these: The coefficient for dem60 ~ ind60 is 1.483 (p < .001). Standardized, it is 0.447, meaning that in 1960, countries one standard deviation higher on industrialization are about 0.45 SD higher on democracy (the positive value suggests higher industrialization is associated with more democracy). This is a sizable effect. For dem65, the effect of ind60 is 0.576 (p < .01), standardized 0.173. This indicates that even five years later, initial industrialization has a positive association with democracy in 1965, though a smaller one (much of the effect is indirect, via raising 1960 democracy, which then carries over). The effect of dem60 on dem65 is 0.926 (p < .001), standardized ~0.519, indicating considerable stability or autocorrelation in democracy: countries more democratic in 1960 tend to be more democratic in 1965. We can also compute the indirect effect of ind60 on dem65 via dem60 as the product of the two paths: 1.483 × 0.926 ≈ 1.374 on the unstandardized scale. Lavaan does not report this automatically unless we label the paths and define it as a new parameter (e.g., indirect := a*b; see the sketch below) or compute it by hand. The total effect of ind60 on later democracy is then about 0.576 + 1.374 = 1.950 unstandardized (standardized total ≈ 0.173 + 0.447 × 0.519 ≈ 0.41). Thus roughly 57% of the total influence of 1960 industrialization on 1965 democracy is mediated through 1960 democracy, with the direct path accounting for the remaining ~43% (these numbers are for illustration; one could formally include the indirect and total effects as defined parameters, as sketched below).
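
A sketch of that labeled specification (the same model as before, with labels a, b, and c added so the indirect and total effects are estimated and tested within the model):

model_poldem_ind <- '
  # Measurement model
  ind60 =~ x1 + x2 + x3
  dem60 =~ y1 + y2 + y3 + y4
  dem65 =~ y5 + y6 + y7 + y8

  # Structural model with labeled paths
  dem60 ~ a*ind60
  dem65 ~ c*ind60 + b*dem60

  # Residual covariances
  y1 ~~ y5
  y2 ~~ y4 + y6
  y3 ~~ y7
  y4 ~~ y8
  y6 ~~ y8

  # Defined parameters: indirect and total effects of ind60 on dem65
  indirect := a*b
  total    := c + a*b
'
fit_poldem_ind <- sem(model_poldem_ind, data = PoliticalDemocracy)
parameterEstimates(fit_poldem_ind)  # includes rows for 'indirect' and 'total'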

Then, the output will show Covariances for latent exogenous factors and any specified error covariances:

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  ind60 ~~                                                               
    dem60            0.000    (fixed)   --       --       0.000    0.000
  ind60 ~~                                                               
    dem65            0.000    (fixed)   --       --       0.000    0.000
  dem60 ~~                                                               
    dem65            0.000    (fixed)   --       --       0.000    0.000
  y1 ~~ y5           0.193    0.080    2.414    0.016     0.193    0.193
  y2 ~~ y4           0.249    0.084    2.955    0.003     0.249    0.249
  y2 ~~ y6           0.285    0.088    3.233    0.001     0.285    0.285
  y3 ~~ y7           0.262    0.077    3.397   <.001      0.262    0.262
  y4 ~~ y8           0.324    0.090    3.597   <.001      0.324    0.324
  y6 ~~ y8           0.215    0.081    2.653   0.008      0.215    0.215

Here the covariances among ind60, dem60, and dem65 appear as fixed zeros because their relations are modeled as regressions rather than free covariances (ind60 is the only exogenous latent, and dem60 and dem65 are endogenous with their predictors specified, so no latent covariances are freely estimated). The rest are the error covariances we added: all are significant, meaning that y1 & y5, y2 & y4, etc., indeed have residual correlations not explained by the factors. Including them improves model fit; these correlations may represent item-specific stability or method effects.

Finally, fit indices would be reported. In the lavaan tutorial output, the chi-square was 38.1 with df=35, p=0.329. That indicates a very good fit (p > .3, we fail to reject that model = data, which is desirable). CFI and TLI would likely be extremely high (maybe ~0.99), RMSEA very low (~0.02) given such a p-value, and SRMR likely low. Indeed, Bollen’s model was known to fit the data well, which strengthens confidence in the specified causal structure (though of course, equivalent causal models could exist, but at least this one is consistent with the data).

Interpretation: We would summarize that this SEM suggests industrialization has a positive effect on democracy both immediately and in the longer term, and that democracy exhibits substantial stability over time (the 1960 level strongly predicts the 1965 level). The measurement part tells us the latent constructs were measured reliably by their indicators (all loadings high and significant). Because we built a full SEM, the parameter estimates have been adjusted for measurement error: for example, the relationship between ind60 and dem60 (Std.all = 0.447) is likely larger than the raw correlation between a single industrialization proxy and a single democracy indicator would be, because here we are relating latent constructs free of measurement error. This is a key advantage of SEM – it provides disattenuated estimates of relationships between constructs. If one naively correlated an observed x1 indicator with a y1 indicator, that correlation would mix true association with measurement error, attenuating it. SEM, by modeling the error, aims to estimate the true correlation/effect at the construct level.

Model Modification and Re-specification: Suppose the model fit had not been good. One might examine modification indices (MIs) to see if any omitted paths or cross-loadings are suggested. For example, an MI might suggest that y2 has a secondary loading on dem65, or that an additional covariance between y3 and y7 errors should be added (which we did add). Any modifications should be substantively justified (perhaps y2 and y6 shared wording leading to correlated error, which indeed was posited). In confirmatory analysis, it is generally better to have theory drive modifications rather than purely data-driven fishing, to avoid overfitting (MacCallum, 1986). If major modifications are needed, it may imply the initial theory was incomplete.

Advantages of Full SEM: By estimating the measurement and structural parts together, we get correct standard errors and can do tests like the indirect effect significance within the model. Also, we can explicitly test complex hypotheses (e.g., whether the effect of ind60 on dem65 is fully mediated by dem60 – which we could test by seeing if direct path ind60→dem65 is significant or needed). We could constrain parameters and do chi-square difference tests. For instance, a question might be: did the measurement model parameters stay the same over time? We could test if the factor loadings for dem60 vs dem65 are equal (metric invariance) by a constrained model and a chi-square difference test. Such multi-group or longitudinal invariance tests are a strength of SEM, though beyond this chapter’s scope to detail, they are straightforward extensions where one adds equality constraints and compares model fit.
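
A sketch of such a constrained model: the shared labels d1–d3 force corresponding dem60 and dem65 loadings to be equal, and a chi-square difference test against the original fit_poldem evaluates the constraints:

model_poldem_eq <- '
  ind60 =~ x1 + x2 + x3
  # Equality-constrained loadings across time (shared labels d1-d3)
  dem60 =~ y1 + d1*y2 + d2*y3 + d3*y4
  dem65 =~ y5 + d1*y6 + d2*y7 + d3*y8

  dem60 ~ ind60
  dem65 ~ ind60 + dem60

  y1 ~~ y5
  y2 ~~ y4 + y6
  y3 ~~ y7
  y4 ~~ y8
  y6 ~~ y8
'
fit_poldem_eq <- sem(model_poldem_eq, data = PoliticalDemocracy)

# Chi-square difference test: equal loadings vs. freely estimated loadings
anova(fit_poldem_eq, fit_poldem)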

Visualizing the SEM: It is often helpful to present a diagram of the model. The semPlot package (Epskamp, 2015) can plot the fitted lavaan model. For example:

library(semPlot)
semPaths(fit_poldem, what="std", intercepts=FALSE, residuals=FALSE,
         layout="tree", nCharNodes=0, sizeMan=5, sizeLat=7)

This will produce a path diagram of the Political Democracy model with standardized estimates on the paths. Latent variables will be ovals, indicators rectangles, single-headed arrows for regressions with their beta values, and double-headed arrows for covariances (the residual covariances among the y’s will show as curved double-headed arrows connecting those indicators). An example path diagram from semPlot is shown in Figure 13.1.

Figure 13.1: Path diagram for the Political Democracy SEM example. Latent variables (ovals) include industrialization in 1960 (ind60), democracy in 1960 (dem60), and democracy in 1965 (dem65). Observed indicators (x1–x3 for ind60; y1–y4 for dem60; y5–y8 for dem65) are shown as rectangles. Single-headed arrows indicate hypothesized regressions: ind60 → dem60, ind60 → dem65, and dem60 → dem65 (the first and last forming a mediation chain). These paths correspond to the structural coefficients in the model (e.g., the standardized effect of ind60 on dem60 was ~0.45, and on dem65 via dem60 about 0.45 × 0.52 ≈ 0.23). Double-headed curved arrows represent covariances: here, covariances among latent exogenous variables (none in this model, since ind60 is the only exogenous latent) or between error terms of indicators (e.g., y1 ~~ y5, y2 ~~ y4, etc., as indicated by the curved arrows connecting specific indicators across time). The numeric values (not shown in this caption) on straight paths indicate standardized loadings or regression weights, and those on curved arrows indicate correlations. Overall, the diagram provides a visual summary of the SEM, showing how the measurement model and structural model components combine.

In presenting SEM results, one often provides such a diagram annotated with key parameter estimates (usually standardized) for clarity. Additionally, one would report fit indices (for our example: χ²(35)=38.1, p=0.33, CFI≈0.99, RMSEA≈0.03, etc.) to demonstrate the model fits well, and perhaps compare it to alternative models if relevant.

By now, we have covered specifying and interpreting each piece of an SEM. The examples illustrated:

  • Path analysis with observed variables (mediation),
  • Confirmatory factor analysis (measurement model),
  • Full SEM with latent variables and structural relations (Political Democracy example).

In the process, we touched on important concepts like indirect effects, measurement error, and model fit. Next, we delve more formally into some of those concepts: model identification (how do we know a model’s parameters can be uniquely estimated?), estimation methods (how the parameters are computed from data), and evaluating model fit with various criteria.

13.3 Identification of SEM Models

Before an SEM can be estimated, it must be identified. Identification refers to whether there is a unique solution for the model parameters given the population covariance matrix (or given the sample data). In practical terms, an identified model has enough information (data points) to estimate its unknown parameters. If a model is not identified (under-identified), one cannot obtain trustworthy or unique estimates – the model is essentially too complex for the data.

Basics of Identification: A necessary (though not sufficient) condition for identification is that the number of free parameters to be estimated is no more than the number of distinct pieces of information in the data. For covariance structure modeling, the data information comes from the covariance matrix of observed variables. If there are \(p\) observed variables, there are \(p(p+1)/2\) distinct covariances/variances (the size of the covariance matrix’s upper triangle). Let \(q\) be the number of free model parameters. We must have \(q \le p(p+1)/2\) to even possibly identify. If \(q = p(p+1)/2\), the model is just-identified (df = 0): it can reproduce the covariance matrix exactly and yields a unique solution but with no degrees of freedom to test fit. If \(q < p(p+1)/2\), the model is over-identified (df > 0): there are more data points than parameters, so potentially a unique best-fitting solution exists and we can test goodness-of-fit. If \(q > p(p+1)/2\), the model is under-identified (negative df): there are infinitely many solutions that can reproduce the data equally well – such a model is not useful.

The above counting rule is necessary but not sufficient because it assumes all parameters relate linearly independently to data moments. There are more subtle identification issues; for example, a model might pass the count rule but still not be identified due to multicollinearity or certain nonlinear constraints.

Identification in CFA/SEM: Each latent variable’s scale must be set for identification. Typically:

  • Factor loading identification: For each latent factor, either fix one loading to 1 or fix the latent variance to 1. This prevents the “factor variance could be anything and loadings scale accordingly” indeterminacy. In our examples, lavaan fixed the first loading = 1 for each factor by default.
  • Latent variable count: A single factor with only two indicators is under-identified on its own (three data moments vs. four free parameters once one loading is fixed); it becomes identified only with additional constraints (e.g., equal loadings) or when embedded in a larger model in which the factor covaries with other variables. With three indicators, a single-factor model is just-identified (df = 0): six data moments against six parameters (2 free loadings + 1 factor variance + 3 residual variances). At least four indicators are needed for an over-identified single-factor model (df > 0). So generally, three or more indicators per factor are recommended for identification, and four or more if a stand-alone factor model is to be testable.
  • Structural paths: If there are feedback loops (reciprocal causation), the model can become non-recursive and special conditions apply for identification (e.g., each reciprocal effect might need an instrumental variable – a variable that predicts one equation but not the other’s error – akin to econometric simultaneous equations). Recursive models (acyclic directed graphs) are usually identified given the measurement model identification.
  • Equality constraints or second-order factors: These can also affect identification. For example, a factor with only two indicators could still be over-identified if you fix the two loadings equal (then effectively one parameter less), but that’s an arbitrary constraint rarely justified.

A formal definition: “The parameters in Ω are identified if they can be expressed uniquely in terms of the elements of the covariance matrix Σ.” In other words, each parameter must correspond to a unique aspect of the data’s covariance structure. If two different parameter configurations produce the same implied covariance, those parameters are not identifiable individually.

Empirical Under-identification: Sometimes a model is mathematically identified in theory but data issues (like almost collinear indicators or zero variance) cause empirical non-identification (e.g., non-invertible information matrix). One might detect this if lavaan issues warnings like “matrix not positive definite” or if standard errors come out huge or NA. This typically means the model needs respecification or additional constraints.

Checking identification in practice: A useful check is whether the rank of the model’s information matrix equals the number of free parameters – full rank suggests the model is (locally) identified, whereas rank deficiency indicates linear dependencies among the parameters; a small sketch is given below. Additionally, if summary(fit) shows NA standard errors or negative degrees of freedom, that is a strong clue that something is wrong.
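
A minimal sketch of this check (using the CFA fit from earlier; qr() is base R, and "npar" is the number of free parameters reported by fitMeasures()):

# Rank of the (expected) information matrix vs. number of free parameters
info <- lavInspect(fit_cfa, "information")
qr(info)$rank                  # should equal the number of free parameters
fitMeasures(fit_cfa, "npar")   # number of free parameters in the model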

Just-Identified vs Over-Identified Models: It’s worth noting that just-identified models (df=0) will always “fit” perfectly (chi-square = 0), but that does not mean the model is true; it means with enough parameters one can always reproduce the data. Over-identified models allow a test of fit, which is valuable for validation. So, in SEM one often prefers over-identified models – degrees of freedom provide a test of how well the theoretical structure holds beyond trivial reproduction of data.

Example revisit: In our mediation example (§13.2), the model with both the direct and the indirect path was just-identified (df = 0). Had we not included the direct path X → Y, the model would be over-identified with df = 1 (one fewer parameter), and we could test whether excluding that path worsens fit significantly – a chi-square difference test with 1 df, essentially testing whether the direct effect is zero in the population (equivalent to asking whether the direct path’s estimate is significant). Such logic applies broadly: adding or removing paths corresponds to freeing or fixing parameters, which affects identification and fit.

Under-identification examples: A notorious under-identified case is factor indeterminacy when no scale is set: if we forget to fix a loading or the factor variance, there are infinitely many equivalent solutions (lavaan will detect this and fail or warn). Another instructive case is a single factor with two indicators: the data supply \(p(p+1)/2 = 3\) moments, but the model has 4 free parameters (1 free loading, 1 factor variance, 2 residual variances), so it is under-identified unless an extra constraint (such as equal loadings) is imposed. By contrast, two correlated factors with two indicators each are identified (provided the factor covariance is nonzero): the 4 indicators supply 10 moments, and the model has 9 free parameters (2 free loadings, 2 factor variances, 4 residual variances, 1 factor covariance), giving df = 1. The lesson is that identification depends on the model as a whole, not on each factor in isolation.

Identification of Structural Paths: If latent exogenous variables are perfectly collinear or a certain path is trying to estimate something that data cannot distinguish, identification issues arise. For example, if one tries to estimate two separate paths that always co-occur (like two predictors that sum to a total that’s constant), you might have trouble distinguishing their effects. In SEM with latent means (mean structure), identification of intercepts and means requires certain constraints (like fixing one intercept per factor to 0). Our chapter focuses on covariance structure (not means), but in a full SEM including means, one must also identify the means (e.g., fix factor mean to 0 in one group for reference, etc.).

In summary, identification is a crucial prerequisite. When building SEMs:

  • Always count degrees of freedom and ensure df ≥ 0.
  • Apply known rules (each latent needs a scale, need at least 3 indicators for a factor model to be safe, no “floaty” constructs without sufficient anchors).
  • Be wary of feedback loops and consult advanced sources for identification conditions in those cases (Bollen, 1989, ch. 7; Kaplan, 2000).
  • If a model is under-identified, it often helps to add constraints or data (e.g., fix small loadings to zero if theory permits, or add an extra indicator if possible).

13.4 Parameter Estimation in SEM

Once a model is specified and identified, the next step is to estimate the parameters. Parameter estimation in SEM is typically done by finding values of the free parameters that make the model-implied covariance matrix \(\Sigma(\Omega)\) as close as possible to the sample covariance matrix \(S\). Different estimation methods use different definitions of “close.” The most common estimation method (under multivariate normal assumption) is Maximum Likelihood (ML).

Maximum Likelihood Estimation: ML assumes the data (indicators) have a multivariate normal distribution. It derives a likelihood function for the observed sample given model parameters and finds parameter values that maximize this likelihood (or equivalently minimize the discrepancy between \(S\) and \(\Sigma(\Omega)\)). Under ML, the fitting function to minimize is essentially the log determinant and trace difference between \(S\) and \(\Sigma\) (the so-called ML discrepancy function \(F_{ML} = \log|\Sigma| + \text{tr}(S \Sigma^{-1}) - \log|S| - p\) for normal theory ML). The ML estimates have desirable large-sample properties: they are consistent (converge to true values as N→∞ if model is correct), efficient (minimum variance among unbiased estimators under normality), and asymptotically unbiased. ML also provides a natural test statistic: \(N \cdot F_{ML}\) approximately follows a \(\chi^2\) distribution (the model chi-square test) under the null hypothesis that the model holds in the population.
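
As an illustration, the ML discrepancy can be recomputed by hand from lavaan’s sample and model-implied covariance matrices; this is a sketch, and the exact scaling of the test statistic (N versus N − 1) depends on lavaan’s settings:

# Recompute F_ML for the CFA model and compare N * F_ML to the reported chi-square
S     <- lavInspect(fit_cfa, "sampstat")$cov   # sample covariance matrix used by lavaan
Sigma <- lavInspect(fit_cfa, "implied")$cov    # model-implied covariance matrix
p     <- nrow(S)
F_ML  <- log(det(Sigma)) + sum(diag(S %*% solve(Sigma))) - log(det(S)) - p
N     <- lavInspect(fit_cfa, "nobs")
N * F_ML   # should be close to fitMeasures(fit_cfa, "chisq")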

In lavaan, estimator="ML" is the default for continuous data, and the output we saw consists mostly of ML estimates. For example, in the political democracy model, ML estimation yielded those coefficients after 68 iterations (the optimizer converged). The “Estimator” field in the lavaan output confirms ML, and the header also notes which optimizer was used (NLMINB by default).

Other Estimation Methods:

  • Generalized Least Squares (GLS): An alternative that also assumes normality but uses a different fitting function; under normal theory it yields estimates asymptotically equivalent to ML (though the two can differ in efficiency when assumptions are violated). Jöreskog & Goldberger (1972) developed GLS for SEM.
  • Weighted Least Squares (WLS): Used especially for non-normal or categorical data. If indicators are ordinal (Likert scales, etc.), one often uses Diagonally Weighted Least Squares (DWLS, also called WLSMV in Mplus terminology) which is robust to non-normality. Lavaan supports estimator="WLSMV" for ordinal data.
  • Robust ML: If data are continuous but not normally distributed, one can use “MLR” in lavaan (maximum likelihood with robust standard errors and scaled test statistic) or “MLM”. These adjust the chi-square and SEs using a sandwich correction (Satorra-Bentler correction).
  • Bayesian estimation: SEM can also be estimated via Bayesian MCMC methods (e.g., using blavaan package). This is useful if sample sizes are small or one wants to incorporate prior information. It yields a posterior distribution for parameters rather than just point estimates.
  • ULS (Unweighted Least Squares): Minimizes raw differences without weighting by precision; rarely used, but sometimes for ordinal data with small sample where WLSMV might struggle.
  • PLS (Partial Least Squares Path Modeling): Not covariance-based but variance-based approach – often considered outside the SEM (covariance structure analysis) family, used when focus is on prediction rather than confirmatory testing (we won’t delve into PLS as it’s somewhat separate, but note it exists).

In this chapter, we focus on ML and related robust methods, as they are most common in confirmatory SEM.
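
As a brief, hedged illustration of how these estimator choices are requested in lavaan (the model syntax `model`, the data frame `dat`, and the item names are placeholders, not objects defined earlier):

```r
library(lavaan)

fit_ml  <- sem(model, data = dat)                       # default: normal-theory ML
fit_mlr <- sem(model, data = dat, estimator = "MLR")    # robust ML: robust SEs, scaled test
fit_cat <- sem(model, data = dat,
               ordered = c("item1", "item2", "item3"),  # declare ordinal indicators
               estimator = "WLSMV")                     # diagonally weighted least squares
```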

Iteration and Convergence: SEM parameters are usually obtained by iterative numerical methods (except in special cases such as the saturated or independence model, which have closed-form solutions). Estimation starts with initial guesses (often simple noniterative solutions, e.g., instrumental-variable-type estimates) and iteratively improves the fit, using methods such as Newton-Raphson, Fisher scoring, or quasi-Newton (BFGS) updates. Lavaan’s message “ended normally after X iterations” indicates convergence. If a model fails to converge (the iteration limit is reached without improvement), the cause may be poor starting values, a problematic model (nearly unidentified, or a flat likelihood surface), or data issues. Remedies include simplifying the model or providing better starting values (lavaan accepts user start values via start=, and std.lv=TRUE, which fixes factor variances to 1, sometimes helps). One should not trust estimates when convergence warnings appear.

Standard Errors and Test Statistics: Under ML, standard errors of the estimates are obtained from the inverse of the Hessian (second derivatives) of the log-likelihood – essentially the Fisher information matrix. If the data are not normal, those SEs can be inaccurate, which is why robust methods are available. Lavaan output by default reports “Standard errors: Standard” and “Information: Expected” (or “Observed”, depending on the setting); “Expected” means the model-implied information matrix was used for the SEs. Robust methods (MLM, MLR) provide a Satorra-Bentler scaled chi-square and adjusted SEs.

Goodness-of-Fit Testing: As mentioned, ML provides a chi-square test of exact fit: \(T = (N-1)\,F_{ML}\) (some programs use \(N\) rather than \(N-1\); the difference is negligible in large samples) is asymptotically \(\chi^2_{df}\). This is reported in lavaan as the “Test Statistic” with its df and p-value. In our examples, the mediation model had df = 0 and hence no test, the CFA had, e.g., \(\chi^2(24)\), and the political democracy model had \(\chi^2(35)=38.1, p=0.329\), indicating no significant misfit. This test, however, is known to be sensitive to sample size, and its null hypothesis of perfect fit is often too strict in practice (Browne & Cudeck, 1993). That is why approximate fit indices (CFI, RMSEA, etc.), covered in the next section, are also used.

Estimation under Missing Data: Many SEM programs (including lavaan with missing="ML") can handle missing data directly via Full Information Maximum Likelihood (FIML), which uses all available data under a missing-at-random (MAR) assumption and avoids listwise deletion. This goes beyond the basics but is important to note: in lavaan one can fit an SEM with missing values, without imputation, by requesting FIML.
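
A minimal sketch, again with `model` and `dat` as placeholders for a model syntax and a data frame containing scattered NAs:

```r
# FIML: cases with partially observed data contribute to the likelihood (MAR assumed)
fit_fiml <- sem(model, data = dat, missing = "ML")
summary(fit_fiml, fit.measures = TRUE)
```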

Computational Difficulty: Estimating an SEM can be demanding when models are large or complex (many latent variables, etc.). Convergence problems can arise if the model is nearly unidentified or the starting values are far off. One common issue is Heywood cases – negative variance estimates (most often a residual variance going negative). A negative variance is not theoretically admissible; it indicates the model is pushing a parameter out of bounds to fit the data as well as possible. This can result from a small sample, or from an indicator that correlates so strongly with its factor that the model implies the factor accounts for more than 100% of its variance. In such cases, one might fix the offending variance to a small positive value or constrain it (some use Bayesian priors to keep it in bounds). Respecifying the model (e.g., dropping that indicator or allowing an extra parameter such as a secondary loading) can also resolve Heywood cases.

In summary, estimation in SEM aims to minimize the discrepancy between model and data. ML is predominant because of its statistical properties, but one must check its assumptions (especially normality). If assumptions are violated, robust or alternative estimators should be used – for instance, with ordinal data WLSMV is essential, because ML that treats ordinal items as continuous can yield biased loading estimates and an underestimated chi-square. Lavaan conveniently switches to WLSMV when variables are declared as ordered factors.

For teaching purposes, it is useful to show that ML estimation of a single regression equation in SEM yields the same coefficients as OLS regression when the usual assumptions hold. SEM’s added value is that ML estimates multiple interrelated equations simultaneously and optimally, rather than equation by equation.
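
The equivalence is easy to verify with lavaan’s built-in Holzinger–Swineford data; the choice of variables below is purely illustrative:

```r
library(lavaan)
data(HolzingerSwineford1939)

# The same regression fitted by OLS and by lavaan's ML
ols  <- lm(x1 ~ x2 + x3, data = HolzingerSwineford1939)
sem1 <- sem('x1 ~ x2 + x3', data = HolzingerSwineford1939)

coef(ols)    # OLS slopes
coef(sem1)   # lavaan ML slopes: essentially identical (residual variance differs by (N-1)/N)
```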

Finally, note that certain parameters might be on boundaries (like variance = 0). In such cases, standard errors might be non-symmetric. Advanced users can request profile likelihood CIs or use bootstrapping for more accurate intervals, especially for things like indirect effects where normal approximation might be poor.
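
A sketch of bootstrap confidence intervals in lavaan; the model syntax, data frame, and any defined parameters (such as an indirect effect labeled via :=) are assumed from an earlier specification:

```r
fit_boot <- sem(model, data = dat, se = "bootstrap", bootstrap = 1000)
parameterEstimates(fit_boot, boot.ci.type = "perc", level = 0.95)  # percentile bootstrap CIs
```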

13.5 Assessing Model Fit

Evaluating how well the model reproduces the observed data is a critical step in SEM. Unlike in standard regression (where \(R^2\) or residual plots indicate fit), SEM provides both an absolute fit test (the chi-square test of model vs saturated model) and various fit indices to judge approximate fit.

Chi-Square Test of Exact Fit: We have already introduced the \(\chi^2\) test. It tests \(H_0\): “The model-implied covariance matrix \(\Sigma(\Omega)\) equals the population covariance matrix” (i.e., the model perfectly holds in population). A non-significant chi-square (p > .05) indicates we do not reject that hypothesis – meaning the model could be true, and the data are consistent with the model. A significant chi-square (p < .05) indicates the model is an unlikely explanation for the covariances (significant misfit). In practice, with large N, even tiny discrepancies yield significant chi-square. Conversely, with small N, even big misfits might not be detected. Thus, reliance solely on chi-square is discouraged (Bentler & Bonett, 1980; Marsh et al., 1988). Instead, researchers consider approximate fit indices that are less sensitive to N and can quantify degree of misfit.

Common Fit Indices: The most widely reported indices include:

  • Comparative Fit Index (CFI) – compares the model’s chi-square to that of a baseline (usually independence model where variables are uncorrelated). It ranges 0–1, with higher meaning better fit. By convention, CFI ≥ 0.95 is considered indicative of good fit (Hu & Bentler, 1999).
  • Tucker-Lewis Index (TLI) (also called Non-Normed Fit Index, NNFI) – similar to CFI but penalizes model complexity more heavily. TLI ≥ 0.95 is also a common benchmark.
  • Root Mean Square Error of Approximation (RMSEA) – a parsimony-adjusted index measuring discrepancy per degree of freedom, with a penalty for model complexity. An RMSEA < 0.05 indicates close fit, ~0.05-0.08 acceptable fit, and >0.10 poor fit (Browne & Cudeck, 1993). Hu & Bentler (1999) suggested < 0.06 as a stringent cut-off for good fit.
  • Standardized Root Mean Square Residual (SRMR) – the standardized difference between observed and model correlation matrices (on average). SRMR < 0.08 is a common rule of thumb for good fit (Hu & Bentler, 1999).
  • Chi-square/degrees of freedom ratio – sometimes reported as “\(\chi^2/df\)”. Rules of thumb vary (some say < 2 or < 3 is acceptable), but this is a crude measure; RMSEA, which formalizes a similar idea, is preferable.

Lavaan can report these via fitMeasures(fit) or in summary if fit.measures=TRUE. For our previous examples:

  • Mediation model (just-identified) had no fit indices (everything would be perfect by default).
  • CFA example likely had CFI ~ .97, TLI ~ .95, RMSEA ~ .05, SRMR ~ .04 (we posited).
  • Political Democracy model had excellent fit: e.g., CFI ~ 0.99, RMSEA ~ 0.02, SRMR ~ 0.05 (implied by non-significant chi-square and moderately large df).
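
For reference, the indices above can be pulled from a fitted lavaan object (name assumed) as follows:

```r
fitMeasures(fit, c("chisq", "df", "pvalue", "cfi", "tli",
                   "rmsea", "rmsea.ci.lower", "rmsea.ci.upper", "srmr"))
# or, equivalently, request them in the summary:
summary(fit, fit.measures = TRUE)
```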

Interpreting Fit Indices: Ideally, multiple indices are considered together, and one looks for a consistent message. For instance, if CFI and TLI are high (> .95) and RMSEA is low (< .05), we conclude the model fit is very good. If some indices conflict (e.g., CFI > .95 but RMSEA = .10), one must diagnose why. This can happen with small samples or when the model has few degrees of freedom (RMSEA penalizes low-df models harshly; indeed, RMSEA can be misleading when df is small, as noted by Kenny, Kaniskan, & McCoach, 2015). In such cases, one might rely more on SRMR and CFI. Recent literature (e.g., Shi, Lee, & Terry, 2019) warns against rigid cut-offs and encourages looking at the overall pattern and at theory.

Modification Indices: If fit is not good, one can inspect modification indices (MI) for suggestions of model improvements. A modification index for a fixed or omitted parameter (like a cross-loading or error covariance not in model) tells how much chi-square would drop if that parameter were freed. A high MI (relative to chi-square) suggests a potential improvement. But adding parameters risks overfitting and data mining – it’s essentially doing exploratory modeling. Best practice: if an MI makes theoretical sense (e.g., two items share wording -> correlated error), one could adjust the model and report it. Do not add modifications solely to boost fit without theoretical justification; this leads to models that fit idiosyncrasies of sample and may not replicate.
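
In lavaan, modification indices can be listed and sorted as follows (fitted object assumed):

```r
modindices(fit, sort. = TRUE, maximum.number = 10)   # the ten largest suggested changes
```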

Residuals: Another way to examine misfit is to look at residual correlations (difference between observed and model-implied covariances). Lavaan’s resid(fit) can list largest residuals. If, say, the correlation between x2 and y3 is much higher in data than model implies (and model didn’t allow them to be related), that residual indicates a path or common cause missing. Large residuals (like > .1 in correlations) suggest model strain.
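
A quick sketch of inspecting residual correlations for a fitted lavaan model:

```r
resid(fit, type = "cor")   # observed minus model-implied correlations
# lavResiduals(fit) provides a more detailed residual summary in recent lavaan versions
```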

Model Fit Example Interpretation: In presenting results, one might write: “The CFA model fit the data well, \(\chi^2(24)=36.5, p=0.05\). Other fit indices indicated close fit (CFI=0.97, TLI=0.95, RMSEA=0.048 (90% CI [0.01, 0.08]), SRMR=0.040).” Providing a 90% CI for RMSEA is common; it helps understand precision of RMSEA. If the upper bound of RMSEA CI is below 0.08, that’s a good sign. If it crosses 0.1, some would question fit.

Comparative Fit (Alternative Models): Sometimes researchers compare multiple models. For example, a one-factor model vs a three-factor model for the HS data – one can do a chi-square difference test or compare CFI. A significant chi-square drop or increase in CFI indicates the more complex model fits better. Chi-square difference testing is done by anova(fit1, fit2) in lavaan (requires models be nested). If non-nested, one might use AIC or BIC for model comparison (smaller is better). Our chapter focus is one model at a time, but in practice, model comparisons are used to test hypotheses (e.g., measurement invariance steps, testing if a path can be set to zero).
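
A hedged sketch of such a comparison for the Holzinger–Swineford data, pitting a one-factor model against the usual three-factor model:

```r
library(lavaan)

hs_one   <- 'g =~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9'
hs_three <- 'visual  =~ x1 + x2 + x3
             textual =~ x4 + x5 + x6
             speed   =~ x7 + x8 + x9'

fit_one   <- cfa(hs_one,   data = HolzingerSwineford1939)
fit_three <- cfa(hs_three, data = HolzingerSwineford1939)

anova(fit_one, fit_three)                  # chi-square difference test (nested models)
fitMeasures(fit_three, c("aic", "bic"))    # information criteria for non-nested comparisons
```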

When Models Don’t Fit: If you have poor fit (CFI < 0.90, RMSEA > 0.10 or such), it suggests the model is missing something important. Solutions include:

  • Reevaluate your theory – maybe a latent factor was omitted, maybe items measure multiple factors.
  • Check the data – are there outliers or assumption violations? Consider robust estimation if outliers are present or multivariate normality fails.
  • Increase model complexity carefully (add covariance or cross-loading) if justified.
  • Consider that the hypothesized model might simply be wrong for this data; one might then test a different model.

Overfitting Concern: It’s possible to add enough parameters to get too good fit (especially with many indicators relative to sample size). One should be cautious of just aiming for fit indices at thresholds – the goal is a substantively interpretable and parsimonious model that fits well. There is an ongoing debate: Some methodologists emphasize not to blindly chase .95 CFI or .05 RMSEA as “magic numbers” (Marsh, Hau, & Wen, 2004; Hayduk et al., 2007). They argue to also consider if the model makes sense, whether modifications are theoretically sound, and to report all this transparently.

Summary on Fit: We recommend reporting:

  • The chi-square and df (and p-value, though one might note it’s sensitive to N).
  • CFI and TLI.
  • RMSEA with CI and SRMR.
  • Possibly also the number of parameters and N, because fit can be interpreted in context (e.g., a complex model with borderline fit might be acceptable if N is huge because trivial misfit can trigger chi-square).
  • If any modifications were made, report them (e.g., “We allowed error of item 5 and 6 to correlate as modification indices suggested and because those items were similarly worded; this improved fit from CFI .90 to .95.”).

According to the latest recommendations (e.g., APA 7th style reporting standards for factor analysis/SEM), one should justify that the model achieved adequate fit before interpreting parameters. In our examples, we proceeded to interpretation because we ensured the models fit well.

13.6 Interpretation of Estimated Parameters and Latent Constructs

After achieving a well-fitting model, researchers interpret the parameter estimates to make substantive conclusions. Interpretation in SEM occurs at two levels: measurement level (the meaning of latent constructs and quality of measures) and structural level (the relationships among constructs).

Interpreting Measurement Model (Loadings and Latent Variables): Factor loadings indicate how strongly each observed variable is related to the latent factor. A high standardized loading (say 0.8 or 0.9) suggests the indicator is a good measure of the factor (it shares ~64% or 81% variance with factor, respectively). Low loadings (say 0.4) suggest the indicator has more unique variance than common variance – perhaps it’s not a pure measure or is noisy. One might consider dropping very low-loading items in scale development contexts, though in confirmatory settings usually the items are predetermined.

The sign of a loading matters too: If an item is reverse-coded relative to factor definition, it might come out with a negative loading unless you modeled it appropriately. Ideally, all items should be oriented such that higher latent means higher observed scores (or vice versa consistently).

Latent variable variances are usually of less direct interest (because the latent scale is arbitrary, the variance depends on how the latent variable was scaled). If the first loading is fixed to 1, the latent variance is expressed in the raw units of that first indicator; in the standardized solution, the latent variance is 1 by definition. More interesting are the latent correlations (covariances): these tell how much the constructs overlap. If two latent factors correlate at 0.9, they are almost indistinguishable empirically – one might ask whether they are really separate constructs or whether a single factor would suffice. If they correlate moderately (0.5), they share some common causes but also have distinct elements. A correlation near zero would indicate discriminant validity (the factors capture different things). In our HS example, factor correlations were around 0.7, showing substantial but not complete overlap among the ability domains.

Composite Reliability and AVE: As part of measurement interpretation, one might compute reliability indices like coefficient omega or composite reliability for each factor. For example, for a factor with standardized loadings \(\ell_i\) and indicator error variances \(d_i\), composite reliability CR = \((\sum \ell_i)^2 / [(\sum \ell_i)^2 + \sum d_i]\). This can be obtained with semTools. Also Average Variance Extracted (AVE) = mean of \(\ell_i^2\) usually, indicating the average proportion of variance in indicators explained by the factor. A rule is AVE > 0.50 indicates the factor explains at least half the variance of its indicators on average, a sign of convergent validity.
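
A sketch using the semTools package; function names have shifted across versions, so treat this as illustrative, with `fit_cfa` standing for a CFA fitted earlier:

```r
library(semTools)
reliability(fit_cfa)    # alpha, omega-type composite reliability, and AVE per factor
# compRelSEM(fit_cfa)   # newer semTools function for composite reliability
```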

Interpreting Structural Paths: A structural regression coefficient (between two latent constructs, or a latent and observed covariate) represents the predicted change in the dependent variable (in SD units if standardized) for a 1 (SD) change in the predictor, holding other predictors constant. It’s analogous to regression slope interpretation. For instance, in the democracy example, the standardized path from ind60 to dem60 was ~0.45. So one could say “Industrialization in 1960 has a moderately strong positive effect on democracy in 1960 (β = .45, p < .001): countries one standard deviation higher in industrialization score about 0.45 standard deviations higher in democracy.” The path from dem60 to dem65 was ~0.52, which we interpret as “prior level of democracy strongly predicts subsequent democracy (β = .52).”

Direct, Indirect, Total Effects: When mediation or more complex indirect pathways exist, one should interpret not just direct effects but also indirect effects. In our mediation example, we found X had no significant direct effect on Y controlling for M, but a significant indirect effect through M. Interpretation: “X influences Y primarily through its effect on M. The indirect effect (ab) was 0.37 (p < .001), whereas the direct effect (c’) was near zero (0.04, n.s.), indicating full mediation.” For clarity, giving standardized values or even the percentage mediated can be useful (e.g., 100% of the total effect was via M in that case, since the direct effect was ~0). In the democracy example, one might note “The effect of industrialization on later democracy is largely mediated by initial democracy levels. The indirect effect (ind60 → dem60 → dem65) is significant, about 2–3 times the direct effect ind60 → dem65.” If these quantities are defined in lavaan (see the sketch below), one can quote them directly.
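
For reference, a sketch of how such effects are labeled and defined with := in lavaan syntax, using the structural part of the political democracy model (the residual covariances of the full model are omitted here for brevity, so fit will differ somewhat):

```r
library(lavaan)

model_ind <- '
  # measurement model
  ind60 =~ x1 + x2 + x3
  dem60 =~ y1 + y2 + y3 + y4
  dem65 =~ y5 + y6 + y7 + y8
  # structural model with labeled paths
  dem60 ~ a * ind60
  dem65 ~ b * dem60 + c * ind60
  # defined quantities
  indirect := a * b
  total    := c + a * b
'
fit_ind <- sem(model_ind, data = PoliticalDemocracy)
parameterEstimates(fit_ind)   # rows for the := definitions appear at the end of the table
```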

Latent Means (if any): Our examples didn’t include latent means or mean structure (we only considered covariances). If one had multiple group SEM or growth models, interpreting latent means would be another aspect (like mean differences between groups or intercept terms). In CFA with a reference group, latent means can be compared. This goes into measurement invariance testing territory. We won’t expand on it here, but note that if one extends SEM to include means (by modeling intercepts and factor means), interpretation of an intercept is the expected value of indicator when factor = 0, and latent mean differences indicate group differences on constructs. It’s analogous to ANOVA/regression intercepts but on latent level.

Standardized vs Unstandardized: Reporting standardized estimates is generally recommended for ease of interpretation, especially when variables are on different scales or lack inherently meaningful units. Unstandardized estimates are nevertheless important when comparing across samples or when formally testing differences (since standardization changes with the variances). In the output we usually inspect Std.all (which standardizes both the latent and observed variables). There is also Std.lv (which standardizes only the latent variables), sometimes reported when one wants loadings expressed in the original metric of the items. APA recommends providing either standardized coefficients or some effect-size measure; Kline (2016) suggests reporting both for key paths if space permits. Tables often show the unstandardized estimate (SE) alongside the standardized beta.
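
In lavaan, both metrics are available from a fitted object (name assumed):

```r
summary(fit, standardized = TRUE)   # adds Std.lv and Std.all columns to the output
standardizedSolution(fit)           # fully standardized estimates with their own SEs
```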

Interpreting Error Variances: Each observed variable has an error variance (for indicators this is often denoted θ(ε)). In the standardized solution with a single factor, the error variance equals 1 − (loading)², the proportion of variance unexplained by the factor. A very high error variance means the item had low reliability in the context of the model. If two indicators share a residual covariance (e.g., y1 ~~ y5, standardized ≈ .19 in our example), that parameter can be interpreted directly: after accounting for latent democracy, the errors of y1 and y5 still correlate about .19, presumably because of some shared method or wording factor. We included those covariances precisely because ignoring them hurt fit, implying that some residual correlation really exists (and we acknowledge it in interpretation, for example as item-specific stability across time).

Significance and CIs: We should interpret the significance of parameters (their z or p values). Typically one focuses on whether the structural paths differ significantly from zero, since that addresses the hypotheses (e.g., “does X affect Y controlling for M?”); we reported these in the examples. If a path is not significant, one might drop it in a trimmed model (although if theory says it belongs, one may retain it and simply report its non-significance). The usual cautions about multiple comparisons apply, but in SEM each tested parameter is normally specified a priori, so adjustments for multiplicity are not always made (except during exploratory modification phases).

Confidence intervals for estimates can also be given (lavaan can provide them via ci=TRUE). For important effects (like an indirect effect), reporting a 95% CI is good practice, especially if using bootstrap (e.g., “the 95% bootstrap CI for the indirect effect was [0.18, 0.55]”).

Practical vs Statistical Significance: With large N, small effects can be significant. So consider effect sizes (β values). A β of 0.1 may be significant with N=1000, but it’s a weak relationship. Researchers should discuss whether an effect is meaningful substantively, not just significant. Conversely, a β of 0.3 might fail significance in small sample but is moderate in size – careful interpretation needed.

Multiple-group SEM: If analyzing multiple groups, interpretation includes examining whether parameters differ by group. For example, if we fit the democracy model to two regions, we might compare whether the effect of ind60 on dem65 is stronger in one region than in the other. That involves invariance testing and group-specific estimates. The interpretation would then note the differences: “In group A the path was β = 0.6 (p < .01), while in group B it was β = 0.2 (n.s.); the Δχ² test indicated significant moderation by group” – suggesting the effect may be context-dependent.

Transparency in Interpretation: It is essential to tie parameter interpretations back to theory. For instance, rather than just stating numbers, say “This finding supports the theory that early industrial development fosters democratic institutions, as indicated by the significant positive path from Industrialization 1960 to Democracy 1960 and indirectly to Democracy 1965.” Similarly, for measurement “The confirmatory factor analysis shows that our survey items have high loadings on the intended latent constructs, supporting the construct validity of our measures of social support and stress.”

13.7 SEM and Causal Inference in the Social Sciences

One of the appeals of SEM is that it aligns with researchers’ causal theories: the path diagram is often a causal diagram. However, drawing causal conclusions from SEM requires careful consideration of assumptions. We reiterate some points and expand on how SEM contributes to causal inference:

  • Causal Interpretation Requires Assumptions: SEM coefficients can be interpreted as causal effects only if we assume the model structure is correct, there are no unmodeled confounders (no omitted variables causing spurious associations), and the error terms are uncorrelated for substantively defensible reasons. These are strong assumptions. In an experiment, randomization can justify them (if X is randomized, an SEM of X→M→Y could potentially estimate mediation effects causally). In observational studies, one must argue that controlling for certain variables (or the model’s structure) is sufficient to satisfy the conditional independence assumptions. Bollen (1989) and Bollen & Pearl (2013) discuss how SEM connects to the logic of causal graphs: each omitted arrow implies a causal independence assumption.

  • Model-Based Reasoning: SEM encourages researchers to explicitly lay out a causal model (even if they don’t call it that). By doing so, it forces clarity about what is influencing what and what is merely correlated. This is valuable in social science, where theories are often about latent constructs affecting each other over time or across levels. SEM provides a way to test if a hypothesized causal structure could be true given the data (if fit is good and estimates align with predictions). However, note the mantra “correlation is not causation” remains true: a well-fitting SEM does not prove the causal story – it only shows it’s consistent with one possible causal explanation. There could be alternative models that fit equally well (equivalent models problem; MacCallum et al., 1993). For example, a model where democracy leads to industrialization might also fit the data nearly as well; without temporal precedence or other theory, one should be cautious.

  • Handling Causal Direction and Feedback: Standard SEM handles recursive (unidirectional) relations well. If we believe feedback loops exist (X ↔︎ Y influence each other), one can specify a non-recursive model, but identification is tricky and one often needs external instruments. It’s usually better, if possible, to incorporate time: measure X at T1 and Y at T2, then a recursive path X1→Y2 is more defensible as causal direction (still assumption that no Y1 effect on X1 occurred or is fixed etc., but longitudinal data help ordering). Longitudinal cross-lagged panel models in SEM are commonly used to study reciprocal causation over time.

  • Latent Variables and Causality: By using latent constructs, SEM can remove measurement error bias in causal effect estimates. Measurement error in predictors typically attenuates regression coefficients. By modeling the predictor as latent (with multiple indicators), SEM yields a more accurate estimate of its effect on outcome (adjusted for unreliability). This is a major boon for causal inference – it’s easier to detect true effects if measurement error is accounted for. In our examples, if industrialization or democracy were measured with error, failing to account would have biased their relationship, but using latent factors corrects for that (assuming the CFA is valid). This is something regression without SEM can’t easily do (except using instrumental variables for reliability, which is akin to factor model).

  • Causal Discovery vs SEM: SEM is typically confirmatory – you posit a causal model and test it. It does not discover causal structure from data by itself. However, researchers sometimes use modification indices or compare alternative models in an exploratory fashion to infer what causal links might exist (risky without strong theory). There are also causal-discovery algorithms in machine learning (e.g., score-based structure searches) that look for well-fitting structures, but these are advanced and require assumptions of their own. Generally, the role of SEM in causal inference is this: given a hypothesized causal graph (with latents), SEM can estimate the causal effects (path coefficients) and test model fit. It is then up to the researcher to argue the plausibility of the causal assumptions.

  • Integrating with Directed Acyclic Graphs (DAGs): One can translate an SEM path diagram into a DAG (if there are no latent variables, or by treating latents as nodes with the appropriate implications). Using Pearl’s do-calculus, one could in principle derive which coefficients represent causal effects. For example, in a mediation model, the product a*b is the indirect causal effect if the DAG has no unaccounted confounding between X and M or between M and Y. SEM itself does not perform do-calculus; it estimates the parameters of the specified structure from the observed covariances. But conceptually, a properly specified SEM corresponds to a structural causal model (SCM) in Pearl’s terms (Pearl, 2000), in which each structural equation is a causal mechanism. If one further assumes that the errors (ζ’s) are independent (no unmodeled common causes), then the SEM coefficients equal causal effects. Bollen & Pearl (2013) emphasize that many “myths” exist, such as “SEM proves causation” (a myth – it does not without assumptions) or the opposite, “SEM is useless for causation” (also a myth; SEM is useful when combined with causal assumptions).

  • Causal Effects and Interventions: One limitation of SEM is that it typically assumes linear relations and no strong interactions unless modeled. If one wanted to simulate an intervention (say increase X by 1), the SEM suggests Y would change by β. If model assumptions hold, that prediction is valid. But if there’s unobserved confounding, that β wasn’t actually causal, so an intervention might not yield predicted change. This is why one cannot automatically take SEM estimates as policy guidance without careful thought. Some have combined SEM with propensity score methods or other causal adjustment to strengthen causal claims (e.g., SEM with a selection model etc.).

In practice, many social scientists use SEM to support causal narratives, especially when randomized experiments are infeasible. For example, an economist might use SEM to model how education (latent construct measured by years and quality) affects earnings, controlling for ability (latent measured by IQ etc.) – that’s essentially a structural equation approach to causal inference under certain assumptions of no omitted variables. The credibility comes from whether all major confounders are included and whether model tests hold.

To conclude on this point: SEM’s contribution to causal inference is providing a rigorous framework to encode causal theories (with latent variables) and test if the data are consistent with those theories. It helps in estimating causal effect sizes while accounting for measurement error. It does not automatically prove causation – but it is a powerful tool for evaluating causal hypotheses, especially in combination with longitudinal data, experimental manipulations (where parts of model are randomized), or strong theoretical rationale that certain paths should be zero. Transparency in assumptions is crucial; SEM forces you to state “we assume no direct effect of X on Y except through M” (by omitting that path) – a causal assumption that can then be scrutinized. In sum, SEM is as much a thinking tool as a statistical tool for causality, structuring the way we reason about complex cause-effect systems in social science.

13.8 Transparency and Reproducibility in SEM Analysis

Conducting SEM in an open and reproducible manner is increasingly emphasized. There are several aspects to this:

  • Reporting and Documentation: One should report the model specification clearly – ideally providing the lavaan model syntax or a path diagram so that readers know exactly which paths were estimated, which were fixed to 0, and which errors covaried. The APA style guidelines for SEM reporting (e.g., from Hoyle & Isherwood, 2013) suggest including a diagram or a detailed description of model equations. Additionally, all key results (estimates, SEs, fit indices) should be included either in text or a table. By providing this information, others could replicate the model.

  • R Markdown for Reproducibility: Using an R Markdown (or Quarto) document to integrate code and narrative, as we have done in this chapter, is an excellent practice. It ensures that the results presented (figures, values) are directly generated from the code and data. This avoids transcription errors and makes it easier for others to rerun analyses. For example, all code blocks in this chapter can be executed to reproduce the outcomes (assuming the same dataset). This end-to-end reproducibility is critical in science. We encourage sharing such R Markdown or script files as supplementary material in publications.

  • Version control and collaboration: Tracking changes to SEM syntax or data preprocessing using version control (Git) can help in auditing how results may change with different decisions (e.g., if you tried a model without a certain covariance and later added it). Keeping a record of these analyses decisions is part of being transparent.

  • Open Data and Open Materials: Where possible, providing the dataset (or a covariance matrix of the data) and analysis code allows others to verify results or apply the model to other samples. Some journals now mandate this. With SEM, sometimes data cannot be fully shared due to privacy, but at least providing the covariance matrix and means used would allow others to reproduce the SEM results (since SEM only needs summary stats typically). If not the raw data, a full correlation matrix with SDs and N is good.

  • Model Modifications Disclosure: If you modified your model after initial fit (data-driven adjustments), it is important to disclose that. E.g., “We added a covariance between Item 3 and 5 errors based on modification indices; this was a post-hoc modification to improve fit and should be interpreted with caution.” Noting such decisions helps others understand the degree of confirmatory vs exploratory analysis in your SEM.

  • Pre-registration: In some cases, researchers pre-register their SEM analysis plan (especially in fields like psychology where replication is a concern). This can involve specifying the model a priori, any planned comparisons or invariance tests, etc. Pre-registration adds credibility that the model wasn’t tuned excessively to the data.

  • Sensitivity Analyses: For robust results, one might report if conclusions hold under different reasonable modeling choices (e.g., did removing an outlier or using robust ML change things? Did an alternative measure of a construct lead to similar structural path estimate?). While not always done, it’s a good practice when possible, enhancing trust in the findings.

  • Software and Package Versions: Because SEM results can sometimes differ slightly with different software (due to optimizers or defaults for handling missing data, etc.), it’s wise to note the software and version (e.g., “Analyses were conducted in R 4.2.1 using lavaan 0.6-9 (Rosseel, 2012)”). This allows exact replication. The open-source nature of R and lavaan facilitates reproducibility much more than proprietary software where exact algorithms might be opaque.

  • Reproducible Examples: When learning, we often rely on examples in documentation. For instance, the lavaan package has several demo models (like the PoliticalDemocracy) which are documented so anyone can run the same model. This fosters a culture where results are not black boxes; anyone can run the example and see how it works, then apply to their data.

Transparency also extends to discussing limitations: if the model has equivalent alternatives, say so. If certain assumptions (like multivariate normality) were checked (or not), mention it. Perhaps you performed a normality test or saw some skew and thus chose robust SEs; report that choice and rationale.

In summary, using tools like R Markdown for literate programming, sharing data or at least model-implied structures, and fully describing your model and estimation process are all part of making SEM analyses transparent and reproducible. This not only increases trust in your results but also helps accumulate knowledge, as others can build on or critique your model with clarity about what was done.

13.9 Advantages and Limitations of SEM in Applied Research

To wrap up, we reflect on why one would use SEM and what to be cautious about when using it.

Advantages:

  • Latent Variable Modeling: SEM allows explicit modeling of latent constructs, which controls for measurement error. This yields more reliable estimates of relationships between constructs than using raw observed scores (assuming the measurement model is correct). It also enables testing of construct validity within the analysis.
  • Complex Models: SEM can handle multiple dependent variables and interrelated equations simultaneously. For example, one can model a whole theoretical network (like in our democracy example, one latent affecting another which affects another). Traditional regression would require multiple separate models and wouldn’t capture the covariance of outcomes or indirect effects naturally.
  • Testing Mediation and Indirect Effects: SEM is ideal for mediation analysis, especially with multiple mediators or multistep pathways. It provides formal tests for indirect effects and easily extends to multiple mediator chains or parallel mediators.
  • Model Fit Assessment: Unlike many other statistical techniques, SEM provides a holistic test of how well the model fits the data. This can prevent overfitting in a sense that if the model doesn’t fit, you know your theory might be missing something. It pushes researchers to consider alternative models or added paths if justified.
  • Flexibility: The SEM framework encompasses many statistical models: regression, ANOVA (via multiple group SEM), factor analysis, path analysis, growth curve models, etc., are all special cases of SEM. Once you learn SEM, you have a unifying framework for many analyses. For example, a multivariate regression is an SEM with all manifest variables and no latent, a confirmatory factor analysis is SEM with no regressions among factors, etc.
  • Handling Missing Data: Modern SEM software implements full information ML for missing data, allowing analysis without ad-hoc imputation or listwise deletion, often under MAR assumption.
  • Multiple Group Analysis: SEM makes it straightforward to test if a model holds equivalently across groups (e.g., genders, cultures) by imposing equality constraints and comparing fit. This is invaluable for testing measurement invariance and structural invariance – something classic regression struggles to do in one coherent analysis.
  • Longitudinal Modeling: Latent growth curve models (a form of SEM) enable sophisticated analysis of change over time, handling unequally spaced time points, individual differences in trajectories, etc., all within the SEM framework.
  • Integration of Data Types: SEM has extensions to handle categorical data (through WLSMV or robust methods) and non-normal data. Count outcomes require generalized SEM, where lavaan’s abilities are limited, so one might turn to Mplus or OpenMx. Multilevel SEM for nested data is also possible (lavaan supports two-level models through its multilevel syntax).
  • Theory Development: Perhaps the biggest advantage in social sciences: SEM forces researchers to be explicit about their theory (which variables affect which). This clarity can improve theory over time, as misfit points to theory needing refinement. It encourages thinking in terms of systems of relationships rather than bivariate links.

Limitations:

  • Requirement of Large Sample Sizes: SEM is a large-sample technique. The ML chi-square test assumes asymptotic distribution. Many parameters need stable estimation. A common heuristic is at least N = 200 for a simple model, more if the model is complex (Bentler & Chou, 1987 recommended 5 or 10 cases per parameter, others have refined this). With small N, SEM can yield convergence problems or very unstable estimates (and fit statistics aren’t reliable). There are small-sample corrections and Bayesian approaches as alternatives, but generally sample size is a concern. Simulation studies (e.g., Wolf et al., 2013) show that for moderate models, N=100 or less can lead to nonconvergence or inaccurate SEs.
  • Model Uncertainty and Equivalent Models: A given covariance matrix can often be explained by different causal models. SEM won’t tell you if an alternative model fits equally well unless you test it. For example, X → Y → Z vs X ← Y → Z might both fit if correlations are symmetric. Without experimental directionality or temporal ordering, SEM can’t distinguish cause from mere association – you must bring in theory or other evidence. Equivalent models (models that produce identical covariance predictions) exist in many cases (MacCallum et al., 1993 found they often occur). This is a limitation because one might be lulled into thinking one’s model is “proven” whereas an equally good but causally different model is possible. The researcher must acknowledge this and perhaps test key alternatives or argue why their chosen model is preferred.
  • Sensitivity to Model Misspecification: If key variables are omitted or wrong assumptions made (like forcing zero covariances that aren’t zero), results can be biased. Omitted latent common causes can bias path estimates (just like omitted confounders in regression). While SEM tests overall fit and can hint at problems, sometimes sample size issues or model constraints can mask problems (e.g., if power is low, chi-square might not detect moderate misfit).
  • Complexity and Interpretability: SEM output can be complex. For readers new to the method, the myriad parameters and fit indices can be daunting, and there is a risk of misinterpretation (e.g., taking a high CFI as proof of the theory, or over-focusing on fit while neglecting what the estimates mean). It requires skill to interpret and present SEM results without confusing the audience. Also, with many parameters there is potential for p-hacking (fitting many models until one fits well and not reporting the others). This is addressable through transparency, as noted, but it remains a concern.
  • Difficulty of Estimating Certain Models: Some models (especially those with feedback loops or many latent interactions) are hard or impossible to estimate with standard SEM software. Nonlinear effects can be incorporated (e.g., latent moderated structural equations for latent interactions), but they add complexity and typically require large samples. SEM assumes linear relations (or whatever specific form is specified), so if the true relations are non-linear, fit may degrade – or worse, a linear model may fit adequately over the observed range while giving misleading extrapolations, so the non-linearity goes unnoticed.
  • Data Quality Requirements: Being a correlational technique, SEM inherits the usual garbage-in, garbage-out problems. Outliers, data missing not at random, and heavy-tailed non-normal distributions can all affect results. Robust methods exist, but one must still screen the data carefully and perhaps complement SEM with more exploratory analysis first to check whether the assumptions hold.
  • Computational Intensity: With very large datasets or very big models (say 100 observed variables, 10 factors), estimation can be slow or even memory-intensive. Modern computers handle moderately large SEMs fine, but extremely large models might need specialized approaches (like component-based SEM or divide-and-conquer strategies).
  • Overfitting Concern: With enough parameters (especially if the sample size is not large), one can overfit idiosyncrasies of the sample. For instance, adding a set of correlated errors will always improve fit, but doing so without theory may simply soak up noise. Overfitted models replicate poorly. Simplicity and parsimony are thus virtues – a model should be as simple as possible but not simpler (chi-square difference tests and information criteria can help guard against over-parameterization).
  • Interpretation of Latent Variables: While an advantage, latent factors also present the challenge of definition – they are theoretical constructs, sometimes what exactly the factor represents is debatable. Two different sets of indicators might model similar but not identical constructs. So one has to ensure the latent is well-defined conceptually and that all indicators are appropriate. Otherwise you get a “latent” that is statistically there but conceptually muddy.

Despite limitations, SEM remains a powerful approach when used judiciously. Many limitations can be mitigated: large sample – plan studies to collect more data or use simpler models if N is small; equivalent models – test alternatives, use longitudinal data to break symmetry; model misspecification – do careful theory work and perhaps collect additional measures to guard against omitted variables. The benefits of SEM often outweigh the downsides, especially for research questions involving complex networks of relationships and unobservable constructs.

Final Thoughts: In this chapter, we demonstrated how to conduct SEM analyses in R with the lavaan package, providing both the theoretical background and practical implementation. SEM is both an art and a science – it requires theoretical insight to specify meaningful models and statistical rigor to evaluate them. For graduate-level readers, mastering SEM opens up many possibilities in research design and analysis, allowing one to test comprehensive theories rather than isolated effects. By understanding the assumptions, being careful in interpretation, and adhering to transparent, reproducible practices, one can harness SEM to draw insightful conclusions about causal processes in the social sciences and beyond.

References

Bentler, P. M. (1980). Multivariate analysis with latent variables: Causal modeling. Annual Review of Psychology, 31(1), 419–456.

Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107(2), 238–246.

Bollen, K. A. (1989). Structural Equations with Latent Variables. New York: Wiley.

Bollen, K. A., & Pearl, J. (2013). Eight Myths About Causality and Structural Equation Models. In S. L. Morgan (Ed.), Handbook of Causal Analysis for Social Research (pp. 301–328). New York: Springer. https://doi.org/10.1007/978-94-007-6094-3_15

Epskamp, S. (2015). semPlot: Unified visualizations of structural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 22(3), 474–483. https://doi.org/10.1080/10705511.2014.937847

Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6(1), 1–55.

Jöreskog, K. G. (1973). A general method for estimating a linear structural equation system. In A. S. Goldberger & O. D. Duncan (Eds.), Structural Equation Models in the Social Sciences (pp. 85–112). New York: Academic Press.

Kaplan, D. (2000). Structural Equation Modeling: Foundations and Extensions. Newbury Park, CA: Sage.

Kline, R. B. (2016). Principles and Practice of Structural Equation Modeling (4th ed.). New York, NY: Guilford Press.

Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge, UK: Cambridge University Press.

Pearl, J. (2012). The causal foundations of structural equation modeling. In R. H. Hoyle (Ed.), Handbook of Structural Equation Modeling (pp. 68–91). New York: Guilford Press.

Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. https://doi.org/10.18637/jss.v048.i02

Wright, S. (1934). The method of path coefficients. The Annals of Mathematical Statistics, 5(3), 161–215.