13 Geospatial Statistics and Econometrics in R
Social phenomena are inherently spatial. Where events occur and how places relate to one another can profoundly shape economic, political, and social outcomes. Ignoring the spatial context of data can lead to biased analyses and missed insights. As Waldo Tobler famously stated in the First Law of Geography, “everything is related to everything else, but near things are more related than distant things”. This notion of spatial dependence implies that observations from nearby locations tend to be more similar than those further apart, a pattern that recurs in many social science contexts – from clustered voting behaviors to regional economic convergence.
Spatial analysis provides tools to measure and model these geographic interdependencies. In quantitative social sciences, spatial analysis helps answer questions such as: Do conflicts diffuse across neighboring countries? Is there a clustering of high-income regions or poverty “hot spots”? How do policies or shocks spill over across borders? For example, civil conflicts often exhibit regional clustering, with neighboring countries experiencing higher conflict likelihoods than distant ones. Policies like trade liberalization or public health interventions can generate spillover effects in adjacent areas, meaning one region’s outcomes may be influenced by its neighbors’ characteristics. By incorporating space explicitly, researchers can move beyond aspatial models (which assume observations are independent) to capture interdependence and diffusion processes critical in fields like political science, economics, and public health.
Importantly, spatial analysis in social sciences encompasses both spatial statistics – describing and visualizing spatial patterns – and spatial econometrics – modeling spatial relationships and dependencies in regression frameworks. Embracing spatial thinking encourages researchers to consider phenomena like spatial clustering, regional heterogeneity, and diffusion networks in their theories and empirical strategies. It also forces careful attention to issues such as the definition of neighborhood or distance (often formalized via a spatial weights matrix W), and the modifiable areal unit problem (MAUP), wherein results may depend on the choice of spatial aggregation. Throughout this chapter, we will introduce foundational concepts of geospatial analysis with applications to the social sciences, demonstrate key methods for spatial statistical analysis (e.g. measures of spatial autocorrelation, geostatistical interpolation), and explore spatial econometric models that extend conventional regression to account for spatially dependent data. We will use the R programming language, taking advantage of modern spatial packages, to illustrate these techniques with real-world examples in economics and geopolitics. Each section integrates R code (formatted as it would appear in an R Markdown document) to encourage reproducibility and to show how analysis and documentation can be combined in a single workflow.
By the end of the chapter, readers should appreciate why “location matters” – analytically and substantively – in quantitative social research. They will gain practical knowledge of handling spatial data in R, computing indicators like Moran’s I, building spatial econometric models (such as spatial lag and error models), and interpreting results in light of policy-relevant questions. The overarching goal is to equip social scientists with tools to incorporate geographic context into their analyses, leading to more accurate inferences and a deeper understanding of spatial processes shaping social and economic outcomes.
13.1 Types of Spatial Data and Sources
Spatial data come in many forms, chiefly categorized into vector and raster data. Vector data represent discrete features using geometries: points (e.g. cities or conflict event locations), lines (e.g. roads, rivers), and polygons (e.g. country or district boundaries). These are often stored in formats like ESRI Shapefiles or GeoPackage. Raster data, by contrast, represent continuous spatially distributed phenomena (or imagery) as a grid of cells or pixels (each with a value), commonly stored in formats like GeoTIFF. For example, a raster might represent satellite imagery or a continuous surface such as population density or elevation. In social science research, rasters could be used for things like conflict intensity surfaces or interpolated economic indicators across space, whereas vectors are used for delineated units like administrative regions or specific locations of interest.
Each data type has different sources and use-cases. High-quality boundary files for administrative units can be obtained from databases like GADM (Global Administrative Areas), which provides detailed shapefiles for all countries and their subdivisions. GADM is frequently used to map country-, state- or district-level data (for instance, plotting subnational GDP or poverty rates) because of its high resolution and comprehensive coverage. Another widely used source is Natural Earth, a public domain collection of vector and raster maps at multiple scales. Natural Earth is ideal for creating base maps (such as country outlines, coastlines, or urban markers) and is often used for global or continental-scale mapping due to its cartographic quality and permissive license.
Social scientists also rely on specialized open data portals. For instance, the UNHCR Refugee Data Finder provides country-level and sometimes subnational data on refugee populations, internally displaced people, and asylum seekers worldwide. These data can be joined with geographic boundaries to analyze the spatial distribution of refugee populations. Notably, as of 2024, 67% of refugees lived in countries neighboring their country of origin, underscoring the spatial clustering of refugee crises. The World Bank’s DataBank and WDI (World Development Indicators) offer socioeconomic indicators by country (and increasingly by subnational regions in some cases), which can be linked to shapefiles for mapping and spatial analysis. For conflict research, georeferenced event datasets are crucial: the ACLED (Armed Conflict Location & Event Data Project) and UCDP/PRIO Armed Conflict Dataset provide granular point data on conflict incidents (with coordinates, dates, and attributes). These point data can be aggregated to regions or analyzed as spatial point patterns to study conflict diffusion.
Other useful sources include Natural Earth rasters (for elevation, land cover, etc.), NASA SEDAC or WorldPop for gridded population and socioeconomic data, and national statistical agencies or GIS portals for country-specific spatial datasets (such as census boundary files or health district maps). Many of these datasets are freely available and designed for compatibility with open-source tools. For instance, both GADM and Natural Earth can be easily read in R using the sf or terra packages. The GADM project even distributes R-ready files, and Natural Earth provides convenient shapefile downloads that can be used directly.
R and Spatial Data Formats: In R, the sf package (Simple Features for R) has become the standard for vector data handling. It represents spatial objects (points, lines, polygons) as data frames with geometry columns, making it easy to filter, join, and plot spatial data with familiar data frame syntax. For rasters, the terra package is the modern successor to the older raster package, supporting large rasters and providing methods for raster algebra, extraction, and transformation. Both sf and terra interoperate well (for example, you can convert an sf polygon to a SpatRaster mask or rasterize points).
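As a small illustration of that interoperability, here is a minimal sketch (hypothetical layer and column names) of rasterizing a polygon attribute with terra:
library(terra)
library(sf)
v <- vect(polys_sf)                        # convert an sf polygon layer to a SpatVector
r <- rast(v, res = 0.1)                    # empty raster template over the layer's extent
val_r <- rasterize(v, r, field = "value")  # burn the numeric attribute into raster cells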
Example (Reading Spatial Data in R): To illustrate, we might load country boundaries from Natural Earth and join with a socioeconomic indicator from the World Bank:
library(sf)
library(rnaturalearth)
library(tmap)

# Load Natural Earth country boundaries (scale 1:50m) via the rnaturalearth package
countries <- ne_countries(scale = "medium", returnclass = "sf")

# Suppose we have a data frame of GDP per capita by country (iso3 code and value):
gdp_data <- read.csv("world_bank_gdp_percap_2021.csv")

# Join the GDP data to the spatial frame by ISO country code
countries <- merge(countries, gdp_data, by.x = "iso_a3", by.y = "ISO3")

# Plot GDP per capita choropleth
tm_shape(countries) +
  tm_fill("GDP_per_capita", palette = "Blues", style = "quantile", title = "GDP per Capita") +
  tm_borders()
In an R Markdown document, the above code chunk could be embedded to produce a map figure. The result would be a world choropleth where countries are colored by GDP per capita, allowing us to visually inspect spatial patterns such as high-income regions clustering (e.g. in North America and Western Europe) and lower-income regions in other parts of the world.
Beyond country-level data, researchers often deal with subnational data (e.g., provinces, counties, grid cells) for finer analysis. In such cases, sources like GADM (for administrative boundaries) or statistical agencies become vital. For example, one could obtain district-level shapefiles of a country from GADM and merge them with district-level unemployment rates to map regional inequality. The key is to ensure that the spatial data and the statistical data share a common identifier (such as country code, region name, or a consistent index) for merging.
In summary, assembling spatial datasets in R typically involves: identifying authoritative spatial data sources for the region/unit of analysis; reading those data (using sf::st_read() for shapefiles/GeoPackages or terra::rast() for rasters); and joining them with attribute data from sources like the World Bank, UN agencies, or other data portals. Table join operations (merge or dplyr::left_join) associate statistical indicators with geometries. With data in hand, analysts can then explore spatial patterns and perform the statistical analyses we discuss next.
13.2 Spatial Statistical Methods
Spatial statistical methods help quantify and model two key aspects of spatial data: spatial autocorrelation (the correlation of a variable with itself across space) and spatial heterogeneity (differences in statistical relationships across space). In this section, we cover foundational techniques including global and local measures of spatial autocorrelation (e.g. Moran’s I, Geary’s C, and local indicators like LISA), geostatistical concepts like semivariograms and kriging for continuous spatial processes, and hot spot analysis for identifying clusters of high or low values.
Spatial Autocorrelation: Moran’s I and Geary’s C
Spatial autocorrelation refers to the degree to which observations from nearby locations are more alike (positive autocorrelation) or more dissimilar (negative autocorrelation) than would be expected by chance. It is essentially the spatial analog of temporal autocorrelation. Formally, a global measure of spatial autocorrelation for a variable \(x\) over \(N\) locations can be evaluated by Moran’s I, defined as:
\(I = \frac{N}{S_0} \frac{\sum_{i}\sum_{j} w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^2},\)
where \(w_{ij}\) are elements of a spatial weights matrix W (defining the neighborhood structure), \(\bar{x}\) is the mean of \(x\), and \(S_0 = \sum_{i}\sum_{j}w_{ij}\) is the sum of all weights. Intuitively, Moran’s I is a weighted correlation coefficient between a variable and the spatially lagged average of itself at neighboring locations. Values of I typically range from about –1 (strong negative autocorrelation, indicating a checkerboard pattern of high-low values) through 0 (no spatial autocorrelation) to +1 (strong positive autocorrelation, indicating clustering of similar values).
A related metric is Geary’s C (Geary 1954), which is inversely related to Moran’s I. Geary’s C focuses on differences between neighboring values rather than covariance. Its formula (omitted here for brevity) yields C ≈ 1 under spatial randomness; C < 1 indicates positive autocorrelation (small differences among neighbors), and C > 1 indicates negative autocorrelation (large differences among neighbors). In practice, Moran’s I is more commonly reported because of its easier interpretation as a correlation-like measure. Both statistics require inference via permutation tests or normal approximations to assess significance – essentially checking whether the observed clustering/dissimilarity could be due to random chance given the spatial arrangement.
Example – Moran’s I in R: Suppose we have data on regional GDP per capita for 50 regions and a spatial neighbors list nb defining which regions border which. We can compute Moran’s I using the spdep package (whose model-fitting functions were later split off into spatialreg, while the autocorrelation tests remain in spdep):
library(spdep)
# nb is a neighbors list for the 50 regions; build a listw (spatial weights) object
lw <- nb2listw(nb, style = "W")  # row-standardized weights
# Test for spatial autocorrelation in GDP per capita
moran.test(regions_sf$gdp_per_capita, listw = lw)
This outputs the Moran’s I statistic, its expectation under the null hypothesis (\(E(I) = -1/(N-1)\)), and a p-value (moran.test uses an analytical approximation; moran.mc provides a permutation-based alternative). For instance, we might see output indicating I = 0.45 and p < 0.001, suggesting strong positive spatial autocorrelation (rich regions tend to neighbor other rich regions, and poor regions cluster with poor neighbors). We could compute Geary’s C similarly with geary.test(regions_sf$gdp_per_capita, listw = lw).
It is often informative to visualize spatial autocorrelation through a Moran scatterplot: a plot of \(x_i\) (horizontal axis) against the spatially lagged values \(\sum_j w_{ij}x_j\) (vertical axis), often standardized. With row-standardized weights, the slope of the OLS fit line in this scatterplot equals Moran’s I. Points in the upper-right quadrant (high values surrounded by high values, labeled HH) and lower-left quadrant (low surrounded by low, LL) indicate clustering of similar values, while upper-left (low surrounded by high, LH) and lower-right (high surrounded by low, HL) points suggest spatial outliers.
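spdep can draw this plot directly; a minimal sketch using the objects from the example above:
# Moran scatterplot: observed values against their spatial lag
moran.plot(regions_sf$gdp_per_capita, listw = lw,
           xlab = "GDP per capita", ylab = "Spatially lagged GDP per capita")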
Local Indicators of Spatial Association (LISA) and Hot Spots
Global measures like Moran’s I provide a single summary of spatial autocorrelation over the entire study area. However, spatial patterns are often not uniform: there may be localized clusters or hotspots. Local Indicators of Spatial Association (LISA), introduced by Anselin (1995), measure spatial autocorrelation at each location i, allowing one to map where significant clustering or outliers occur. The Local Moran’s I for location i is one such measure. It essentially evaluates \(I_i = z_i \sum_j w_{ij} z_j\) (where \(z_i\) are deviations from the mean) and yields a LISA cluster map identifying high-high clusters (HH), low-low clusters (LL), and spatial outliers (high value surrounded by lows, HL; or low surrounded by highs, LH). Locations with significant local Moran’s I (based on permutation inference) can be colored to show these categories. For example, an election results dataset might show HH clusters of high support for a party in certain neighboring counties, and LL clusters of low support in another region, indicating regional strongholds and weak areas, with perhaps a few HL outlier counties that buck their neighbors’ trend.
Another popular local statistic is the Getis-Ord \(G_i^*\) statistic for hot spot analysis. \(G_i^*\) identifies concentrations of high values (hot spots) or low values (cold spots) by comparing the local sum of values in the neighborhood of i (including i itself) to the global sum. Essentially, \(G_i^*\) is high when i and its neighbors collectively have high values relative to the dataset, and low when they have low values. Unlike Local Moran, \(G_i^*\) does not detect HL or LH outliers; it focuses on direct clustering of high or low magnitudes. Hot spot analysis is widely used in criminology (e.g., to find crime hot spots), public health (disease clusters), and development (poverty clusters).
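A minimal \(G_i^*\) sketch with spdep (the counties_sf layer and unemp_rate column are hypothetical); including each unit in its own neighbor set is what turns \(G_i\) into \(G_i^*\):
library(spdep)
nb_star <- include.self(nb)                 # add i to its own neighborhood (Gi -> Gi*)
lw_star <- nb2listw(nb_star, style = "B")   # binary weights are conventional for Gi*
gi_star <- localG(counties_sf$unemp_rate, lw_star)
counties_sf$gi_z <- as.numeric(gi_star)     # z-scores; |z| > 1.96 ~ hot/cold at 95%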
Figure: Hot and Cold Spot Map. The map below shows an example of a hot spot analysis of a socioeconomic indicator – the unemployment rate by county in the contiguous United States. Red areas indicate hot spots (counties with high unemployment surrounded by high-unemployment neighbors), while blue areas indicate cold spots (low-unemployment clusters), computed using the Gi* statistic with a fixed-distance neighborhood. Gray areas are not significant at the 95% confidence level.
Hot and cold spots of unemployment rate by county in the contiguous U.S., identified by the Getis-Ord \(G_i^*\) statistic (red = high-unemployment hot spots, blue = low-unemployment cold spots). Darker shades indicate significance at the 99% confidence level.
Interpreting LISA and hot spot outputs requires careful consideration of the scale of analysis (e.g., choice of neighbors or distance band) and multiple comparison issues (many local tests are performed simultaneously). Tools like GeoDa, ArcGIS, and R’s spdep/sf can compute these indicators. In R, one can use spdep::localmoran() for Local Moran’s I, which returns for each region i its local I, Z-score, and p-value. There are also higher-level wrappers (e.g., the rgeoda or sfdep packages) that integrate with sf objects for ease of mapping results.
Example – Local Moran in R: Continuing the earlier GDP per capita example, after computing localI <- localmoran(regions_sf$gdp_per_capita, lw), we can augment our regions_sf data frame with the results and map the significant HH, LL, HL, and LH areas:
localI <- localmoran(regions_sf$gdp_per_capita, lw)
regions_sf$Ii   <- localI[, "Ii"]         # Local Moran's I
regions_sf$Iz   <- localI[, "Z.Ii"]       # Z-score
regions_sf$pval <- localI[, "Pr(z > 0)"]  # p-value

# Determine cluster type from the variable and its spatial lag
x      <- regions_sf$gdp_per_capita
lag_x  <- lag.listw(lw, x)
is.sig <- regions_sf$pval < 0.05
regions_sf$cluster <- NA
regions_sf$cluster[is.sig & x > mean(x) & lag_x > mean(x)] <- "High-High"
regions_sf$cluster[is.sig & x < mean(x) & lag_x < mean(x)] <- "Low-Low"
regions_sf$cluster[is.sig & x > mean(x) & lag_x < mean(x)] <- "High-Low"
regions_sf$cluster[is.sig & x < mean(x) & lag_x > mean(x)] <- "Low-High"

# Map the cluster results (set factor levels so colors match the intended labels)
regions_sf$cluster <- factor(regions_sf$cluster,
                             levels = c("High-High", "Low-Low", "High-Low", "Low-High"))
tm_shape(regions_sf) +
  tm_fill(col = "cluster", palette = c("red", "blue", "pink", "lightblue"),
          title = "LISA Cluster")
This code classifies each region into the four categories or non-significant. The resulting map would highlight, say, a red cluster of High-High wealthy regions (e.g., a prosperous metropolitan belt) and a blue cluster of Low-Low regions (e.g., a geographically concentrated poverty pocket), if such patterns exist in the data.
Semivariograms and Kriging (Geostatistics)
The above autocorrelation measures deal mostly with areal data (values attached to regions) or discrete points for cluster detection. In many cases, especially with environmental or demographic continuous variables measured at point locations, we are interested in modeling a continuous spatial field (e.g., rainfall, pollution levels, or economic activity intensity across space). Geostatistics provides tools for this, chief among them the semivariogram and kriging.
A semivariogram (or simply variogram) is a function describing how dissimilar observations become as a function of distance. Formally, the semivariance \(\gamma(h)\) at distance \(h\) is defined as half the average squared difference between values separated by \(h\): \(\gamma(h) = \frac{1}{2}\,\mathrm{Avg}_{d(i,j)=h}\left[(x_i - x_j)^2\right]\). In practice, one computes an empirical variogram by binning pairs of points by their separation distance and computing this average for each distance bin. The variogram typically starts near 0 at small distances (points very close to each other have similar values, so squared differences are small) and rises with distance – reflecting Tobler’s law that similarity decays with distance. It often levels off at a plateau (the sill), beyond which points are essentially uncorrelated. The distance at which the variogram reaches the sill is called the range (beyond the range, spatial autocorrelation is negligible). Sometimes there is a non-zero intercept at distance 0 (the nugget effect), representing measurement error or microscale variation.
The variogram cloud is the raw plot of all pairwise \((d_{ij}, \frac{1}{2}(x_i-x_j)^2)\) points. The sample variogram is derived by averaging those points into bins of distance. We then often fit a theoretical variogram model (such as spherical, exponential, Gaussian models) to the empirical variogram. For instance, a spherical model might be fit with parameters for nugget, sill, and range. Figure 4.1 in the gstat manual (not shown here) provides an example: a variogram model “8 Nugget + 12 Spherical(10)” means a nugget of 8 (semi-variance at lag 0), and a spherical curve that rises to a sill of 8+12=20 at range 10 distance units.
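In gstat, such a model can be written down directly; a small sketch of the “8 Nugget + 12 Spherical(10)” specification from the text:
library(gstat)
# nugget 8, partial sill 12, range 10 (total sill = 8 + 12 = 20)
m <- vgm(psill = 12, model = "Sph", range = 10, nugget = 8)
print(m)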
Example of a sample semivariogram (blue points) with a fitted theoretical model (green line). The semivariance on the y-axis increases with distance on the x-axis, leveling off at the sill. (Adapted from scikit-gstat documentation).
Kriging is a spatial interpolation method that leverages the variogram to provide best linear unbiased predictions (BLUP) of a variable at unsampled locations. Developed originally in mining geology (named after Danie Krige), kriging predicts an unknown value \(Z(s_0)\) at location \(s_0\) as a weighted sum of neighboring observations: \(\hat Z(s_0) = \sum_{i=1}^N \lambda_i Z(s_i)\). The weights \(\lambda_i\) are chosen to minimize prediction variance and achieve unbiasedness, under the covariance structure implied by the variogram. In essence, points closer to \(s_0\) (and with higher covariance with \(s_0\)) get larger weight, and the weights collectively reflect the spatial correlation structure. Under the assumptions of the model, kriging provides the most accurate linear interpolator (in a mean squared error sense), hence the “best linear unbiased prediction” claim. There are variants: ordinary kriging assumes an unknown constant mean and solves for weights accordingly; simple kriging assumes a known mean; universal kriging allows a spatial trend (mean varying with location).
Beyond predictions, kriging also yields a measure of prediction uncertainty (kriging variance) at each location, which depends on the arrangement of nearby data points and the variogram. This is extremely valuable in policy contexts – for example, in environmental pollution mapping, we can identify not only where high pollution is predicted but also where predictions are very uncertain due to sparse data, guiding where to collect more samples.
Example – Kriging in R: The R package gstat provides tools for variogram modeling and kriging. A classic example is the Meuse river dataset (heavy metal concentrations measured at locations along a river floodplain in the Netherlands). Here is a streamlined example using those data:
library(gstat)
library(sf)
data(meuse, package = "sp")       # data frame of sample points
data(meuse.grid, package = "sp")  # prediction grid
# Convert to sf objects (Dutch RD coordinate system, EPSG:28992)
meuse_sf <- st_as_sf(meuse, coords = c("x", "y"), crs = 28992)
grid_sf  <- st_as_sf(meuse.grid, coords = c("x", "y"), crs = 28992)
# Compute the empirical (sample) variogram for zinc concentrations
vgm_emp <- variogram(log(zinc) ~ 1, meuse_sf)  # ~ 1 means a constant mean (no trend)
plot(vgm_emp)                                  # inspect the sample variogram (optional)
vgm_model <- fit.variogram(vgm_emp, model = vgm("Sph"))  # fit a spherical model
print(vgm_model)
Suppose the fitted model is (for illustration) Nugget=0.04, Sill=0.25, Range=900 meters. We can then perform ordinary kriging:
krg <- krige(log(zinc) ~ 1, locations = meuse_sf, newdata = grid_sf, model = vgm_model)
# 'krg' contains a prediction (var1.pred) and a kriging variance (var1.var) for each grid point
The result krg is an sf object in which each point (from meuse.grid) has a predicted log(zinc) value and a prediction variance. We can plot the kriging prediction map and the variance map:
library(tmap)
# krg holds point predictions, so plot them as colored dots
tm_shape(krg) + tm_dots(col = "var1.pred", palette = "YlOrRd", n = 7, title = "Predicted log(Zinc)")
tm_shape(krg) + tm_dots(col = "var1.var", palette = "Purples", n = 7, title = "Kriging variance")
In the context of social science, kriging might be used for spatial interpolation of survey data (e.g., interpolating a socioeconomic indicator between surveyed locations), though caution is needed since many social variables are not realizations of smooth spatial processes in the way environmental variables are. Nonetheless, techniques like kriging have been applied, for example, to interpolate conflict intensity in unobserved areas or to create continuous surfaces of economic activity (e.g., using nightlights as a proxy and kriging the residuals).
Hot Spot Analysis and Cluster Detection Recap
It is worth noting that “hot spot analysis” in different fields can refer to either the point process perspective (detecting clusters in event data using methods like kernel density estimation or scan statistics) or the LISA/Gi* perspective (detecting clusters in aggregated data using spatial autocorrelation statistics). In this chapter, we focused on the latter. If one had individual event data (e.g., conflict event locations from ACLED), one approach is to convert those into counts per region or cell and then apply the methods above (LISA or Gi* on those counts). Another approach (beyond our scope) would be spatial point pattern analysis (using packages like spatstat), which treats event locations as realizations of a point process and can detect clustering without aggregating to predefined zones.
13.3 Spatial Econometrics
Spatial econometrics extends traditional regression modeling to contexts where residuals or outcomes are spatially dependent. Classic econometric models assume observations are independent, but when data come from units on a map (cities, districts, countries), this assumption often fails – neighboring units may influence each other or share unobserved common factors. Spatial econometric models introduce terms to capture these dependencies, improving inference and forecasting in the presence of spatial autocorrelation. In this section, we outline the most common spatial regression specifications – the Spatial Autoregressive (SAR) model, Spatial Error Model (SEM), Spatial Durbin Model (SDM), and briefly mention spatial panel models – and discuss how to implement and interpret them in R.
Spatial Autoregressive (Lag) Model (SAR/SLM)
The Spatial Autoregressive model, often called the spatial lag model (SLM), includes a spatially lagged dependent variable to capture endogenous interaction effects. The basic form for \(N\) observations can be written as:
\(y = \rho \, W y + X\beta + \varepsilon,\)
where \(y\) is the \(N \times 1\) outcome vector, \(X\) the \(N \times k\) matrix of exogenous covariates with coefficients \(\beta\), and \(\varepsilon\) an error term (i.i.d. noise). Here \(W y\) is the spatial lag of \(y\) – essentially a weighted average of neighboring values of the outcome for each observation (with \(W\) being the spatial weights matrix). The scalar parameter \(\rho\) (often denoted \(\rho_{\text{Lag}}\)) measures the strength of spatial dependence in the dependent variable. If \(\rho>0\), a high value in unit j tends to push its neighbors i to also have high values (after controlling for \(X\beta\)), creating spatial clusters in the outcome. If \(\rho<0\), outcomes tend to alternate (neighbors offset each other). The SAR model is thus a spatial autoregression, analogous to a time-series AR(1) model that includes the previous period’s value of the series.
The SAR model implies that \(y = (I - \rho W)^{-1} X\beta + (I - \rho W)^{-1}\varepsilon\). The presence of \((I - \rho W)^{-1}\) (which expands to an infinite series of spatial influence) means that a shock to one unit can propagate to others: an increase in \(x_k\) in region i will have indirect effects on \(y\) in other regions through the spatial lag term. Thus, interpretation of coefficients in SAR involves distinguishing direct effects (impact on the same unit) and spillover effects on neighbors (and neighbors’ neighbors, etc.). Specialized summary measures (impact measures) are often used to quantify total, direct, and indirect effects (LeSage & Pace, 2009).
Spatial Error Model (SEM)
The Spatial Error Model assumes spatial correlation in the error term rather than in the outcome. The typical form is:
\(y = X\beta + u, \qquad u = \lambda \, W u + \varepsilon,\)
where \(u\) is a spatially autocorrelated error (with spatial autoregressive coefficient \(\lambda\), sometimes denoted \(\rho_{\text{Err}}\)). Equivalently, \((I - \lambda W)u = \varepsilon\), meaning the error for each observation is a function of errors in neighboring units. The SEM does not have the feedback loop in \(y\) that SAR has; instead, it says that after accounting for \(X\beta\), any remaining spatial pattern is in the error term. This model is appropriate when omitted variables or shocks have a spatial structure – for example, unobserved regional factors that cause neighboring areas to have similar outcomes.
In practice, a significant Moran’s I on the OLS residuals or a significant Lagrange Multiplier test for spatial error suggests using an SEM. The SEM can be estimated by generalized least squares or maximum likelihood. While \(\beta\) estimates remain unbiased in OLS even if spatial error correlation exists, their standard errors are off; SEM corrects that and provides an estimate of \(\lambda\).
Spatial Durbin and Other Extended Models (SDM, SDEM, SAC)
The Spatial Durbin Model (SDM) extends the SAR by including spatially lagged independent variables as well. The SDM specification is:
\(y = \rho \, W y + X \beta + W X \theta + \varepsilon,\)
where \(W X\) are the neighbor averages of the covariates (each column of \(X\) spatially lagged) with their own coefficients \(\theta\). This allows the effect of a covariate in region j on region i’s outcome to be explicitly modeled, rather than all spillover working through \(y\). The SDM is useful if, for example, one independent variable is a regional policy that could have spillover effects on neighboring regions’ outcomes – the \(W X \theta\) term captures that directly. If \(\theta = 0\), SDM reduces to SAR. If \(\rho = 0\), it reduces to an SLX model (spatially lagged X only).
There is also the Spatial Durbin Error Model (SDEM), which is like an SEM but also includes lagged X terms (setting \(\rho=0\) but allowing \(\theta \neq 0\) and \(\lambda \neq 0\)). Lastly, the SAC (Spatial Autoregressive Combined) or SARAR model includes both a spatial lag in \(y\) and spatial error correlation (two autocorrelation parameters). It’s the most general of these linear specifications, sometimes called the Cliff-Ord model.
Which model to use? Theory should guide whether we expect endogenous interactions (justifying a spatial lag of \(y\)) or merely spatially correlated shocks (error model) or exogenous spillovers (\(WX\) terms). In practice, one might start with OLS, run diagnostic tests (Moran’s I on residuals, LM tests) to see if spatial lag or error is indicated. If both, SAC or SDM may be candidates. A common strategy is to compare SAR vs SEM by Lagrange Multiplier tests (the one with larger robust LM statistic is chosen). The SDM can be tested against SAR or SEM via likelihood ratio tests, and it has the advantage of being flexible; a downside is more parameters. Elhorst (2010) recommends that if SDM’s additional lagged-X terms are not significant, one can simplify to SAR or SEM for parsimony.
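In R, these diagnostics are straightforward; a minimal sketch, assuming a data frame df with outcome Y, predictors X1 and X2, and a weights object listw as in the estimation example below:
library(spdep)
ols <- lm(Y ~ X1 + X2, data = df)   # non-spatial baseline
lm.morantest(ols, listw)            # Moran's I on the OLS residuals
lm.LMtests(ols, listw, test = c("LMlag", "LMerr", "RLMlag", "RLMerr"))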
Spatial Panel Models
When data have both spatial and temporal dimensions (e.g., yearly data on multiple countries or regions), spatial panel models allow incorporating space-time dependence. For example, a panel SAR model can be written \(y_{it} = \rho \,(W y_t)_i + X_{it}\beta + \mu_i + \tau_t + \varepsilon_{it}\), combining a spatial lag with spatial fixed effects \(\mu_i\) and time fixed effects \(\tau_t\). Spatial panels can capture dynamic diffusion processes (e.g., how a shock in one region at time \(t\) affects others in subsequent periods) and account for unobserved heterogeneity. Estimation is more complex (methods include maximum likelihood and the generalized method of moments). The R package splm (Spatial Panel Linear Models) provides functions like spml for fitting these models. Detailing spatial panel mathematics is beyond our scope, but such models are increasingly used for topics like regional growth convergence, where one might test whether growth in one region is influenced by the growth or levels of neighboring regions over time.
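A minimal spatial panel sketch with splm, assuming a hypothetical panel data frame pdat (region-year rows with growth, invest, and educ columns) and a weights object lw:
library(splm)
panel_sar <- spml(growth ~ invest + educ, data = pdat,
                  index = c("region", "year"), listw = lw,
                  model = "within", effect = "twoways",
                  lag = TRUE, spatial.error = "none")  # spatial lag, two-way fixed effects
summary(panel_sar)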
Estimation and Inference in R
The spatialreg package in R (split off from spdep) includes maximum likelihood estimators for SAR, SEM, SDM, SAC, etc. Key functions include lagsarlm() for SAR (spatial lag) models, errorsarlm() for SEM (spatial error) models, and lagsarlm(..., Durbin=TRUE) (formerly type="mixed") for SDM. spatialreg also offers lmSLX() for SLX models and sacsarlm() for SAC models. For example:
library(spdep)        # for nb2listw()
library(spatialreg)
# neighbors list 'nb' and data frame 'df' with outcome 'Y' and predictors 'X1','X2'
listw <- nb2listw(nb, style = "W")
sar_model <- lagsarlm(Y ~ X1 + X2, data = df, listw = listw)                  # SAR
sem_model <- errorsarlm(Y ~ X1 + X2, data = df, listw = listw)                # SEM
sdm_model <- lagsarlm(Y ~ X1 + X2, data = df, listw = listw, Durbin = TRUE)   # SDM
summary(sar_model)
The summary of a spatial regression model will report the spatial parameter (rho or lambda), coefficient estimates, and fit statistics (LogLik, AIC, etc.). One should check the significance of \(\rho\) or \(\lambda\) – if not significant, a non-spatial model may suffice. If using SDM, the output will show both \(\beta\) and \(\theta\) (lagged X) coefficients.
Interpreting coefficients: In SAR and SDM, because of the spatial feedback loops, the raw coefficients are not marginal effects. It is common to compute impact measures with spatialreg::impacts(), which yields the average direct effect, average indirect effect, and total effect for each predictor. For example, a direct effect of X1 might be 0.5 (within-region effect), while the indirect (spillover) effect might be 0.2, indicating that an increase in X1 in other regions also has a positive (though smaller) impact on a region’s outcome via spatial diffusion. The ratio of indirect to direct effects indicates how pronounced spillovers are.
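A sketch of computing impacts with simulation-based inference, using the sar_model and listw objects from above:
library(spatialreg)
# Traces of powers of W speed up impact calculations for larger N
W_tr <- trW(as(listw, "CsparseMatrix"), type = "mult")
imp <- impacts(sar_model, tr = W_tr, R = 500)   # R = number of simulation draws
summary(imp, zstats = TRUE, short = TRUE)       # direct, indirect, total effects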
Spatial econometric models thereby enrich analysis by quantifying these spillovers. For instance, in a study of regional economic growth, a significant \(\rho\) in a SAR model suggests endogenous growth interactions – growth in one region raises growth in neighbors, creating a virtuous (or vicious) cycle spatially. In a spatial error model of policy diffusion, a significant \(\lambda\) could imply that unobserved factors (like regional culture or common institutions) are driving a shared policy adoption pattern in neighboring states.
Model selection and diagnostics: One can compare models using AIC or likelihood ratio tests. It’s important to ensure no residual spatial autocorrelation remains: after fitting a spatial model, it’s good practice to run moran.test(residuals(sar_model), listw=listw) to check that residual autocorrelation is gone. If not, the model may be misspecified (perhaps a higher-order spatial process or missing covariates). Another diagnostic is to check for multicollinearity between \(X\) and \(W X\) in SDM models, which can inflate variance. Centering or standardizing variables can sometimes help interpretability but does not remove the multicollinearity inherent in \(X\) vs. \(W X\).
In summary, spatial econometrics provides a suite of models (SAR, SEM, SDM, etc.) to tackle different kinds of spatial dependence. When applied judiciously, these models correct bias in coefficient estimates (if spatial lag present, OLS is biased; if spatial error, OLS is inefficient) and allow researchers to distinguish between direct effects and spillovers – critical for policy implications. We have, for example, better tools to answer: “If region A increases its education spending, how much of the benefit spills over to neighboring region B’s outcomes?” or “Does a country’s conflict risk increase when its neighbors are in conflict (contagion effect), controlling for its own characteristics?”
13.4 Packages for Spatial Analysis in R
Throughout this chapter, we have utilized various R packages; here we summarize the key ones and their roles:
- sf (Simple Features): Provides an efficient data frame representation for vector spatial data. It handles reading/writing shapefiles, GeoJSON, etc. (via st_read and st_write), supports spatial operations (joins, buffers, intersections), and integrates well with the tidyverse. It is now the go-to package for any vector data in R.
- spdep and spatialreg: Originally one package, now split. spdep focuses on spatial dependence tools – creating neighbors lists (poly2nb, knn2nb for k-nearest neighbors, etc.), weights (nb2listw), and tests for spatial autocorrelation like moran.test, geary.test, and localmoran. spatialreg contains the regression functions for spatial econometric models (SAR, SEM, SDM, etc.) described above. These packages are maintained by Roger Bivand and colleagues and implement many methods described in standard texts.
- terra: A package by Robert Hijmans (successor of raster) for raster data. It reads large rasters and performs raster algebra, extraction, and spatial interpolation. For example, you can use terra to read a GeoTIFF of population density and extract values to points or aggregate up to regions.
- gstat: Provides geostatistical modeling, including variogram computation (variogram) and kriging (krige). It can also do simulation and variogram model fitting. The Meuse example above leveraged gstat for kriging.
- splm (Spatial Panel Linear Models): Tools for spatial panel data econometrics. splm (available on CRAN) implements maximum likelihood and GM (Generalized Moments) estimators for spatial panels. It extends the plm (panel linear models) framework to include spatial effects, supporting fixed or random effects along with SAR or SEM components.
- tmap: The thematic mapping package for R, which makes it easy to create static or interactive maps with spatial data. It supports sf and SpatRaster objects directly. We used tm_shape() and tm_fill() for choropleth maps in the examples. It is an excellent tool for quick visualization of spatial analysis results (such as cluster maps or predicted surfaces).
- tidyverse (dplyr, ggplot2, etc.): Not spatial per se, but essential for data manipulation in conjunction with spatial packages. Since sf objects are also data frames, one can use dplyr verbs to filter or mutate attributes, then use ggplot2 (with geom_sf) or tmap to plot. The tidyverse makes cleaning and joining data (e.g., joining a CSV of indicators to an sf boundary file) much more convenient.
- Other tools: There are many more specialized packages (e.g., rgeoda for an R interface to GeoDa’s methods, spatstat for spatial point pattern analysis, leaflet or mapview for interactive maps, CARBayes for Bayesian spatial models, etc.). However, the ones listed above cover the core functionality needed for most introductory through advanced spatial analyses in the social sciences.
In practice, one often uses these packages in combination. For example, a typical workflow might use sf to read data and handle geometries, spdep to create weights and test for autocorrelation, spatialreg to fit a spatial lag model, and tmap to visualize residuals or random effects spatially. The R ecosystem’s strength is that these packages interoperate using standard classes (sf for vectors, SpatRaster for rasters), allowing a smooth workflow from data import through analysis to visualization.
13.5 Case Studies and Applied Examples
To solidify the concepts, we discuss several brief case studies from geopolitical and economic domains, highlighting how the methods above can be applied in practice. These examples use open-source datasets, illustrating typical analyses in an R Markdown narrative.
Case 1: Regional Economic Inequality and Spatial Clustering
Context: Economic output often concentrates in specific regions (e.g., capital cities, coastal trading hubs) leading to regional disparities. We examine subnational GDP per capita to see if there is clustering of high and low values, and run a spatial regression to test spillover effects in growth.
Data: Suppose we use GDP per capita of European NUTS2 regions (an EU statistical unit) for a certain year, along with subsequent growth rates. We obtain NUTS2 shapefiles (from Eurostat or GADM) and regional GDP data from Eurostat.
Spatial analysis: Mapping GDP per capita reveals a core-periphery pattern – for instance, a band of high-income regions in Western Europe (southern Germany, Austria, northern Italy, etc.) and lower-income clusters in Eastern Europe or parts of the Mediterranean. We compute Moran’s I and find a strongly positive value (e.g., \(I \approx 0.6\), \(p<0.001\)), confirming clustering of similar-income regions. A LISA cluster map might show High-High clusters in the core and Low-Low clusters in peripheral areas (with some transitional outliers like a capital city that is high in a low-income neighborhood).
We then run a SAR model for GDP growth (say over the next decade) as a function of initial GDP and other covariates (education, investment rates, etc.). The SAR model can test the convergence hypothesis accounting for spatial spillovers. We find \(\rho\) is significantly positive – indicating that growth in a region is positively influenced by growth of its neighbors. In other words, there are spatial spillover effects in regional economic growth. Perhaps a rich region surrounded by other rich regions tends to grow faster (maybe due to knowledge diffusion or infrastructure connections), whereas an isolated poor region struggles even if it has the same initial conditions.
From a policy angle, this suggests that tackling regional inequality may require coordinated development in clusters of lagging regions, rather than one region at a time, because neighbors influence each other. It also suggests that a prosperous cluster can have resilience – a concept seen in the literature on regional club convergence.
Case 2: Spatial Distribution of Refugee Populations (UNHCR Data)
Context: Refugee movements are geopolitically important and highly spatial – people often flee across the nearest borders. We analyze the distribution of refugees by host country and the burden-sharing across regions.
Data: Using UNHCR’s refugee statistics for a recent year (e.g., 2024), we have the number of refugees hosted by each country. We join this with a world countries shapefile (from Natural Earth or GADM).
Spatial analysis: A choropleth map of refugee counts (or per capita refugees) shows clear spatial patterns. For example, countries adjacent to conflict zones host the largest numbers – e.g., Turkey, Iran, and Pakistan for Afghan and Syrian refugees; Uganda and neighbors for South Sudanese refugees. We calculate Moran’s I on (log-transformed) refugee counts per capita and find significant positive autocorrelation – refugee burdens are spatially clustered rather than randomly dispersed globally. In fact, UNHCR reports that 67% of refugees are in countries neighboring their origin country, which our analysis confirms.
We can perform a hot spot analysis (Getis-Ord Gi*): this identifies hot spots such as the Middle East (Syria’s neighbors), East Africa, and perhaps parts of Europe (if considering the 2015–2016 influx) as significant clusters of high refugee presence. Cold spots appear in regions like South America or East Asia, far from major conflict sources.
For a simple spatial regression, we might examine what factors correlate with a country hosting more refugees. Using a spatial lag model, the dependent variable is the number of refugees (per capita), with predictors such as the country’s GDP, population, and conflict in neighbors, plus a spatial lag of refugees. We might find \(\rho\) is positive, reinforcing that if neighbors host many refugees, a country likely does too (due to proximity to crises). Covariates like GDP per capita might have a negative sign (richer countries, perhaps surprisingly, host fewer refugees per capita, as many refugees remain in adjacent developing countries). Such findings echo the reality that low- and middle-income countries host 73% of the world’s refugees, and geography heavily dictates refugee flows.
This case illustrates using spatial data (country polygons), global statistics (Moran’s I), and local clusters (Gi*) to understand a policy issue. The R Markdown output could include a world map figure and cited statistics from UNHCR to provide context, all within a reproducible document.
Case 3: Conflict Diffusion and Clustering (PRIO/ACLED)
Context: Conflicts often cluster regionally and diffuse across borders (through refugee flows, rebel sanctuaries, or contagion of instability). We analyze whether conflict incidence shows spatial autocorrelation and attempt to model conflict onset with spatial lags.
Data: We might take country-year data from the Uppsala Conflict Data Program (UCDP) indicating which countries have civil conflict in a given year. Or use PRIO’s grid dataset that has conflict events count in each 0.5° grid cell. For illustration, consider country-level data for a specific year where we mark conflict presence (1/0).
Spatial analysis: Using a world neighbors definition (countries sharing a border), we calculate the Join Count Statistic for binary conflict incidence (an alternative to Moran’s I for binary data). This reveals whether having a conflict increases the likelihood of neighbors having conflict. We find significantly more HH joins (neighboring pair both with conflict) than expected under randomness, indicating clustering of conflicts (e.g., in West Africa or the Middle East). A LISA map (using Local Moran’s I on conflict=1/0 interpreted as a rate) might highlight clusters like the Sahel where multiple adjacent countries face conflict (HH), versus peaceful clusters (LL) in other regions, and a few outliers (perhaps a lone conflict in an otherwise peaceful region).
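A minimal join count sketch, assuming a hypothetical world_sf layer with a 0/1 conflict indicator and a contiguity neighbors list nb:
library(spdep)
conflict_f <- factor(world_sf$conflict, labels = c("no", "yes"))  # binary -> factor
joincount.test(conflict_f, listw = nb2listw(nb, style = "B"))     # binary weights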
We then estimate a spatially lagged logistic regression (which could be done via specialized packages or by creating a spatial lag variable) to predict conflict onset. The spatial lag of conflict (i.e., proportion of neighbors in conflict) is a strong positive predictor, even controlling for factors like poverty or governance. This suggests a contagion effect: conflicts tend to diffuse regionally. Substantively, this implies regional security cooperation is vital – conflict prevention in one country has positive externalities for its neighbors. The spatial model quantifies these externalities.
Alternatively, using ACLED point data, we could perform a kernel density estimate to visualize “hotspots” of conflict events, or aggregate events to a grid and run a hot spot analysis. Either approach would show, for example, a high-intensity cluster of events in eastern DR Congo spreading around the Great Lakes region, or in the Syria-Iraq region during the peak of the conflict against ISIS.
This case demonstrates how spatial statistics can be applied to binary/political event data and how spatial econometric thinking (spatial lags) helps in modeling diffusion processes in international relations.
Case 4: Trade Flow Spillovers
Context: International trade volumes between countries might exhibit spatial patterns – e.g., countries in the same region often have interrelated trade (through regional trade agreements or shared infrastructure). We consider whether a country’s trade openness (trade/GDP) is influenced by its neighbors’ openness.
Data: Using World Bank data on trade as % of GDP for all countries, and perhaps a contiguity-based neighbors definition, we examine spatial autocorrelation and run spatial regression.
Spatial analysis: Mapping trade/GDP shows, for example, high values clustering in Europe (high intra-EU trade) and low values clustering in some landlocked or isolated regions. Moran’s I could be moderately positive. We then run a spatial lag model: outcome = trade/GDP, predictors = GDP per capita, population, and spatial lag of trade/GDP. If the spatial lag is significant, it suggests spatial spillovers in trade – perhaps due to regional integration policies or network effects (countries surrounded by trading nations trade more themselves). Indeed, researchers have applied spatial models to gravity trade equations. One study (Zofío et al. 2025) found that ignoring spatial dependence in regional trade flows can bias elasticity estimates, and their spatial autoregressive gravity model captured spillovers of trade between neighboring regions. For example, improvements in trade infrastructure in one country can increase not only its trade but also its neighbors’ trade (through logistic hubs, etc.), a spatial externality.
This example underscores how spatial econometrics can enrich even classic economic models (the gravity model of trade) by accounting for spatial interactions. In an R Markdown, one could present the regression table and then interpret the spatial coefficient in terms of a multiplier effect: “a 1% increase in neighbors’ trade openness is associated with a 0.X% increase in a country’s own trade/GDP, holding other factors constant,” indicating regional clustering of globalization.
13.6 Modeling Workflow: Weights, Diagnostics, and Interpretation
To effectively conduct a spatial analysis or econometric modeling, it’s useful to follow a systematic workflow:
1. Define the Spatial Weights Matrix (W): Decide how to quantify “neighborliness.” Common options:
- Contiguity: Neighbors if sharing a border (for polygons). Use poly2nb (queen or rook adjacency) for regions. This is appropriate for countries or districts.
- Distance bands: All pairs within a certain distance are neighbors. Good for point data or when contiguity is too irregular. Use dnearneigh.
- K-nearest neighbors: Each unit has exactly k neighbors (useful to ensure every unit has some neighbors, especially if some are isolated). Use knn2nb on a knearneigh result.
- Economic or network weights: Sometimes weights based on trade volumes or some connectivity measure might be used (beyond this chapter’s scope).
After choosing neighbors (an nb object), convert to a weights list (typically row-standardized) with nb2listw. It’s crucial to inspect the weights – e.g., ensure no region is an isolate with zero neighbors (if so, one might give it a self-neighbor or use k-nearest to force at least one neighbor). A compact weights-construction sketch appears at the end of this section.
2. Exploratory Spatial Data Analysis (ESDA): With W defined, calculate global Moran’s I (moran.test) for key variables to see if spatial autocorrelation is present. Plot maps and Moran scatterplots to visually diagnose patterns. If no significant autocorrelation, spatial models might not be necessary. If strong autocorrelation is present, note the magnitude and pattern (clusters/outliers).
3. Local analysis (if needed): Compute LISA (localmoran) or Gi* (localG in spdep) to pinpoint where clustering occurs. This could inform model specification – e.g., if clusters align with some known regional grouping, maybe include a regional dummy or think about spatial regimes (different models in different areas).
4. Model specification: Decide on SAR vs. SEM vs. SDM based on theory and diagnostics. A common approach is:
- Run an OLS regression of your outcome on covariates.
- Do a Moran’s I test on the OLS residuals (lm.morantest in spdep).
- Run LM tests (lm.LMtests), which provide LM-lag and LM-error statistics. If LM-lag is significant and larger than LM-error, a spatial lag model is indicated; if vice versa, a spatial error model might be better. Robust versions of these tests account for each other.
- You might also consider SEM if theory suggests omitted spatial factors, or SDM if you suspect omitted spatially lagged covariates.
5. Estimation: Fit the chosen spatial model with spatialreg functions. If unsure, one can start with an SDM (the more general model) and test whether simplifications are valid (e.g., if the \(\theta\) are all near 0, it reduces to SAR; an LR test can compare SDM vs. SAR). Check coefficient signs and significance, especially the spatial parameter. If \(\rho\) or \(\lambda\) is near 0 and not significant, the spatial aspect might not be important (or W was mis-specified).
6. Model diagnostics: After fitting a spatial model, obtain the residuals and run moran.test(residuals(model), listw) to ensure residual autocorrelation is eliminated (it should be insignificant). Check pseudo-\(R^2\) or AIC for improvement over OLS. Ensure the model converged (spatialreg uses ML, which usually converges quickly unless \(\rho\) is near the boundary of its parameter space).
7. Interpretation of results: For SAR/SDM, compute impact measures (the impacts function on the fitted model) to interpret effects. For example, if a SAR model of crime rates yields a direct effect of poverty of 0.5 and an indirect effect of 0.2, you interpret that increasing poverty in a given city significantly increases its own crime (direct) and also has a smaller but non-negligible effect on neighboring cities’ crime (indirect). For SEM, interpretation rests mostly on \(\beta\) (as in OLS), but note that \(\lambda\) indicates how strongly shocks propagate – a high \(\lambda\) (say 0.7) means unobserved factors create substantial neighbor similarity.
8. Presentation: In an R Markdown document, one can present maps of residuals (ideally showing no remaining spatial pattern) or include a table of results (packages like stargazer or sjPlot can format spatialreg outputs, or one can manually compile a table including \(\rho\)). Visualizing spillovers is also insightful – e.g., plot the cumulative impact of a hypothetical policy across the map (if one region implements it, how do others benefit?).
Following this workflow ensures a rigorous approach: first understand the spatial characteristics of data, then appropriately incorporate them into modeling, and finally verify that the model has addressed the spatial structure.
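To make step 1 concrete, here is a minimal weights-construction sketch, assuming an sf polygon layer regions_sf:
library(spdep)
nb <- poly2nb(regions_sf, queen = TRUE)  # queen contiguity: shared border or vertex
summary(nb)                              # inspect link counts; watch for zero-neighbor units
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)  # row-standardized weights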
13.7 Best Practices for Reproducible Workflows using R Markdown
Reproducibility is a cornerstone of modern data science, and R Markdown is a powerful tool for combining narrative, code, and output in one document. When conducting geospatial analysis and econometrics, a few best practices help ensure the workflow is transparent and repeatable:
Document Data Sources and Preparation: Clearly state where spatial data and statistical data come from (with citations or URLs). In R Markdown, you might include a chunk that downloads data (if feasible) or at least a description in text. For example, note “Shapefiles from GADM v4.1, refugee data from UNHCR (retrieved via their API on DATE)”. This ensures someone else can obtain the same inputs.
Use Code Chunks Effectively: Show the key steps of analysis in code chunks. For instance, reading data, computing Moran’s I, and regression outputs can all be in separate chunks with relevant comments. Set chunk options like
echo=TRUE
to show code, and perhapsmessage=FALSE, warning=FALSE
to hide verbose messages (but don’t hide important messages). If a chunk takes a long time (like fitting a very heavy model or reading a huge file), consider caching it (cache=TRUE
) or precomputing results and loading them.Results as Tables/Figures: Rather than copying numbers into text (which can introduce errors), let R output results directly. For example, use
r round(moran_obs$i,3)
inline to report a Moran’s I statistic computed in a prior chunk. Use packages likekableExtra
orgt
to format tables of regression results in a reader-friendly way. For maps or plots, code chunks can directly produce them; ensure to label them (with fig.cap) so you can reference them (e.g., “Figure 1 shows…”) using the R Markdown cross-referencing.Version Control and Seeds: For simulations or procedures involving randomness (e.g., permutation tests for Moran’s I or Monte Carlo simulations), set a random seed (
set.seed(...)
) for reproducibility. Use version control (git) to track changes in your R Markdown and note package versions perhaps by includingsessionInfo()
or usingrenv
to record exact package versions. Spatial data often changes (e.g., GADM updates boundaries), so archiving the data or noting the version is good practice.Combining Analysis and Narrative: R Markdown encourages explaining each step. For instance, after a code chunk computing a spatial weights matrix, write a sentence interpreting it: “We construct a queen contiguity matrix, meaning each region is neighbor to those sharing a border or vertex. Region X has 5 neighbors, while region Y (an island) was given a single nearest neighbor to avoid isolation.” This contextual information is valuable for readers.
Visualization: For geospatial work, maps are key. Use `tmap` in R Markdown in either static mode (`tmap_mode("plot")`) for print or interactive mode (`tmap_mode("view")`) for HTML output; a short sketch appears after this list. Ensure maps have clear legends, scale bars, and source credits where needed, and keep color schemes colorblind-friendly if possible.

Workflow Organization: Spatial data processing can be heavy. Organize your R Markdown with sections (using `##` and `###` headings) corresponding to this chapter’s sections: data loading, exploratory mapping, statistical analysis, model results, and so on. This makes the document easier to navigate. Use caching wisely to avoid re-running slow steps on every knit.

Troubleshooting in Document: It can be helpful to include small diagnostic outputs in the document. For example, after reading a shapefile, print `st_crs(shp)` to confirm the coordinate reference system (CRS). After a join, run a quick `summary(shp$variable)` to check for implausible values. Including these in the narrative shows the conscientious steps taken.
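As a minimal `tmap` sketch (using tmap 3.x syntax and assuming an sf object `regions` with a numeric `gdp_pc` column), switching between static and interactive modes looks like this:

```r
library(tmap)

tmap_mode("plot")                 # static output for print or PDF
tm_shape(regions) +
  tm_polygons("gdp_pc", title = "GDP per capita") +
  tm_scale_bar() +
  tm_layout(legend.outside = TRUE)

tmap_mode("view")                 # interactive mode for HTML output
tmap_last()                       # re-render the last map interactively
```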
By weaving together prose, code, and evidence (figures/tables), an R Markdown report on spatial econometrics becomes a self-contained chapter or paper that others can follow step-by-step. This is particularly important given the complexity of spatial analyses – there are many decisions (weight matrix, model type, etc.) that should be justified and recorded. R Markdown helps avoid the common pitfall of an analyst manually adjusting things and forgetting exactly what they did; instead, the “paper” is the code. Many academic journals and policy organizations appreciate this transparency, and it aligns with the push towards open science.
Finally, when sharing or publishing, be mindful of file sizes – large shapefiles or rasters might need to be hosted externally, e.g., in a GitHub repository, with code that downloads them on the fly and caches the result (as sketched below). Also knit to HTML or PDF and check that all figures rendered correctly and that any interactive elements (in HTML output) work as intended.
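A simple pattern – sketched here with a hypothetical URL and file path – is to download only when the local copy is missing:

```r
data_url  <- "https://example.org/boundaries.gpkg"  # hypothetical source
data_file <- "data/boundaries.gpkg"

if (!file.exists(data_file)) {
  dir.create("data", showWarnings = FALSE)
  download.file(data_url, data_file, mode = "wb")   # binary mode for geodata
}
regions <- sf::st_read(data_file, quiet = TRUE)
```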
13.8 Troubleshooting Common Issues in Geospatial Econometric Workflows
Working with spatial data and models can introduce unique challenges. Here are some common issues and tips to resolve them:
Coordinate Reference System (CRS) Mismatch: A frequent headache is combining spatial layers with different projections. If points and polygons don’t align, or distance-based weights seem off by orders of magnitude, suspect a CRS issue. Solution: always check `st_crs()` for each spatial layer and use `st_transform()` to bring everything to a common CRS (often a projected system when doing distance calculations – e.g., UTM or an equal-area projection). As a rule, for distance-based analysis (distance bands, k-nearest neighbors in kilometers), use a projected CRS in meters. Geographic (latitude–longitude) coordinates lead to incorrect distance weighting (degrees are not constant distances) and cause problems for functions expecting planar coordinates; a combined sketch follows the next two items.

Islands or No-Neighbor Units: With contiguity weights, a unit may have no neighbors (e.g., an island country with no land borders). This can cause functions like `nb2listw` to fail, or make Moran’s I undefined (division by zero). Solution: manually assign nearest neighbors to isolates (perhaps based on a distance threshold), or use a k-nearest approach for all units to guarantee connectivity. `spdep::nb2listw` has an argument `zero.policy=TRUE` that handles zero-neighbor cases by assigning zero weights, but this effectively removes the unit from spatial influence, which may not be ideal. Better to ensure every unit has at least one neighbor in the definition of W (even if that neighbor is somewhat arbitrary).

Edge Effects in Variograms/Kriging: When interpolating (kriging), points at the edge of the study area have fewer neighbors and hence higher kriging variance. This is inherent: there is less information at the edges. One mitigation is to include a buffer area (points slightly outside the area of interest, if available); otherwise, interpret edge predictions cautiously. For semivariograms, if the variogram keeps rising at the largest distances (no clear sill), this may indicate a trend in the data or that the study area is too small to capture the full range. Consider de-trending (regressing out a spatial trend and computing the variogram on residuals) or fitting a variogram with a large range (effectively a trend).
Model Convergence or Identification: In spatial econometric models, especially more complex ones (SDM, SAC, or spatial panels), you can run into convergence problems or weak identification (e.g., SAR and SEM effects that are hard to distinguish). If a model does not converge, try different starting values or make sure the variables are scaled reasonably (very different scales can hamper likelihood optimization). If standard errors are huge or signs look odd, multicollinearity may be the culprit – in an SDM, for instance, check the correlation between \(X\) and \(WX\) (a quick check is sketched below). In some cases a simpler specification (SAR or SEM alone) is more stable than estimating everything at once.
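A quick collinearity check between a covariate and its spatial lag – `regions$gdp_pc` and the weights object `lw` are assumptions carried over from earlier sketches:

```r
library(spdep)
wx <- lag.listw(lw, regions$gdp_pc)   # spatial lag, i.e., W %*% x
cor(regions$gdp_pc, wx)               # very high correlation signals weak identification
```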
Memory and Performance: Spatial datasets can be large (fine-grained rasters, shapefiles with many polygons), so reading them can be slow and memory-hungry. Tips: use `terra`, which is memory-efficient for rasters and loads data lazily. For very large vector data, consider simplifying geometry (with `st_simplify`) if high-precision boundaries are not needed for the analysis; a sketch follows the next item. When creating spatial weights for thousands of units, the weights matrix can be huge; `spdep` stores weights in sparse form, which helps, but operations may still be slow. Limiting neighbors via a distance band or a smaller k can make computation feasible. For many simulations or permutations, prefer vectorized code or R’s parallel tooling (`parallel` or `future.apply`) to speed things up, but be cautious with random number seeds.

MAUP (Modifiable Areal Unit Problem): This one is conceptual – results can depend on the level of spatial aggregation. If states and counties give different conclusions, that is the MAUP at work. There is no easy fix beyond awareness: run analyses at multiple scales where possible. For example, a hot spot at the county level might be hidden at the state level by averaging. If policy insights require finer granularity, use it; otherwise, justify the choice of spatial unit. Using point data (when available) sidesteps the MAUP by avoiding aggregation, but then point pattern methods are needed.
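A small performance sketch – the raster file name, aggregation factor, and tolerance are hypothetical – illustrating lazy raster handling with `terra` and geometry simplification with `sf`:

```r
library(terra)
library(sf)

r <- rast("population.tif")                       # terra reads lazily; pixels load on demand
r_coarse <- aggregate(r, fact = 10, fun = "sum")  # coarsen before heavy analysis

regions_simple <- st_simplify(regions, dTolerance = 1000)  # tolerance in CRS units
```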
Interpretation Pitfalls: Newcomers sometimes misread spatial model output. A common mistake is to treat \(\rho\) as comparable to an R-squared, or to interpret \(\beta\) in a SAR model directly, as if it were an OLS coefficient. To check your understanding, simulate: generate fake data with a known spatial process and see whether the model recovers it. Also use the `impacts` function for SAR/SDM models to obtain the direct and indirect effects – it clarifies how to talk about results. If a coefficient is significant but its impact is mostly indirect, you would explain that much of the effect operates via neighbors rather than within the unit itself.

Software Nuances: `spatialreg` and `spdep` have evolved. Ensure you use the current function locations (older documentation may place `lagsarlm` in `spdep`, but it now lives in `spatialreg`). If you encounter errors about “neighbors of long vectors” or similar, it often indicates an indexing issue or a bug with very large matrices – check CRAN for updates, as such issues get fixed. The community (e.g., StackOverflow or the R-sig-Geo mailing list) is helpful if you provide a reproducible example of the error.
When troubleshooting, isolating the step that causes the issue is key. If a Moran’s I test returns NaN, inspect the weights and data for missing values or zero variance (a few quick checks are sketched below). If a model won’t run, try a simpler model on a subset of the data. Progressive building and testing at each step (as shown in the workflow section) will catch many issues early – for example, discovering that a join dropped some regions because of name mismatches (leaving NA values), which could later crash a model.
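A few quick diagnostics for the NaN case – `regions`, `nb`, and the variable name are assumptions carried over from earlier sketches:

```r
# Checks when a Moran's I test returns NaN
sum(is.na(regions$gdp_pc))   # missing values propagate into the statistic
var(regions$gdp_pc)          # zero variance collapses the denominator
table(card(nb))              # zero-neighbor units (isolates) cause division by zero
```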
In summary, geospatial workflows add layers of complexity (pun intended) but with careful checking of spatial data integrity, thoughtful choice of methods, and iterative validation of results, these challenges are manageable. Modern R tools and an active user community have greatly smoothed the process, making it an exciting time to incorporate spatial perspectives in social science research.
13.9 Conclusion: Implications for Policy and Spatial Thinking in Economics and Geopolitics
Incorporating geospatial statistics and econometrics into social science research fundamentally enriches our understanding of economic and political phenomena. Space is not just a backdrop where events unfold; it actively shapes outcomes via spillovers, diffusion, and context. The case studies we explored demonstrate concrete implications. Regional inequality analysis shows that growth or decline does not happen in isolation – regions form interconnected webs, so regional policies (e.g., EU structural funds) can leverage positive spillovers if coordinated geographically. Refugee distribution analysis highlights how burdens are unevenly shared and clustered near conflict zones, suggesting that international support for host countries in conflict neighborhoods is both morally and practically necessary (their stability affects a whole region). Conflict diffusion results underscore that peace and conflict have a spatial domino aspect – peacebuilding in one country may contribute to stability in its neighbors, whereas unrest can seep across borders – implying that diplomatic and security efforts must be regional, not just national. Trade flow spillovers suggest that countries benefit from being surrounded by open economies, providing a rationale for regional trade agreements and cross-border infrastructure: a nation’s development prospects can be lifted by a prosperous neighborhood (or stifled by a troubled one).
For policymakers, spatial thinking means moving beyond “one-size-fits-all” and recognizing heterogeneous spatial clusters. For example, development programs might target specific clusters of underdevelopment (the LL cold spots in a LISA map) with integrated interventions, rather than spreading resources thinly. In conflict prevention, it means monitoring not only internal risk factors but also regional conflict neighborhoods (a concept supported by spatial econometric findings that neighbors’ conflict is a significant risk factor). In public health, detecting disease clusters quickly can guide localized containment.
Methodologically, the rise of spatial data (e.g., high-resolution satellite imagery, geo-referenced survey data) and open-source tools (like the R packages used here) has lowered the barriers for analysts to include geography in their models. An economist or political scientist armed with these tools can avoid mis-specifications that treat correlated observations as independent, and thus draw more valid inferences. For instance, ignoring spatial autocorrelation in a regression can lead to underestimated standard errors and overstated significance of some policies. Spatial diagnostics help reveal such issues so that the researcher can choose an appropriate spatial model and obtain correct inference.
Looking ahead, spatial econometric models are evolving (e.g., handling network weights beyond simple geography, combining spatial with temporal dynamics in sophisticated ways). But the core insight remains Tobler’s Law: context matters. We encourage readers to integrate the approaches from this chapter in their own research – whether it’s mapping the data first to see patterns, formally testing for spatial autocorrelation, or estimating a spatial lag model to capture neighborhood effects. By doing so, one often uncovers richer stories in the data. As the saying goes in spatial analysis, “location, location, location” – where things happen influences why they happen.
In conclusion, geospatial analysis in R provides a potent framework for social scientists to quantitatively assess the role of space. It bridges the gap between qualitative geographic insights and formal statistical modeling. The applications in economics, political science, sociology, and beyond are vast – from studying how policies diffuse across states, to understanding urban segregation patterns, to evaluating regional impacts of climate change. The knowledge and tools outlined in this chapter equip researchers to not only avoid pitfalls (like ignored spatial dependence) but also to ask new questions that explicitly involve spatial relationships. This spatial perspective ultimately leads to conclusions and policy recommendations that are more nuanced and grounded in the real-world interconnectivity of places and people. By embracing spatial thinking and methods, quantitative social science can more fully capture the complexities of the world it seeks to explain.
References
Anselin, L. (1995). Local indicators of spatial association—LISA. Geographical Analysis, 27(2), 93–115.
Bivand, R. S., Pebesma, E. J., & Gómez‑Rubio, V. (2013). Applied spatial data analysis with R (2nd ed.). Springer.
Center for International Earth Science Information Network (CIESIN). (2018). Gridded population of the world, version 4 (GPWv4): Population count [Data set]. NASA SEDAC.
Cheng, J., Karambelkar, B., & Xie, Y. (2022). leaflet: Create interactive web maps with the JavaScript Leaflet library (Version 2.1.1) [Computer software].
Cressie, N. (1993). Statistics for spatial data (Rev. ed.). Wiley.
Elhorst, J. P. (2010). Spatial panel data models. In M. M. Fischer & A. Getis (Eds.), Handbook of applied spatial analysis (pp. 377–407). Springer.
GADM. (2022). GADM database of global administrative areas (Version 4.1) [Data set].
Geary, R. C. (1954). The contiguity ratio and statistical mapping. The Incorporated Statistician, 5(3), 115–146.
Getis, A., & Ord, J. K. (1992). The analysis of spatial association by use of distance statistics. Geographical Analysis, 24(3), 189–206.
Hijmans, R. J. (2022). terra: Spatial data analysis (Version 1.5) [Computer software].
Kelso, N. V., & Patterson, T. (2009). Natural Earth vector. Cartographic Perspectives, 64, 45–50.
LeSage, J. P., & Pace, R. K. (2009). Introduction to spatial econometrics. CRC Press.
Millo, G., & Piras, G. (2012). splm: Spatial panel data models in R. Journal of Statistical Software, 47(1), 1–38.
Moran, P. A. P. (1950). Notes on continuous stochastic phenomena. Biometrika, 37(1/2), 17–23.
Pebesma, E. J. (2004). Multivariable geostatistics in S: The gstat package. Computers & Geosciences, 30(7), 683–691.
Pebesma, E. (2018). Simple features for R: Standardized support for spatial vector data. The R Journal, 10(1), 439–446.
Raleigh, C., Linke, A., Hegre, H., & Karlsen, J. (2010). Introducing ACLED: An armed conflict location and event dataset. Journal of Peace Research, 47(5), 651–660.
Tatem, A. J. (2017). WorldPop, open data for spatial demography. Scientific Data, 4, 170004.
Tennekes, M. (2018). tmap: Thematic maps in R. Journal of Statistical Software, 84(6), 1–39.
Tobler, W. R. (1970). A computer movie simulating urban growth in the Detroit region. Economic Geography, 46(sup1), 234–240.
United Nations High Commissioner for Refugees. (2023). Refugee Data Finder [Database].
United Nations High Commissioner for Refugees. (n.d.). Operational Data Portal: Refugee situations [Data repository]. Retrieved July 15, 2025.
World Bank. (2024). World Development Indicators [Data set].