8 Geospatial Data Analysis
Geospatial data analysis involves harnessing spatial information to explore, model, and interpret spatial relationships and phenomena across geographic spaces. It plays a pivotal role in diverse fields such as urban planning, environmental management, public health, transportation, and economic geography. Through rigorous analytical methods, practitioners can uncover hidden patterns, detect clusters, analyze spatial interactions, and develop predictive models. This chapter provides an extensive overview of essential analytical techniques in geospatial data analysis using R and Python, empowering analysts to extract meaningful insights from complex spatial data effectively.
8.1 Exploratory Spatial Data Analysis (ESDA)
Exploratory Spatial Data Analysis (ESDA) is the initial phase of spatial analysis, dedicated to visualizing spatial distributions, detecting spatial clusters or outliers, and formulating hypotheses about spatial relationships. ESDA is defined as a collection of techniques for describing and visualizing spatial distributions, identifying atypical locations (spatial outliers), uncovering spatial association patterns (clusters), and revealing spatial heterogeneity. In practice, ESDA helps analysts understand underlying spatial patterns and gather insights before applying more sophisticated statistical modeling.
Spatial Distribution
Examining the spatial distribution of features is critical for identifying broad patterns such as clustering or dispersion in the data. Simple mapping of point or polygon data can immediately suggest whether features are randomly distributed or form clusters in certain areas. By plotting the geometry of spatial features, analysts can visually inspect if points tend to concentrate (indicating possible hotspots) or spread evenly (suggesting spatial dispersion).
Example in R:
library(sf)
# Load spatial data (point shapefile)
points <- st_read("data/points.shp")
# Visualize spatial distribution on a simple plot
plot(st_geometry(points), pch=20, col="blue", main="Spatial Distribution of Points")
Example in Python:
import geopandas as gpd
import matplotlib.pyplot as plt
# Load spatial data (point shapefile)
points = gpd.read_file("data/points.shp")

# Plot spatial distribution of points
points.plot(marker='o', color='blue', markersize=5, figsize=(8, 6))
plt.title("Spatial Distribution of Points")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()
These basic plots allow an exploratory look at how points are spread out geographically. Clusters of points in the plot may indicate areas of higher intensity or activity, which can warrant further analysis.
Kernel Density Estimation (KDE)
Kernel Density Estimation (KDE) is a technique to visualize and quantify the density of spatial point occurrences, providing insights into the intensity of spatial phenomena such as crime incidents, disease cases, or resource locations. Conceptually, KDE creates a smooth surface over the study area by centering a kernel (a smoothly peaked function) over each point and summing their contributions. The surface value is highest at the location of each point and diminishes with increasing distance, reaching zero at the kernel’s search radius. The result is a continuous raster where higher values indicate areas with greater concentration of points (potential “hotspots”). This approach is widely used to identify patterns like crime hot spots or disease outbreak centers.
Example in R:
library(spatstat)
# Convert sf points to spatstat point pattern (assuming 'points' from above has coordinates)
coords <- st_coordinates(points)
pp <- ppp(x = coords[,1], y = coords[,2],
          window = owin(xrange = range(coords[,1]), yrange = range(coords[,2])))

# Compute KDE for the point pattern
kde <- density(pp, sigma = 0.01)  # using a bandwidth (sigma) of 0.01, for example
# Visualize the KDE raster
plot(kde, main="Kernel Density Estimation")
points(coords, pch=20, cex=0.5, col='red') # overlay original points for reference
Example in Python:
import numpy as np
from sklearn.neighbors import KernelDensity
import geopandas as gpd
import matplotlib.pyplot as plt
# Load points as GeoDataFrame (if not already loaded)
points = gpd.read_file("data/points.shp")
coords = np.vstack([points.geometry.x, points.geometry.y]).T

# Fit a KDE model on the point coordinates
kde_model = KernelDensity(bandwidth=0.01).fit(coords)
log_densities = kde_model.score_samples(coords)
densities = np.exp(log_densities)  # convert log density to density values

# Plot points colored by their estimated density
fig, ax = plt.subplots(figsize=(8, 6))
points.plot(ax=ax, markersize=20, column=densities, cmap='Reds', legend=True)
plt.title("Kernel Density Estimation")
plt.show()
In these examples, a bandwidth (smoothing parameter) must be chosen (here 0.01 in coordinate units). The output is a density estimate: areas where points are tightly clustered will show higher density values (darker red in the Python plot), whereas sparsely populated areas show low density. By applying KDE, analysts can quantitatively identify where points concentrate, which is useful for hotspot analysis or resource allocation planning.
8.2 Spatial Autocorrelation
Spatial autocorrelation measures the degree to which similar values occur near each other in space. It quantifies whether high or low values of a variable tend to cluster spatially (positive autocorrelation), or whether dissimilar values tend to be neighbors, with high values surrounded by low values and vice versa (negative autocorrelation). Evaluating spatial autocorrelation helps analysts identify underlying spatial processes and dependencies: whether the spatial arrangement of values is random or exhibits systematic patterning.
Global Spatial Autocorrelation
Global spatial autocorrelation statistics summarize the overall clustering pattern across the entire study region. The most commonly used measure is Moran’s I, which evaluates whether feature values are more spatially clustered or dispersed than would be expected under a random spatial distribution. In essence, Moran’s I is a correlation coefficient for spatial data: it compares each feature’s value with those of its neighbors, yielding a value that is usually between –1 and +1 (though not strictly bounded). A significantly positive Moran’s I indicates that similar values (either high or low) cluster together in space more than random chance would dictate, whereas a significantly negative Moran’s I suggests a checkerboard pattern where high values are near low values (spatial dispersion). A value close to zero implies spatial randomness (no meaningful spatial clustering).
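For reference, with $n$ observations $x_i$, mean $\bar{x}$, and spatial weights $w_{ij}$, the global Moran's I statistic is commonly written as

$$I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^2}.$$

It is essentially a spatially weighted cross-product of deviations from the mean, normalized by the variance; with row-standardized weights, $\sum_{i}\sum_{j} w_{ij} = n$ and the leading factor equals one.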
Example in R (Moran’s I using spdep):
library(spdep)
# Assume 'points' is a spatial points data frame with an attribute 'variable'
# Create spatial neighbors (queen contiguity requires polygons; for point data we use distance-based neighbors)
coords <- st_coordinates(points)
neighbors <- dnearneigh(coords, d1 = 0, d2 = 1)  # neighbors within distance 1 (example)
weights <- nb2listw(neighbors, style = "W")      # spatial weights list

# Compute Moran's I for 'variable'
morans_test <- moran.test(points$variable, weights)
print(morans_test)
Example in Python (Moran’s I using PySAL):
import libpysal
from esda.moran import Moran
# Create spatial weights (e.g., Queen contiguity for polygons or K-nearest for points)
weights = libpysal.weights.KNN.from_dataframe(points, k=5)  # 5 nearest neighbors example
weights.transform = 'R'  # row-standardize weights

# Calculate Moran's I for the attribute 'variable'
moran = Moran(points['variable'], weights)
print(f"Moran's I: {moran.I:.3f}, p-value: {moran.p_norm:.5f}")
In these snippets, we first define a spatial weights matrix (which determines the neighbor relationships). We then compute Moran’s I for a variable of interest. The output includes Moran’s I value and a p-value to test its significance. For example, a Moran’s I significantly greater than 0 (with a very low p-value) would indicate globally clustered values of the variable (e.g., regions of consistently high values and regions of low values), whereas a Moran’s I near zero (p-value high) suggests no global spatial structure.
Local Spatial Autocorrelation
Global metrics can mask localized variations, so Local Indicators of Spatial Association (LISA) are used to assess spatial autocorrelation at the local level. A local Moran’s I (one type of LISA) is computed for each individual location, identifying specific areas that contribute significantly to the overall spatial autocorrelation. This analysis pinpoints hotspots and coldspots: for instance, a high-high cluster means a location with a high value is surrounded by high values (a hotspot), and a low-low cluster means a low-value location is near other low values (a coldspot). Likewise, spatial outliers can be detected (e.g., a high value surrounded by lows is a high-low outlier). By mapping local Moran’s I results, we obtain a clear visualization of where significant clustering or dispersion is occurring in the study area.
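In one common formulation (following Anselin's LISA, with values expressed as deviations from the mean), the local Moran statistic for location $i$ is

$$I_i = \frac{x_i - \bar{x}}{m_2} \sum_{j} w_{ij}\,(x_j - \bar{x}), \qquad m_2 = \frac{1}{n}\sum_{k}(x_k - \bar{x})^2,$$

so that the local statistics sum to a quantity proportional to the global Moran's I. Implementations differ slightly in the scaling term, so local values are best compared within a single software package.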
Example in R (Local Moran’s I using spdep):
# Compute Local Moran's I
local_I <- localmoran(points$variable, weights)
# localmoran returns a matrix with columns: Ii, E(Ii), Var(Ii), Z.Ii, Pr(z.Ii)
points$lisa_I <- local_I[,1]  # the local Moran's I statistic
points$lisa_p <- local_I[,5]  # the p-value for each local Moran's I

# Mark significant local clusters (e.g., p < 0.05)
points$lisa_cluster <- factor(ifelse(points$lisa_p < 0.05, "Significant", "Not Significant"))
# Plot significant local clusters
plot(points["lisa_cluster"], main="Local Moran's I Clusters (p < 0.05)")
Example in Python (Local Moran’s I using PySAL):
from esda.moran import Moran_Local
# Calculate local Moran's I
moran_local = Moran_Local(points['variable'], weights)

# Extract results
points['local_I'] = moran_local.Is     # local Moran's I values
points['local_p'] = moran_local.p_sim  # simulated p-values for each location
points['cluster'] = moran_local.q      # cluster types (1 HH, 2 LH, 3 LL, 4 HL)

# For visualization, flag significant results
points['signif'] = moran_local.p_sim < 0.05

# Plot significant clusters (True/False)
points.plot(column='signif', categorical=True, legend=True, cmap='Set1', markersize=30)
plt.title("Local Moran's I Significant Clusters")
plt.show()
In the Python example, moran_local.q provides a code for each observation's cluster type based on its quadrant in the Moran scatterplot: 1 = High-High, 2 = Low-High, 3 = Low-Low, 4 = High-Low. By examining these, an analyst can identify where significant local clustering exists. For instance, a group of neighboring areas all with high values and a significant local Moran's I would form a high-high cluster, which might indicate a localized hotspot for the phenomenon being studied. Such information is invaluable for targeting interventions or further investigation in specific areas rather than treating the study region as homogeneous.
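To turn these quadrant codes into an interpretable map, the codes can be relabelled and masked by significance. Below is a minimal sketch continuing the Python example above (the column name cluster_label and the 0.05 cutoff are illustrative choices):
# Label cluster types, keeping only locations whose local Moran's I is significant
labels = {1: 'High-High', 2: 'Low-High', 3: 'Low-Low', 4: 'High-Low'}
points['cluster_label'] = [labels[q] if sig else 'Not significant'
                           for q, sig in zip(moran_local.q, points['signif'])]

# Map the labelled clusters
points.plot(column='cluster_label', categorical=True, legend=True, markersize=30)
plt.title("LISA Cluster Map (p < 0.05)")
plt.show()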
8.3 Spatial Regression and Modeling
Traditional regression methods assume observations are independent, an assumption violated when spatial autocorrelation is present. Spatial regression models address this by explicitly incorporating spatial dependence in the model structure, leading to more reliable inference and often improved predictive accuracy. Two common types of spatial regression are the Spatial Lag Model and the Spatial Error Model. Both approaches extend linear regression by adding components that capture spatial effects: the lag model introduces spatial dependence in the response (outcome) variable, while the error model handles spatial dependence in the residuals (error term).
Spatial Lag Model
A spatial lag model includes a spatially lagged dependent variable as an additional predictor in the regression. In simpler terms, for each location, the model incorporates a weighted average of the values of the dependent variable in neighboring locations (the “lag”), reflecting the influence that neighboring observations have on the location’s outcome. This is useful when we suspect a diffusion or spillover process; for example, a city’s economic growth might be influenced by growth in adjacent cities. Mathematically, if $y$ is the outcome, $X$ the matrix of predictors, and $W$ the spatial weight matrix, the spatial lag model can be written as $y = \rho W y + X\beta + \varepsilon$, where $\rho$ is the spatial lag coefficient measuring the strength of neighbor influence and $\varepsilon$ is the error term.
Example in R (Spatial lag using spatialreg):
library(spatialreg)
# Suppose 'variable' is the dependent variable and 'predictor' is an independent variable
# 'weights' is the listw spatial weights (as defined earlier)
lag_model <- lagsarlm(variable ~ predictor, data = points, listw = weights)
summary(lag_model)
Example in Python (Spatial lag using PySAL/spreg):
from spreg import ML_Lag
# Prepare data for regression (y as dependent, X as independent matrix)
y = points[['variable']].values
X = points[['predictor']].values

# Spatial lag model (maximum likelihood estimation)
lag_model = ML_Lag(y, X, w=weights, name_y='variable', name_x=['predictor'])
print(lag_model.summary)
In the R output or Python summary, one will see a coefficient for the spatially lagged dependent variable (the spatial autoregressive parameter ρ, labelled Rho in the lagsarlm output). A significant positive lag coefficient ρ indicates positive spatial autocorrelation in the dependent variable that the model is capturing: high values in a location are associated with high values in its neighbors (and similarly for low values). Essentially, the spatial lag model tells us that part of the variance in the outcome can be explained by the outcomes in nearby areas, confirming the intuition that neighboring locations influence each other. Including this term corrects the bias and inconsistency that would affect a standard OLS regression when this form of spatial dependence is present.
Spatial Error Model
A spatial error model, on the other hand, assumes that spatial autocorrelation enters through the error term of the regression. This addresses scenarios where omitted variables, measurement errors, or other unmodeled influences have a spatial pattern. The model can be expressed as $y = X\beta + u$ with $u = \lambda W u + \varepsilon$, where the errors $u$ follow a spatial autoregressive process with coefficient $\lambda$ and $\varepsilon$ is an independent error term. Here, $\lambda W u$ captures the spatially correlated portion of the error. If $\lambda$ is significant, it indicates that there are latent spatial processes not captured by the independent variables, but manifesting as spatially structured error terms.
Example in R (Spatial error using spatialreg):
# Spatial error model
error_model <- errorsarlm(variable ~ predictor, data = points, listw = weights)
summary(error_model)
Example in Python (Spatial error using PySAL/spreg):
from spreg import ML_Error
# Spatial Error Model (maximum likelihood estimation)
error_model = ML_Error(y, X, w=weights, name_y='variable', name_x=['predictor'])
print(error_model.summary)
The output will include a coefficient for the spatial error term (often denoted Lambda or λ). A significant λ indicates spatial autocorrelation in the error terms, meaning the model’s residuals still contain a spatial pattern that the spatial error model is accounting for. In practice, one might choose a spatial error model when spatially distributed omitted factors (such as unmeasured environmental variables or policy effects) are believed to influence the outcome. Under this form of dependence, OLS coefficients remain unbiased but inefficient and their standard errors are unreliable; modeling the error term’s spatial structure restores efficiency and yields valid standard errors.
Note: Deciding between a spatial lag and spatial error model (or more complex models like Spatial Durbin, etc.) often involves diagnostic tests and theoretical considerations. One common approach is to run Lagrange Multiplier tests after an OLS regression to see if a lag or error term is warranted. In any case, incorporating spatial dependence via these models is crucial when ordinary regression assumptions are violated by spatial autocorrelation, as it avoids biased or inefficient estimates.
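As an illustration of that diagnostic step, PySAL's spreg can report Lagrange Multiplier tests alongside an OLS fit when a spatial weights object is supplied. A minimal sketch, reusing y, X, and weights from the examples above:
from spreg import OLS

# OLS with spatial diagnostics: the summary reports LM (lag), LM (error) and their
# robust forms, which help decide between the spatial lag and spatial error models
ols_model = OLS(y, X, w=weights, spat_diag=True, moran=True,
                name_y='variable', name_x=['predictor'])
print(ols_model.summary)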
8.4 Network Analysis
Network analysis in a geospatial context focuses on the connectivity and relationships across networks such as road systems, transit lines, or river networks. It helps answer questions related to accessibility, optimal routing, and flow through a network. For instance, transportation planning and logistics often require finding the shortest or fastest path through a road network, and infrastructure resilience studies may examine alternate routes if certain network links fail.
A common task is shortest path analysis, where the goal is to identify the route between two locations that minimizes some cost (distance, travel time, etc.). Modern tools like the OSMnx library in Python make it straightforward to retrieve street network data from OpenStreetMap and perform routing algorithms on it. OSMnx represents the street network as a graph (nodes are intersections and end-points, edges are road segments) and allows users to calculate and visualize routes that minimize a specified weight (e.g., distance).
Example in Python (using OSMnx for shortest driving route):
import osmnx as ox
# Download road network data for a place (e.g., Montreal)
G = ox.graph_from_place("Montreal, Quebec, Canada", network_type='drive')

# Define origin and destination points (latitude, longitude)
orig_lat, orig_lon = 45.5017, -73.5673  # example origin (Montreal city center)
dest_lat, dest_lon = 45.5222, -73.5852  # example destination (somewhere in Montreal)

# Get the nearest nodes in the graph to these coordinates
orig_node = ox.distance.nearest_nodes(G, X=orig_lon, Y=orig_lat)
dest_node = ox.distance.nearest_nodes(G, X=dest_lon, Y=dest_lat)

# Compute the shortest path (by distance) between origin and destination
route_nodes = ox.shortest_path(G, orig_node, dest_node, weight='length')

# Plot the route on the network
fig, ax = ox.plot_graph_route(G, route_nodes, route_linewidth=4, node_size=0, bgcolor='k')
In this example, we first retrieve the driving network for Montreal. We then find the nearest network nodes to an origin and a destination coordinate. The ox.shortest_path function performs Dijkstra’s algorithm (by default) on the graph to find the path of least cumulative length. Finally, plot_graph_route provides a visualization of the resulting shortest route overlaid on the road network (with the route highlighted in a different color).
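The total length of the returned route can be recovered directly from the graph's edge attributes. A minimal sketch, assuming G and route_nodes from the example above (OSMnx stores edge lengths in metres under the 'length' attribute):
# Sum edge lengths along the route, taking the shortest parallel edge between each
# consecutive pair of nodes (the graph is a MultiDiGraph, so parallel edges can exist)
route_length = sum(min(data['length'] for data in G[u][v].values())
                   for u, v in zip(route_nodes[:-1], route_nodes[1:]))
print(f"Shortest route length: {route_length:.0f} m")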
Network analysis is not limited to shortest paths. Other common analyses include: finding service areas (e.g., all reachable roads within 10 minutes of a location), calculating centrality measures for network nodes (which intersections are most important in traffic flow), identifying bottlenecks or critical links, and solving traveling salesman or vehicle routing problems. The key is that network analysis treats movement through space constrained along network links, which is a different perspective from broad geostatistical analysis that treats space as continuous.
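As a rough sketch of the service-area and centrality ideas mentioned above, using NetworkX functions on the same OSMnx graph (the 2,000 m cutoff is an arbitrary example):
import networkx as nx

# Service area: the subgraph of streets reachable within 2,000 m of the origin node,
# using cumulative edge length as the travel cost
service_area = nx.ego_graph(G, orig_node, radius=2000, distance='length')
print(f"Intersections reachable within 2 km: {len(service_area.nodes)}")

# Betweenness centrality on that subgraph highlights intersections that many
# shortest paths pass through (exact centrality is expensive on large networks)
centrality = nx.betweenness_centrality(nx.Graph(service_area), weight='length')
top_node = max(centrality, key=centrality.get)
print(f"Most central intersection in the service area: node {top_node}")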
8.5 Geostatistical Analysis
Geostatistical analysis involves statistical techniques for modeling and predicting spatially continuous phenomena based on samples at specific locations. Unlike ESDA and spatial autocorrelation which often deal with discrete areas or points, geostatistics typically addresses variables like pollutant concentration, temperature, or elevation, which vary continuously over space. Two fundamental geostatistical interpolation methods are Kriging and Inverse Distance Weighting (IDW), among others. These methods allow us to estimate values at unsampled locations using the values from sampled locations, under different assumptions about spatial correlation.
Kriging
Kriging is a powerful interpolation technique rooted in geostatistics. It provides the Best Linear Unbiased Predictor (BLUP) for spatially correlated data under certain assumptions. In essence, kriging predicts the value at an unknown location as a weighted linear combination of observed values from nearby locations. What sets kriging apart is that the weights are not chosen arbitrarily or just by distance (as in IDW), but are derived from a statistical model of spatial covariance (the variogram) that describes how similarity between points decays with distance. By using the variogram, kriging accounts for the spatial structure of the data – points that are closer (and more correlated) get more weight, but the weighting also considers redundancy among clustered sample points (so clustered points don’t overweight the estimate).
Example in R (Ordinary Kriging using gstat):
library(sp)
library(gstat)
# Assume 'points' is an sf object with an attribute 'variable'
# Convert to a Spatial object for gstat
points_sp <- as(points, "Spatial")

# Compute the empirical variogram of the data
vario <- variogram(variable ~ 1, data = points_sp)
vario_fit <- fit.variogram(vario, model = vgm("Sph"))  # fit a spherical variogram model

# Create a grid of prediction locations - for simplicity, covering the same extent as the data
grid <- spsample(points_sp, type = "regular", n = 1000)  # 1000 regular grid points within the extent
gridded(grid) <- TRUE

# Perform ordinary kriging and map the predictions
krige_result <- krige(variable ~ 1, points_sp, grid, model = vario_fit)
spplot(krige_result, "var1.pred", main = "Kriging Prediction")
Explanation: We first calculate the empirical variogram of the observed variable and fit a variogram model to it. We then perform ordinary kriging to predict the variable on a grid of unsampled locations. The result (krige_result) contains the predictions (var1.pred) as well as the kriging variances (var1.var), which provide uncertainty estimates.
Kriging will yield a smooth interpolation that honors the sample data (predictions at sample points equal observed values for ordinary kriging) and tends to give intermediate values in between. Importantly, because it’s model-based, kriging can also provide confidence intervals for predictions, which is a major advantage in decision-making (knowing how uncertain a prediction is).
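A Python counterpart is possible with the pykrige package, which follows a similar workflow. A minimal sketch, assuming the same points GeoDataFrame with a 'variable' column and projected coordinates (pykrige fits the variogram model internally):
import numpy as np
import matplotlib.pyplot as plt
from pykrige.ok import OrdinaryKriging

# Sample coordinates and observed values
x = points.geometry.x.values
y = points.geometry.y.values
z = points['variable'].values

# Ordinary kriging with a spherical variogram model
ok_model = OrdinaryKriging(x, y, z, variogram_model='spherical')

# Predict on a regular grid covering the data extent; 'ss' holds the kriging variance
gridx = np.linspace(x.min(), x.max(), 100)
gridy = np.linspace(y.min(), y.max(), 100)
z_pred, ss = ok_model.execute('grid', gridx, gridy)

plt.imshow(z_pred, origin='lower', extent=(x.min(), x.max(), y.min(), y.max()), cmap='viridis')
plt.colorbar(label='Kriging prediction')
plt.title("Ordinary Kriging (PyKrige)")
plt.show()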
Inverse Distance Weighting (IDW)
Inverse Distance Weighting is a simpler, deterministic interpolation technique. The premise is straightforward: nearby points have more influence on the estimated value than farther points. IDW computes an unknown value as a weighted average of known values, where weights are an inverse function of distance (e.g., weight proportional to $1/d^p$, with $p$ being a power parameter, commonly 2). There is no probabilistic model of spatial correlation here; it’s purely based on distance decay. Because of this simplicity, IDW is easy to understand and implement, though it lacks the statistical optimality of kriging.
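Formally, the IDW estimate at an unsampled location $s_0$, given observed values $z(s_i)$ at distances $d_i$ from $s_0$ and a power parameter $p$, is

$$\hat{z}(s_0) = \frac{\sum_{i=1}^{n} z(s_i)\, d_i^{-p}}{\sum_{i=1}^{n} d_i^{-p}},$$

so the weights $d_i^{-p}$ are normalized to sum to one and the estimate is always a weighted average bounded by the observed values.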
Example in Python (IDW using PyInterpolate or custom code):
import numpy as np
# Sample known data points (assuming we have numpy arrays for coords and values)
# Known data points as an array of (x, y, value) rows, taken from the GeoDataFrame
known_points = np.array([(x, y, val) for x, y, val in
                         zip(points.geometry.x, points.geometry.y, points['variable'])])

# Define a function for IDW prediction at a single location
def idw_predict(x, y, known_points, power=2, radius=None):
    distances = np.sqrt((known_points[:, 0] - x)**2 + (known_points[:, 1] - y)**2)
    if radius:
        # optionally ignore points beyond a certain radius
        mask = distances <= radius
        distances = distances[mask]
        values = known_points[mask, 2]
    else:
        values = known_points[:, 2]
    # Avoid division by zero by setting a minimum distance
    distances = np.where(distances == 0, 1e-8, distances)
    weights = 1 / (distances**power)
    weighted_avg = np.sum(weights * values) / np.sum(weights)
    return weighted_avg

# Predict at a new location (for example)
x_new, y_new = 100.0, 200.0  # some coordinates
pred_value = idw_predict(x_new, y_new, known_points, power=2)
print(f"IDW prediction at ({x_new}, {y_new}) = {pred_value}")
In practice, one would use IDW to generate a prediction surface over a grid, similar to kriging, by applying the function over all grid cell centers. The power parameter controls how quickly the influence of a point falls off with distance (a higher power means nearer points have much more influence and far points contribute little). If $p=2$, it is the familiar inverse-square weighting. IDW does not inherently provide uncertainty measures, but it is computationally fast and intuitive. It assumes a smooth, monotonic decay of influence with distance and does not consider direction or trends, limitations that more advanced methods like kriging or trend surface analysis can overcome.
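A minimal sketch of that grid application, continuing the example above and reusing idw_predict and known_points (the 100 × 100 grid resolution is an arbitrary choice):
# Build a regular grid over the extent of the data and evaluate the IDW surface on it
xmin, ymin, xmax, ymax = points.total_bounds
grid_x, grid_y = np.meshgrid(np.linspace(xmin, xmax, 100), np.linspace(ymin, ymax, 100))
surface = np.array([idw_predict(gx, gy, known_points, power=2)
                    for gx, gy in zip(grid_x.ravel(), grid_y.ravel())]).reshape(grid_x.shape)

# Visualize the interpolated surface
plt.imshow(surface, origin='lower', extent=(xmin, xmax, ymin, ymax), cmap='viridis')
plt.colorbar(label='IDW estimate')
plt.title("IDW Interpolated Surface")
plt.show()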
When to use what: Kriging requires more work (variogram modeling) but can be more accurate and provides statistical error estimates if the spatial correlation assumptions hold. IDW requires choosing a power (and optionally a neighborhood search radius) but is a quick first-cut interpolator when an analyst needs a fast result and a smooth surface.
8.6 Best Practices in Spatial Analysis
To ensure robust and credible results in spatial analysis, practitioners should follow a set of best practices:
Clearly define objectives: Begin with well-defined questions or hypotheses about spatial phenomena. A clear objective guides the choice of analysis methods (clustering, regression, interpolation, etc.) and the appropriate scale of analysis. For example, identifying disease clusters vs. modeling environmental risk will require different approaches and data preparations.
Account for spatial autocorrelation: Always test for spatial autocorrelation in your data and model residuals. Ignoring spatial dependence when it is present can lead to biased or misleading results. If exploratory tools (like Moran’s I) indicate significant autocorrelation, consider spatial models or include spatial terms to account for it. This applies both in exploratory mapping (to avoid misinterpreting random clumps as meaningful clusters) and in regression modeling (to avoid violating model assumptions).
Use appropriate spatial weights and scales: The definition of “neighbor” (spatial weights matrix) can greatly affect analyses like Moran’s I or spatial regression. Explore different spatial weight schemes (contiguity vs. distance bands vs. k-nearest neighbors) to test the robustness of your findings. Similarly, be mindful of the scale and projection of your data (e.g., working in an appropriate coordinate system and spatial resolution for the phenomenon of interest).
Validate models rigorously: When building predictive spatial models (e.g., interpolation or regression), use proper validation techniques. In particular, use spatial cross-validation or other techniques that respect spatial structure when partitioning data into training and test sets (see the sketch after this list). Random cross-validation can overestimate performance if spatial autocorrelation exists, because nearby training and test points are not truly independent. Additionally, check model residuals for any remaining spatial patterns (a Moran’s I on residuals should be non-significant if spatial effects were adequately modeled).
Communicate uncertainty and assumptions: Spatial analysis results often inform policy and decisions (e.g., identifying a cluster of disease for intervention). It is crucial to communicate the uncertainty (confidence intervals in kriging, p-values in cluster detection, etc.) and the assumptions made (stationarity in kriging, independence in regression, etc.). This transparency helps stakeholders understand the reliability and limitations of the analysis.
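As an illustration of the spatial validation point above, one simple approach is to assign observations to coarse spatial blocks and use those blocks as cross-validation groups, so that nearby points never end up split between training and test folds. A minimal sketch using scikit-learn's GroupKFold (the 1,000-unit block size, the 'predictor' column, and the five folds are illustrative assumptions):
import numpy as np
from sklearn.model_selection import GroupKFold

# Assign each point to a coarse spatial block (grid cell) based on its coordinates
block_size = 1000  # in map units; chosen arbitrarily for illustration
blocks = (np.floor(points.geometry.x / block_size).astype(int).astype(str) + "_" +
          np.floor(points.geometry.y / block_size).astype(int).astype(str))

# Hold out whole blocks together so test points are spatially separated from training points
X = points[['predictor']].values
y = points['variable'].values
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=blocks):
    # fit a model on X[train_idx], y[train_idx] and evaluate on the held-out blocks here
    print(f"train: {len(train_idx)} points, test: {len(test_idx)} points")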
By adhering to these practices, analysts ensure their spatial insights are valid and trustworthy. As noted in the literature, failing to account for spatial effects or to validate properly can lead to errors such as overestimating the significance or predictive power of models. Therefore, spatial rigor is as important as statistical rigor in any geospatial study.
8.7 Conclusion
Geospatial data analysis provides a powerful suite of methods for interpreting complex spatial phenomena. By integrating techniques outlined in this chapter — from exploratory analysis (ESDA) that visualizes distributions and detects clusters, to spatial autocorrelation measures that quantify dependence, spatial regression models that improve explanatory power by including spatial effects, network analyses that solve connectivity problems, and geostatistical methods like kriging and IDW for spatial prediction — analysts can deeply explore and accurately model spatial data. Mastery of these tools enables effective communication of spatial insights via maps, statistics, and predictive models. Ultimately, proficiency in geospatial analytical techniques strengthens data-driven decision-making, facilitating informed interventions and strategies in a wide range of spatially oriented disciplines. By following best practices and staying mindful of spatial relationships throughout the analytical process, one ensures that the conclusions drawn truly reflect the underlying geography of the data.