11 Spatial Autocorrelation
11.1 Introduction
Definition
“The first law of geography: Everything is related to everything else, but near things are more related than distant things.” — Waldo R. Tobler (Tobler 1970)
Spatial autocorrelation measures the extent to which the values of a variable are related to their spatial arrangement. It captures the idea that geographically close features tend to exhibit similar attributes, while those further apart are less likely to share common characteristics.
In Geographic Information Systems (GIS), spatial data often includes attributes, which are pieces of information tied to mapped entities but not inherently spatial themselves. These attributes can include demographic statistics, economic indicators, or environmental measurements. Mapping these attributes reveals that their values are typically not randomly distributed. Instead, they exhibit spatial relatedness, reflecting the influence of underlying processes that shape patterns across space.
This concept aligns with Tobler’s First Law of Geography: proximity often implies similarity. For instance, property values in a neighborhood, pollution levels near industrial zones, or vegetation types in an ecosystem often form clusters or follow discernible trends. These patterns rarely emerge by chance; instead, they reflect physical, social, or economic dynamics at work.
It is exceedingly rare to find spatial data devoid of any discernible structure. From urban expansion and migration pathways to climatic variations, spatial phenomena almost always exhibit a degree of organization. Understanding these patterns through spatial autocorrelation allows researchers to uncover meaningful relationships, interpret processes shaping the data, and make informed predictions about future trends. This chapter delves into the methods and tools for measuring and interpreting spatial autocorrelation, providing a foundation for more advanced spatial analyses.
See https://mgimond.github.io/Spatial/index.html
To model spatial patterns, the chosen approach depends significantly on how we conceptualize and define the underlying spatial process. Broadly, spatial processes can be categorized into two main approaches: spatial trend models and spatial clustering and dispersion models. Each offers unique insights into the relationships within spatial data, focusing on either systematic variations or localized patterns.
Spatial Trend Models
Spatial trend models aim to identify and analyze gradual changes or gradients across space. They operate under the assumption that the attribute of interest varies systematically throughout the study area, often driven by external factors such as environmental conditions, socioeconomic influences, or global processes. For instance, temperature changes across a continent may display a spatial trend influenced by latitude, altitude, or distance from large water bodies. These models are particularly useful for capturing and explaining global variations in data, providing a broad understanding of spatial phenomena.
Spatial Clustering and Dispersion Models
In contrast, spatial clustering and dispersion models investigate the degree to which features are grouped together (clustered) or evenly distributed (dispersed) across a landscape. These models focus on localized patterns of spatial association rather than global trends. For example, hotspots of disease outbreaks, clusters of urban development, or the distribution of wildlife habitats can all illustrate spatial clustering, where certain areas exhibit higher concentrations of specific features. Conversely, agricultural fields planted in a uniform pattern often exemplify spatial dispersion.
While spatial trend models provide a macro-level view of systematic variations, clustering and dispersion models reveal micro-level patterns, emphasizing localized relationships that may uncover critical insights into the underlying spatial processes. These complementary approaches form the foundation for understanding and analyzing spatial data in both theoretical and applied contexts.
11.2 Global Moran’s I
Moran’s I, often referred to as Global Moran’s I, is a widely used statistic for measuring spatial autocorrelation across an entire dataset. While the same principle can be applied to sub-datasets or localized areas, for now, let’s focus on the global application.
Visual interpretation can sometimes help identify clusters or dispersed regions on a map, but this method has its limitations. Patterns are not always immediately obvious, and relying on visual analysis alone can be subjective. To address this, we need a quantitative and objective method for determining the degree of clustering or dispersion within spatial data and identifying where such patterns occur.
One of the most commonly used tools for this purpose is the Moran’s I coefficient, which provides a single value summarizing the spatial autocorrelation of a dataset. Positive values indicate clustering of similar features, negative values suggest dispersion, and values near zero imply randomness.
A Working Example:
To understand how Moran’s I works in practice, let’s analyze a specific case: the 2020 median per capita income for the state of Maine. By mapping and calculating Moran’s I for this dataset, we can explore whether areas with similar income levels are clustered together, dispersed, or distributed randomly across the state. This example will guide us through the steps of applying Moran’s I to real-world data, illustrating its practical significance in spatial analysis.
Figure: Map of 2020 median per capita income for Maine counties (USA). https://mgimond.github.io/Spatial/index.html
When analyzing spatial patterns, it may seem obvious that income distribution, when aggregated at the county level, exhibits clustering—with counties of high income surrounded by others with high income, and similarly for counties with low income. However, a qualitative description is often insufficient, particularly when we want to rigorously analyze spatial relationships or compare patterns across different datasets. In such cases, we need to quantify the degree of clustering.
This is where the Moran’s I statistic comes in. It acts as a correlation coefficient for the relationship between the values of a variable (e.g., income) and the values of that same variable in neighboring locations.
Defining a Neighbor
Before we can compute Moran’s I, we must first define what constitutes a neighbor, as this relationship forms the basis of the analysis. There are several common approaches:
- Contiguity-based Neighbors: A neighbor is defined as any geographically adjacent polygon (i.e., polygons that share a boundary). This approach is intuitive and widely used, especially for administrative boundary data like counties or districts. For example:
  - Aroostook County in northern Maine has four contiguous neighbors.
  - York County in southern Maine has only two contiguous neighbors.
- Distance-based Neighbors: A neighbor is defined as any feature within a specified distance, such as all counties within 100 kilometers. This method is particularly useful when analyzing datasets where spatial influence might extend beyond direct contiguity.
- k-Nearest Neighbors: A neighbor is defined as the closest k features, such as the 2 nearest counties. This approach ensures each feature has a fixed number of neighbors, which can be helpful for datasets with varying feature sizes or irregular spatial distributions.
For both distance-based and k-nearest neighbor approaches, distances are usually measured between the centroids (geometric centers) of the polygons rather than their boundaries. This simplifies calculations and ensures consistency in defining spatial relationships.
By clearly defining neighbors, we establish the framework for computing Moran’s I and systematically quantifying spatial relationships, enabling meaningful analysis of spatial clustering or dispersion.
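The distance-based and k-nearest neighbor definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a GIS workflow: the centroid coordinates are made up, and the helper names (`pairwise_distances`, `distance_band_neighbors`, `knn_neighbors`) are hypothetical. Contiguity is not shown because it requires polygon geometry.

```python
import numpy as np

# Hypothetical centroid coordinates (in km) for five polygons; not real data.
centroids = np.array([
    [0.0, 0.0],
    [60.0, 0.0],
    [0.0, 70.0],
    [150.0, 0.0],
    [400.0, 400.0],
])

def pairwise_distances(points):
    """Euclidean distance between every pair of points."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def distance_band_neighbors(points, threshold):
    """For each feature, list all other features within `threshold`."""
    d = pairwise_distances(points)
    n = len(points)
    return {i: [j for j in range(n) if j != i and d[i, j] <= threshold]
            for i in range(n)}

def knn_neighbors(points, k):
    """For each feature, list the k closest other features."""
    d = pairwise_distances(points)
    np.fill_diagonal(d, np.inf)  # a feature is never its own neighbor
    return {i: np.argsort(d[i])[:k].tolist() for i in range(len(points))}

band = distance_band_neighbors(centroids, threshold=100.0)
knn = knn_neighbors(centroids, k=2)
```

In this toy layout the last point lies far from the others, so it has no neighbors under the 100 km band but still receives two neighbors under the k-nearest rule. This is the trade-off between the two definitions: a distance band can leave features isolated, while a fixed k can link features that are very far apart.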
Figure: Maps show the links between each polygon and its respective neighbor(s) based on the neighborhood definition. A contiguous neighbor is defined as one that shares a boundary or a vertex with the polygon of interest. Orange numbers indicate the number of neighbors for each polygon. Note that the topmost county has no neighbors when a neighborhood definition of a 100 km distance band is used (i.e., no centroids are within a 100 km search radius). https://mgimond.github.io/Spatial/index.html
Once a neighborhood definition is established for our analysis, the next step is to identify the neighbors for each polygon in the dataset and summarize the values within each neighborhood. This typically involves calculating a summary statistic, such as the mean of the neighboring values.
This summarized value is commonly referred to as a spatially lagged value (often denoted as \(X_{\text{lag}}\)). It represents the average or weighted average of the values in a feature’s defined neighborhood, providing a measure of the spatial context around each feature.
Working Example: Income in Maine
In our example of 2020 median per capita income for counties in Maine, we use a contiguity-based neighborhood definition. This means each county’s neighbors are the counties that share a boundary with it. For every county, we compute the average neighboring income value (\(\text{Incomelag}\)):
- For each county, the neighboring income values are identified.
- The mean of these values is calculated to determine \(\text{Incomelag}\).
Next, we create a scatterplot of \(\text{Incomelag}\) (the spatially lagged income value) versus \(\text{Income}\) (the county’s original income value). This visualization helps us understand the spatial relationship:
- Points clustering along a positive slope indicate spatial autocorrelation, where counties with higher incomes are surrounded by neighbors with similarly higher incomes, and vice versa.
- Points scattered without a clear pattern suggest a lack of spatial autocorrelation.
By examining the relationship between \(\text{Income}\) and \(\text{Incomelag}\), we can gain insights into the degree of spatial clustering or dispersion in the dataset, forming the foundation for computing the Moran’s I statistic.
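The lag computation and the scatterplot slope can be sketched as follows. The income values and the neighbor lists are invented for illustration (six areas arranged in a chain), not the Maine data, and the variable names are assumptions.

```python
import numpy as np

# Hypothetical income values for six areas arranged in a chain 0-1-2-3-4-5.
income = np.array([20.0, 22.0, 25.0, 31.0, 34.0, 36.0])
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

# Spatially lagged value: the mean of each area's neighboring values.
income_lag = np.array(
    [income[neighbors[i]].mean() for i in range(len(income))]
)

# Slope of the least-squares line of income_lag against income,
# i.e. the trend one would see in the scatterplot.
slope, intercept = np.polyfit(income, income_lag, deg=1)
```

A positive `slope` corresponds to the "points clustering along a positive slope" case described above. When the lag is the plain mean of the neighbors, as here, this slope coincides with the Moran’s I statistic computed with row-standardized weights.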
11.3 Some statistical background
In traditional statistics, the correlation equation measures the strength and direction of the linear relationship between two variables. For two variables, \(X\) and \(Y\), the Pearson correlation coefficient (\(r\)) is given by:
\[ r = \frac{\sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^N (x_i - \bar{x})^2 \sum_{i=1}^N (y_i - \bar{y})^2}} \]
Terms in the Equation:
- \(x_i\): The individual data point for variable \(X\).
- \(y_i\): The individual data point for variable \(Y\).
- \(\bar{x}\): The mean of variable \(X\).
- \(\bar{y}\): The mean of variable \(Y\).
- \(N\): The number of data points.
The equation can also be expressed using the covariance of \(X\) and \(Y\) and their standard deviations:
\[ r = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} \]
Where:
- \(\text{Cov}(X, Y) = \frac{\sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y})}{N}\): Covariance between \(X\) and \(Y\).
- \(\sigma_X = \sqrt{\frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N}}\): Standard deviation of \(X\).
- \(\sigma_Y = \sqrt{\frac{\sum_{i=1}^N (y_i - \bar{y})^2}{N}}\): Standard deviation of \(Y\).
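As a quick check of the two forms above, the following sketch computes \(r\) from the covariance and standard deviations (all using \(N\) in the denominator, as in the definitions) and compares it with NumPy's built-in `np.corrcoef`. The function name `pearson_r` and the sample values are illustrative assumptions.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation as Cov(X, Y) / (sigma_X * sigma_Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    cov = (dx * dy).mean()            # covariance with N in the denominator
    sigma_x = np.sqrt((dx ** 2).mean())
    sigma_y = np.sqrt((dy ** 2).mean())
    return cov / (sigma_x * sigma_y)

# Made-up sample values with a strong positive linear relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]     # reference value from NumPy
```

The two results agree because the factors of \(N\) cancel between the numerator and the denominator.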
Key Properties of \(r\)
- Range: \(r\) ranges from \(-1\) to \(1\).
  - \(r = 1\): Perfect positive linear relationship.
  - \(r = -1\): Perfect negative linear relationship.
  - \(r = 0\): No linear relationship.
- Dimensionless: \(r\) is a unitless measure, allowing comparison across different datasets.
- Symmetry: Correlation is symmetric, meaning \(r(X, Y) = r(Y, X)\).
Interpretation
- \(0 < r \leq 1\): Positive correlation (as \(X\) increases, \(Y\) tends to increase).
- \(-1 \leq r < 0\): Negative correlation (as \(X\) increases, \(Y\) tends to decrease).
- \(r = 0\): No linear relationship, though there might still be a non-linear relationship.
This equation is fundamental in statistics for examining relationships between variables in fields such as economics, social sciences, and natural sciences.
11.4 Definition
Moran’s I is a statistical measure used to assess the degree of spatial autocorrelation in a dataset.
Spatial autocorrelation refers to the phenomenon where observations located near each other in space tend to be more similar (or dissimilar) than those further apart.
- Moran’s I is a global measure, providing a single summary value that characterizes the overall spatial structure of the data.
\[ I = \frac{N}{W} \frac{\sum_{i}\sum_{j} w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^2} \]
Where:
- \(N\): The number of spatial units (observations) in the dataset.
- \(x_i\): The value of the variable at location \(i\).
- \(\bar{x}\): The mean of the variable across all locations.
- \(w_{ij}\): The spatial weight between locations \(i\) and \(j\), which defines the spatial relationship (e.g., proximity or adjacency).
- \(W = \sum_{i}\sum_{j} w_{ij}\): The sum of all spatial weights.
The value of Moran’s I typically ranges between -1 and 1:
- \(I > 0\): Positive spatial autocorrelation, indicating clustering of similar values.
- \(I < 0\): Negative spatial autocorrelation, indicating dispersion or a checkerboard pattern.
- \(I \approx 0\): No spatial autocorrelation, suggesting randomness in the spatial distribution.
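The formula can be translated almost term by term into NumPy. The sketch below is a minimal illustration on a hypothetical 4 × 4 grid with binary rook contiguity (cells sharing an edge); the helper names `morans_i` and `rook_weights` are assumptions, not part of any GIS package.

```python
import numpy as np

def morans_i(x, w):
    """Global Moran's I: (N / W) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()                       # deviations from the mean
    numerator = (w * np.outer(z, z)).sum()
    return (x.size / w.sum()) * numerator / (z ** 2).sum()

def rook_weights(nrows, ncols):
    """Binary weights for a regular grid: cells sharing an edge are neighbors."""
    n = nrows * ncols
    w = np.zeros((n, n))
    for r in range(nrows):
        for c in range(ncols):
            i = r * ncols + c
            if c + 1 < ncols:              # neighbor to the right
                w[i, i + 1] = w[i + 1, i] = 1.0
            if r + 1 < nrows:              # neighbor below
                w[i, i + ncols] = w[i + ncols, i] = 1.0
    return w

w = rook_weights(4, 4)

# Two toy patterns: left half high / right half low, and a checkerboard.
clustered = np.array([[1, 1, 0, 0]] * 4, dtype=float)
checkerboard = np.indices((4, 4)).sum(axis=0) % 2

i_clustered = morans_i(clustered.ravel(), w)
i_checkerboard = morans_i(checkerboard.ravel(), w)
```

On these toy grids the clustered pattern gives a clearly positive value (2/3) and the checkerboard gives exactly -1, matching the interpretation listed above.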
Definition of a Neighbor
A neighbor is typically defined as a spatial relationship between two geographic entities or locations. Common definitions include:
- Adjacency (Contiguity):
  - First-order contiguity: Locations that share a boundary (e.g., adjacent polygons or grid cells).
  - Second-order contiguity: Locations that share a boundary with a first-order neighbor.
- Distance-based:
  - Neighbors are defined as all locations within a certain distance of each other.
  - This approach often uses a radius or a fixed threshold to determine proximity.
- K-nearest neighbors (KNN):
  - Each location has exactly \(k\) neighbors, defined as the \(k\) closest locations based on distance.
- Custom criteria:
  - Neighbors may be defined based on socioeconomic, environmental, or other relationships rather than strict geographic proximity.
Spatial Weight
The spatial weight quantifies the strength of the relationship between neighbors. It is typically derived from the definition of neighbors but adds a numerical value to their relationship. The spatial weight matrix (\(W\)) represents these relationships, where:
- \(w_{ij}\): The weight assigned to the relationship between location \(i\) and location \(j\).
- Diagonal elements (\(w_{ii}\)): Usually set to 0 since locations are not considered their own neighbors.
Types of Spatial Weights
Spatial weights are constructed based on how neighbors are defined, but they add nuance to the relationship:
- Binary weights:
  - \(w_{ij} = 1\) if \(j\) is a neighbor of \(i\), otherwise \(w_{ij} = 0\).
  - Simple representation of whether two locations are neighbors.
- Inverse distance weights:
  - \(w_{ij} = 1/d_{ij}\), where \(d_{ij}\) is the distance between \(i\) and \(j\).
  - Closer neighbors are assigned higher weights.
- Row-standardized weights:
  - Normalize weights so that the sum of weights for each row is 1, ensuring comparability across observations.
- Custom weights:
  - Weight values can reflect relationships like trade flows, connectivity, or shared attributes.
Relationship Between Neighbors and Spatial Weights
- The definition of neighbors sets the structure of which entities are considered related (e.g., adjacency, distance).
- Spatial weights further define the intensity or strength of those relationships.
For example:
- In a binary contiguity matrix, adjacency defines neighbors, and spatial weights assign a value of 1 for neighboring pairs and 0 for non-neighboring pairs.
- In an inverse distance matrix, distance-based neighbors are first defined (e.g., locations within 50 km), and then weights are assigned inversely proportional to their distances.
The spatial weight depends on the definition of neighbors but provides additional information by quantifying the strength of these relationships. While the neighbor definition is a binary concept (neighbor or not), spatial weights add a continuous dimension to represent varying levels of influence or proximity.
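To make the distinction concrete, the sketch below first decides who is a neighbor (a 50 km distance band) and then builds three weight matrices on top of that definition. The coordinates are invented and the helper `row_standardize` is a hypothetical name.

```python
import numpy as np

# Hypothetical point coordinates in km; the last point is far from the rest.
coords = np.array([[0.0, 0.0], [30.0, 0.0], [0.0, 20.0], [100.0, 100.0]])

diff = coords[:, None, :] - coords[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Neighbor definition: within 50 km, and never a neighbor of itself.
is_neighbor = (dist <= 50.0) & ~np.eye(len(coords), dtype=bool)

# Binary weights: 1 for neighbors, 0 otherwise.
binary_w = is_neighbor.astype(float)

# Inverse distance weights: 1 / d_ij for neighbors, 0 otherwise.
inv_dist_w = np.zeros_like(dist)
inv_dist_w[is_neighbor] = 1.0 / dist[is_neighbor]

def row_standardize(w):
    """Scale each row to sum to 1; rows without neighbors stay all zero."""
    row_sums = w.sum(axis=1, keepdims=True)
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)

row_std_w = row_standardize(binary_w)
```

Note that the isolated fourth point keeps an all-zero row in every matrix. How such features are handled (dropped, or given a wider search radius) is an analysis decision the sketch leaves open.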
Why Is Moran’s I Useful?
Detecting Spatial Patterns: Moran’s I helps identify whether there is a statistically significant spatial structure in the data, which is critical for understanding processes such as clustering of similar values or the dispersion of dissimilar values.
Improving Model Accuracy: If spatial autocorrelation exists, accounting for it in statistical models (e.g., through spatial regression) can improve predictions and helps avoid misleading results, since many standard models assume independent observations.
Policy and Decision-Making: In fields like urban planning, epidemiology, or environmental science, Moran’s I helps policymakers identify hotspots, clusters, or areas that require targeted interventions.
Validation in Spatial Models: Moran’s I is often used to check for residual spatial autocorrelation after model fitting, ensuring that all spatial structures are adequately accounted for.
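One common way to judge whether an observed Moran’s I is statistically significant is a permutation (Monte Carlo) test: shuffle the values across locations many times and see how often a shuffled map yields a statistic at least as large as the observed one. The sketch below assumes a hypothetical chain of 20 areas with a steady trend; `morans_i` and `permutation_p_value` are illustrative names, and the one-sided pseudo p-value shown is one convention among several.

```python
import numpy as np

def morans_i(x, w):
    """Global Moran's I for values x and spatial weight matrix w."""
    z = np.asarray(x, dtype=float) - np.mean(x)
    return (z.size / w.sum()) * (w * np.outer(z, z)).sum() / (z ** 2).sum()

def permutation_p_value(x, w, n_perm=999, seed=0):
    """One-sided pseudo p-value for positive spatial autocorrelation."""
    rng = np.random.default_rng(seed)
    observed = morans_i(x, w)
    # Shuffling the values breaks any link between value and location.
    simulated = np.array(
        [morans_i(rng.permutation(x), w) for _ in range(n_perm)]
    )
    p = (np.sum(simulated >= observed) + 1) / (n_perm + 1)
    return observed, p

# Hypothetical data: 20 areas in a chain, values rising steadily along it.
n = 20
w = np.zeros((n, n))
for i in range(n - 1):
    w[i, i + 1] = w[i + 1, i] = 1.0
x = np.arange(n, dtype=float)

observed_i, p_value = permutation_p_value(x, w)
```

With 999 permutations the smallest attainable pseudo p-value is 0.001, so the number of permutations bounds how small a p-value can be reported.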
Application Domains
- Public Health: Identifying disease clusters and assessing spatial dependencies in health outcomes.
- Environmental Studies: Analyzing spatial patterns in pollution, climate variables, or resource distribution.
- Urban Planning: Detecting land-use patterns or economic clustering in cities.
- Economics: Studying regional economic disparities and spatial dependencies in economic indicators.
Moran’s I provides a foundation for understanding spatial relationships in data, making it a valuable tool across disciplines that deal with spatially distributed phenomena.
11.5 Local Moran’s I
- Purpose: Measures spatial autocorrelation at the local level, identifying specific locations where significant clustering or dispersion occurs. This allows for the detection of hotspots (clusters of high values), cold spots (clusters of low values), and spatial outliers (locations that differ significantly from their neighbors).
- Scope: Local Moran’s I provides a value for each observation, indicating the degree to which that observation contributes to the overall spatial autocorrelation. This makes it useful for pinpointing areas of interest within the dataset.
- Example: In a property price dataset, Local Moran’s I might identify specific neighborhoods where high-priced properties are clustered (hotspots) or areas where low-priced properties are clustered (cold spots).
\[ I_i = \frac{(x_i - \bar{x})}{\sum_{j}(x_j - \bar{x})^2} \sum_{j} w_{ij}(x_j - \bar{x}) \]
- Interpretation:
  - \(I_i > 0\): Positive local autocorrelation (a cluster of similar values).
  - \(I_i < 0\): Negative local autocorrelation (an outlier or opposite values compared to neighbors).
  - Tests for statistical significance (e.g., p-values) are usually applied to identify meaningful patterns.
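The local statistic can be sketched directly from the formula above. The example assumes a hypothetical chain of seven areas with row-standardized weights; the values and the function name `local_morans_i` are made up for illustration, and no significance test is included.

```python
import numpy as np

def local_morans_i(x, w):
    """Local Moran's I_i = z_i * sum_j w_ij z_j / sum_j z_j^2."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    lag = w @ z                    # weighted sum of neighboring deviations
    return z * lag / (z ** 2).sum()

# Hypothetical values: high cluster, low cluster, then one high value
# sitting next to the low cluster.
values = np.array([10.0, 9.0, 10.0, 1.0, 2.0, 1.0, 10.0])

# Chain adjacency (each area touches the previous and next one),
# row-standardized so each row sums to 1.
n = len(values)
w = np.zeros((n, n))
for i in range(n - 1):
    w[i, i + 1] = w[i + 1, i] = 1.0
w = w / w.sum(axis=1, keepdims=True)

local_i = local_morans_i(values, w)
```

Here the areas inside the high and low runs receive positive values, while the last area (a high value whose only neighbor is low) receives a negative value, flagging it as a spatial outlier. With row-standardized weights and the scaling used in the formula above, the local values sum to the global Moran’s I.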
Key Differences
Aspect | Global Moran’s I | Local Moran’s I |
---|---|---|
Scale | Whole dataset | Individual observations |
Focus | Overall spatial pattern | Specific locations’ contributions |
Output | Single statistic | One value for each observation |
Goal | Assess general spatial autocorrelation | Detect clusters, outliers, hotspots |
Example Question | “Is there spatial clustering overall?” | “Where are the clusters or outliers?” |
When to Use Which?
- Use Global Moran’s I when your goal is to determine if spatial autocorrelation exists in the dataset as a whole.
- Use Local Moran’s I when you need to pinpoint the locations of clusters or outliers within the data.
By combining both, you can first confirm the presence of spatial patterns globally and then delve into local patterns for more detailed insights.
11.6 Homework
Step 1: Prepare Your Data
- Use an Existing Shapefile:
  - If you have a shapefile or another vector dataset, load it into QGIS by dragging it into the workspace or using Layer > Add Layer > Add Vector Layer.
  - Ensure your shapefile contains the variable (attribute column) you want to analyze for spatial autocorrelation.
- Ensure the Data is Ready:
  - Check that the layer is spatially enabled (it should have geometry, such as points, polygons, or lines).
  - Verify that the dataset includes a field with the values you want to assess for spatial autocorrelation.
- Define the Coordinate System:
  - Ensure your shapefile has a coordinate reference system (CRS) that supports spatial analysis, such as UTM or another projected CRS. If necessary, reproject your layer:
    - Right-click on the layer > Export > Save Features As and select an appropriate CRS.
You can proceed directly with this shapefile for Moran’s I analysis without needing to create new data from scratch.
Step 2: Install the Hotspot Analysis Plugin
- Open the Plugin Manager:
  - Go to Plugins > Manage and Install Plugins.
- Search for the Plugin:
  - In the search bar, type Hotspot Analysis.
  - Click on the plugin to select it.
- Install the Plugin:
  - Click on the Install Plugin button.
  - Wait for the installation to complete.
- Close the Plugin Manager:
  - Once the plugin is installed, close the Plugin Manager window.
Step 3: Run the Hotspot Analysis
- Open the Hotspot Analysis Plugin:
  - Go to Plugins > Hotspot Analysis > Hotspot Analysis.
- Select Your Input Layer:
  - In the Hotspot Analysis window, select the layer you want to analyze from the Input Layer dropdown menu.
- Choose the Field to Analyze:
  - Select the field containing the values you want to analyze for spatial autocorrelation from the Field dropdown menu.
- Set the Analysis Parameters:
  - Define the Distance Band or Number of Neighbors for the analysis.
  - Choose the Method for calculating spatial weights (e.g., K Nearest Neighbors, Distance Band, etc.).
- Run the Analysis:
  - Click on the Run button to start the Hotspot Analysis.
- Review the Results:
  - Once the analysis is complete, review the results to identify spatial autocorrelation patterns in your data.
- Save or Export the Results:
  - If desired, save or export the results of the Hotspot Analysis for further analysis or visualization.
Step 4: Interpret the Results
- Identify Hotspots and Coldspots:
  - Look for clusters of high values (hotspots) and low values (coldspots) in the analysis.
- Analyze the Significance:
  - Check the significance of the spatial autocorrelation patterns to determine if they are statistically meaningful.
- Visualize the Results:
  - Use maps, charts, or other visualizations to communicate the spatial autocorrelation patterns in your data.
- Draw Conclusions:
  - Based on the results of the Hotspot Analysis, draw conclusions about the spatial distribution of the variable you analyzed.
11.7 What we have learned
- Spatial autocorrelation is the degree to which the values of a variable are correlated in space.
- Moran’s I is a common measure of spatial autocorrelation that quantifies the degree of clustering or dispersion in a dataset.
- Global Moran’s I assesses spatial autocorrelation across the entire dataset, while Local Moran’s I identifies clusters or outliers at the local level.
- Moran’s I at different lags examines spatial autocorrelation at varying distances to identify patterns of clustering or dispersion.
- Moran’s I can be estimated as the slope of a regression of spatially lagged values on the original values, and its significance can be assessed through Monte Carlo simulations.
- Hotspot Analysis in QGIS enables the visualization and analysis of spatial autocorrelation patterns in geospatial data.