2 Chapter 2: Making Sense of Spatial Data Using GIS & Geocoding and Georeferencing
2.1 Introduction
“Everything is related to everything else, but near things are more related than distant things.” – Waldo Tobler’s First Law of Geography.
This simple yet profound law forms the foundation of geospatial analysis. When we model and analyze the world, location is not just a variable; it’s a critical dimension that influences how things interact. Spatial proximity often drives relationships that might be missed in non-spatial models. For instance, predicting housing prices based purely on characteristics like size or age can miss the significant impact of location. A house next to a park or near public transportation will often command a higher price, even if it’s identical to one farther away.
In this way, geospatial data allows us to enhance traditional models by incorporating the spatial dimension—a layer of truth that non-spatial models overlook.
Geospatial data science sits at the crossroads of geography and data science, providing a framework for understanding the world through the lens of location. In today’s data-driven society, location matters more than ever. From smartphones equipped with GPS to vast networks of sensors, the explosion of geospatial data has revolutionized how we analyze global phenomena—whether it’s tracking climate change, analyzing economic trends, or mapping global migration patterns.
This chapter will guide you through the foundational concepts and tools of geospatial data science, focusing on how spatial data is represented, managed, and analyzed, primarily with QGIS and R. As geospatial data becomes increasingly prevalent across fields, from aerial photography and satellite imagery to census data and geolocated social media posts, understanding how to process and analyze it is crucial.
We will explore the two primary types of geospatial data:
- Raster data: Represents variables at each point in a grid covering the data’s full extent (e.g., satellite images).
- Vector data: Associates variables with discrete geometric objects in space (e.g., city locations on a map).
While QGIS offers a user-friendly graphical interface that simplifies many tasks, script-based processing in R remains essential for certain operations due to its reproducibility, transparency, and integration with other datasets. Throughout this chapter, we will demonstrate how to effectively use QGIS for various geospatial operations, with R code examples included for those interested in script-based analysis.
2.2 The Prevalence of Geospatial Data
Today, geospatial data is ubiquitous. From satellites capturing high-resolution images of the Earth to the growing availability of government data, organizations and researchers now have access to vast amounts of spatial data.
The power of spatial data lies in its ability to show how things relate to one another geographically. For instance, analyzing the spatial distribution of economic activity alongside transportation networks can reveal how access to infrastructure impacts growth.
Concepts
This section offers an overview of the tools available in R for analyzing geolocated data. This type of data is becoming increasingly common in various domains, such as aerial photography, satellite imagery, census data, and geolocated social media posts.
Geospatial data generally falls into two categories:
- Raster data: Represents variables defined at each point of a grid covering the full extent of the data (e.g., satellite images).
- Vector data: Associates variables (sometimes called attributes) with discrete geometric objects located in space (e.g., city locations on a map).
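To make the distinction concrete, here is a minimal sketch in R that builds one object of each kind by hand. The city coordinates and grid values are illustrative assumptions, not data used elsewhere in this chapter.

```r
library(sf)
library(stars)

# Vector: discrete point geometries, each carrying attributes.
# Coordinates are approximate and for illustration only.
cities <- st_as_sf(
  data.frame(
    name = c("Montreal", "Quebec City"),
    lon  = c(-73.5673, -71.2080),
    lat  = c(45.5017, 46.8139)
  ),
  coords = c("lon", "lat"),
  crs = 4326
)

# Raster: a value defined for every cell of a regular grid.
grid_values <- matrix(1:12, nrow = 3, ncol = 4)
dim(grid_values) <- c(x = 3, y = 4)   # name the grid dimensions
r <- st_as_stars(grid_values)

class(cities)  # an sf data frame: attributes plus a geometry column
dim(r)         # grid size along x and y
```

The vector object stores one row per feature, while the raster stores one value per grid cell, whether or not anything of interest is located there.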
Objectives of Script-Based Analysis
While the graphical interface of GIS software like QGIS is intuitive, script-based processing in R offers several advantages:
- Reproducibility: It is easy to repeat the analysis for new data by rerunning the script.
- Transparency: Other researchers can reproduce the methods if they have access to the same programming language.
- Integration: Spatial data can be extracted and merged with other datasets for statistical analysis within a single programming environment.
Goals
- Familiarize yourself with R packages for processing and visualizing vector and raster data (e.g., sf and stars).
- Perform common data transformation operations using these packages.
- Create more complex static maps (using ggplot2) and interactive maps (using mapview).
Note on Packages
The set of packages available for spatial analysis in R has rapidly evolved. A few years ago, the sp and raster packages were the main tools for processing vector and raster data, respectively. sf and stars are part of a recent initiative to overhaul R’s spatial tools (https://www.r-spatial.org/).
The sf package represents spatial data frames with a standard format based on open-source geodatabases and integrates well with popular R packages for data manipulation and visualization (such as dplyr and ggplot2).
The stars package is also compatible with ggplot2 and provides good support for raster “cubes” with non-spatial dimensions such as time.
The raster package and its successor terra (released in 2020) possess some features not present in stars and perform some operations more quickly. Therefore, it may be useful to learn them if your workflow involves complex operations on large rasters; see the documentation on https://r-spatial.org/ for more details.
2.3 Working with Geospatial Data in QGIS and R
Geospatial data science combines the principles of geography with data science to analyze phenomena through the dimension of location. In this chapter, we’ll explore the foundational concepts and tools used in geospatial data science, with a focus on how spatial data is represented, analyzed, and visualized using QGIS and R.
Why Location Matters
Waldo Tobler’s First Law of Geography states: “Everything is related to everything else, but near things are more related than distant things.” Location is a fundamental aspect of how the world functions, and by incorporating the spatial dimension, we can better understand relationships that may be overlooked in non-spatial models. This concept is central to geospatial analysis, which adds significant value to predictions, such as the relationship between housing prices and their proximity to infrastructure or amenities.
Types of Geospatial Data
Geospatial data comes in many forms, each suited for different types of analysis. Here are the most common data formats and types:
Data Formats:
Data Type | Non-Spatial Formats | Spatial Formats |
---|---|---|
Text | CSV, JSON, XML | GeoJSON, GML, KML |
Binary | PDF, XLS, ZIP | Shapefile, GeoPackage |
Images | TIFF, JPG, PNG | GeoTIFF, JPEG2000 |
Databases | SQLite, PostgreSQL | Spatialite, PostGIS |
- Shapefiles: Widely used for storing geometries and attributes.
- GeoJSON: Lightweight and web-friendly format for spatial data.
- GeoTIFF: Common for raster data, such as satellite images.
Data Types:
- Vector Data: Points, lines, and polygons representing features like roads, cities, or administrative boundaries.
- Raster Data: Gridded data that represents variables such as temperature or elevation, widely used in remote sensing.
- Tiles: Subdivided raster or vector datasets, commonly used in web mapping.
Understanding Spatial Data: Geometry and Attributes
Spatial data is composed of two critical elements:
- Geometry: This describes the physical shape and location of spatial features (points, lines, polygons).
- Attributes: Additional data describing the features, such as population, area, or economic activity.
Example: GeoJSON Format
A GeoJSON object represents spatial data by encoding geometry (coordinates) and attributes. Here is a simple GeoJSON for the city of Montreal:
{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [-73.5673, 45.5017]
  },
  "properties": {
    "id": 1,
    "name": "Montreal"
  }
}
This example includes both the geometry (longitude and latitude) and attributes (city name and ID).
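As a rough illustration, the same feature can be read into R with sf, which accepts a GeoJSON string as its data source. The variable names below are assumptions made for this sketch.

```r
library(sf)

# The Montreal feature from above, stored as a character string.
geojson <- '{
  "type": "Feature",
  "geometry": {"type": "Point", "coordinates": [-73.5673, 45.5017]},
  "properties": {"id": 1, "name": "Montreal"}
}'

# st_read() passes the string to GDAL's GeoJSON driver.
montreal <- st_read(geojson, quiet = TRUE)

st_geometry_type(montreal)   # the geometry part: POINT
st_coordinates(montreal)     # longitude (X) and latitude (Y)
st_drop_geometry(montreal)   # the attribute part: id and name
```

Once loaded, the geometry lives in a dedicated column and the properties become ordinary data frame columns.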
Coordinate Reference Systems (CRS) and Map Projections
CRS are essential for representing the Earth’s 3D surface on 2D maps. Choosing the right CRS ensures accuracy in spatial analysis.
Map Projections:
- Equal Earth projection: Ideal for global maps, preserving area proportions.
- UTM (Universal Transverse Mercator): Commonly used for regional mapping due to minimized distortions in specific zones.
Example: Choosing a CRS
For global maps, an equal-area projection such as Equal Earth preserves the relative sizes of regions while remaining visually appealing. For national or local studies, country-specific CRSs, such as EPSG:7755 for India, are used to keep distortion low within the area of interest.
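The effect of a CRS choice can be sketched in R by taking one longitude/latitude point and expressing it in two projected systems. This is only an illustration: it assumes the Montreal coordinates from the GeoJSON example, UTM zone 18N (EPSG:32618) as the regional CRS, and EPSG:8857 as the Equal Earth CRS, which requires a reasonably recent PROJ installation.

```r
library(sf)

# Montreal in geographic coordinates (WGS 84, EPSG:4326).
montreal <- st_sfc(st_point(c(-73.5673, 45.5017)), crs = 4326)

# Regional projection: UTM zone 18N, with coordinates in metres.
montreal_utm <- st_transform(montreal, 32618)

# Global equal-area projection: Equal Earth.
montreal_ee <- st_transform(montreal, "EPSG:8857")

st_is_longlat(montreal)       # TRUE: degrees
st_is_longlat(montreal_utm)   # FALSE: projected
st_coordinates(montreal_utm)  # easting and northing in metres
st_coordinates(montreal_ee)
```

The location is the same in all three objects; only the numbers used to describe it change with the CRS.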
Tools for Geospatial Data Science: QGIS and R
QGIS: A Practical Introduction
QGIS is an open-source GIS platform that offers a wide range of tools for visualizing, manipulating, and analyzing geospatial data. Its ability to handle multiple data formats, overlay layers, and visualize complex relationships makes it ideal for both beginners and advanced users.
Key QGIS functions include:
- Data loading: Use the Data Source Manager to load shapefiles, raster data, and other formats.
- Exploring attributes: Tools like the Identify tool allow you to view and explore attribute data.
- Applying symbology: Customize how data is visualized using the Layer Styling Panel to create meaningful maps.
R for Geospatial Analysis
R complements QGIS by offering advanced statistical capabilities. With packages like sf and ggplot2, R provides powerful tools for spatial analysis and data visualization. For example, the sf package allows seamless handling of spatial data, while ggplot2 enables sophisticated data visualization.
Example: Loading and Exploring Data in R
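You can perform operations similar to QGIS using the sf package in R:
library(sf)
# Read the shapefile into an sf data frame
mrc <- st_read("data/mrc.shp")
# Inspect the spatial extent, the CRS, and the first rows
st_bbox(mrc)
st_crs(mrc)
head(mrc)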
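If the mrc shapefile is not at hand, the same inspection steps can be tried on the North Carolina counties file that ships with the sf package. This is a stand-in dataset chosen for this sketch, not the data used in this chapter.

```r
library(sf)

# Locate the demo shapefile bundled with sf.
nc_path <- system.file("shape/nc.shp", package = "sf")
nc <- st_read(nc_path, quiet = TRUE)

st_bbox(nc)   # spatial extent: xmin, ymin, xmax, ymax
st_crs(nc)    # coordinate reference system of the layer
head(nc, 3)   # first rows: attributes plus the geometry column
nrow(nc)      # number of features (counties)
```

Checking the extent and the CRS right after loading is a quick way to catch layers that were saved in an unexpected coordinate system.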
Working with Geospatial Data: Visualization and Manipulation
Visualizing Geospatial Data in QGIS
In QGIS, you can visualize data by:
1. Loading data: Use the Data Source Manager to import data.
2. Applying symbology: Open the Layer Styling Panel to customize the visualization based on attributes.
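In R, you can use ggplot2 to create maps:
library(ggplot2)
# mrc_proj is the reprojected layer created with st_transform() in the CRS section
ggplot(data = mrc_proj) +
  geom_sf()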
Manipulating Geospatial Data in QGIS
QGIS offers various tools for data manipulation:
- Field Calculator: Perform calculations or create new fields.
- Processing Toolbox: Tools like buffering, clipping, and spatial joins.
In R, similar data manipulation can be performed using dplyr and sf:
library(dplyr)
# Sum the 2016 population by region; sf merges the geometries of each group
regions <- mrc %>%
  group_by(reg_name) %>%
  summarize(pop2016 = sum(pop2016))
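The grouping pattern above can be tried without the mrc file. The sketch below uses a hypothetical three-row point layer whose column names mirror the example and whose values are invented.

```r
library(sf)
library(dplyr)

# Hypothetical towns with a region label and a population count.
towns <- st_as_sf(
  data.frame(
    reg_name = c("North", "North", "South"),
    pop2016  = c(100, 250, 400),
    x = c(0, 1, 5),
    y = c(0, 1, 5)
  ),
  coords = c("x", "y")
)

# One row per region: attributes are summed, geometries are combined.
regions_demo <- towns %>%
  group_by(reg_name) %>%
  summarize(pop2016 = sum(pop2016))

regions_demo
```

Unlike a plain data frame summary, the result keeps a geometry column, so each aggregated region can be mapped directly.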
The Power of Overlaying: Combining Layers for Deeper Insights
Overlaying is a technique that allows multiple data layers to be analyzed together, revealing relationships between different types of data. For example:
- Overlaying economic data with geographic features: Understanding how natural barriers (e.g., rivers, mountains) impact economic clusters or transportation networks.
- Combining social and environmental data: Correlating population density with climate data to study urban heat islands.
Overlaying reveals hidden correlations and offers a richer understanding of the spatial dimension of data.
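In R, one common way to overlay two vector layers is a spatial join. The following sketch assumes a hypothetical square zone and three hypothetical sites in arbitrary planar coordinates, and attaches the zone attributes to every site that falls inside it.

```r
library(sf)

# A single square zone (planar coordinates, no CRS assigned).
zone <- st_sf(
  zone_name = "A",
  geometry = st_sfc(st_polygon(list(rbind(
    c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0)
  ))))
)

# Three sites: two inside the zone, one outside.
sites <- st_sf(
  site_id = 1:3,
  geometry = st_sfc(
    st_point(c(2, 2)),
    st_point(c(8, 5)),
    st_point(c(15, 5))
  )
)

# Left join by location: sites outside every zone get NA attributes.
overlay <- st_join(sites, zone, join = st_within)
overlay
```

With real data, both layers should be in the same CRS before joining; st_transform() can align them.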
Creating Custom Maps in QGIS
QGIS is designed for both simple and complex map creation. Here’s how to create a map:
1. Load Data Layers: Load the relevant vector and raster layers.
2. Print Layout: Use the Print Layout tool to arrange elements like maps, titles, and legends.
3. Export Map: Export your final map as an image, PDF, or SVG for presentation or publication.
In R, map creation is simplified with the ggplot2 package:
ggplot(data = mrc_proj) +
  geom_sf()
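A slightly fuller map, with a fill attribute, a legend title, and a map title, might look like the sketch below. It uses the North Carolina demo layer bundled with sf as a stand-in for the chapter data; the styling choices and the output file name are assumptions.

```r
library(sf)
library(ggplot2)

# Demo layer shipped with sf, used here in place of the chapter data.
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Choropleth: fill each county by its BIR74 attribute (births, 1974-78).
p <- ggplot(data = nc) +
  geom_sf(aes(fill = BIR74), color = "white", linewidth = 0.2) +
  scale_fill_viridis_c(name = "Births, 1974-78") +
  labs(title = "North Carolina counties") +
  theme_minimal()

# Uncomment to export, analogous to the Export Map step in QGIS.
# ggsave("nc_births.png", p, width = 8, height = 4)
```

Printing p draws the map. Because the plot is an ordinary ggplot object, further layers and scales can be added in the usual way.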
Managing and Transforming Coordinate Reference Systems (CRS)
CRS management ensures spatial alignment and accurate representation of data.
Transforming Coordinate Systems in QGIS
You can reproject layers to a new CRS using the Reproject Layer tool:
1. Select Layer: Choose the layer.
2. Reproject: Use the Reproject Layer tool to transform it to a new CRS.
3. Verify: Ensure the transformation was successful by checking the layer properties.
In R, reprojection can be done with sf:
# Transform to EPSG:6622, then confirm the new CRS
mrc_proj <- st_transform(mrc, crs = 6622)
st_crs(mrc_proj)
Conclusion: Geospatial Data and Model Building
Geospatial data science transforms traditional models by adding the dimension of location, making them more representative of real-world phenomena. By integrating geospatial data into models, we can test and falsify theories more rigorously, moving closer to the truth. Models that ignore location overlook key variables, whereas those that incorporate spatial data provide a more holistic understanding of global patterns.
Working with Time Series Data in R
For those interested in working with time series data, here’s how you can reshape and visualize it in R:
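library(sf)
library(dplyr)
library(tidyr)
library(ggplot2)
# Assumption: nc_32119 is the North Carolina SIDS layer shipped with sf, projected to EPSG:32119, e.g.
# nc_32119 <- st_transform(st_read(system.file("gpkg/nc.gpkg", package = "sf")), 32119)
year_labels <- c("SID74" = "1974 - 1978", "SID79" = "1979 - 1984")
# Reshape to long format: one row per county and period
nc_longer <- nc_32119 %>%
  select(SID74, SID79) %>%
  pivot_longer(starts_with("SID"))
# One map panel per period
ggplot() +
  geom_sf(data = nc_longer, aes(fill = value), linewidth = 0.4) +
  facet_wrap(~ name, ncol = 1, labeller = labeller(name = year_labels)) +
  scale_y_continuous(breaks = 34:36) +
  scale_fill_gradientn(colors = sf.colors(20)) +
  theme(panel.grid.major = element_line(color = "white"))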
Exploring Raster and Vector Data in QGIS and R
Raster Data
Raster data is defined at each point of a grid covering the full extent of the data. It is often used for representing images, topographic maps, remote sensing data, and census data.
Vector Data
Vector data associates variables with discrete geometric objects located in space. It is commonly used to represent points, lines, polygons, and networks.
Combining Raster and Vector Data in QGIS and R
In both QGIS and R, raster and vector data can be combined in various ways. For example, you can query the raster at specific points or calculate aggregates over arbitrary regions.
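In R, this can be done using the stars package:
library(stars)
library(dplyr)  # provides the %>% pipe
# x is a stars raster and pts a point layer in the same projected CRS
st_extract(x, pts) # query at points
aggregate(x, st_buffer(pts, 500), FUN = mean) %>% st_as_sf() # aggregate over circles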
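Since x and pts are not defined in the snippet above, here is a self-contained sketch of the same two operations. It assumes a small synthetic raster with a constant value of 5 on a 10 km square with 1 km cells, and two hypothetical points, all in UTM zone 18N (EPSG:32618).

```r
library(sf)
library(stars)

# Synthetic raster: 10 x 10 cells of 1000 m, every cell equal to 5.
bb <- st_bbox(
  c(xmin = 0, ymin = 0, xmax = 10000, ymax = 10000),
  crs = st_crs(32618)
)
x <- st_as_stars(bb, nx = 10, ny = 10, values = 5)

# Two hypothetical query points located at cell centres.
pts <- st_sfc(
  st_point(c(2500, 2500)),
  st_point(c(7500, 7500)),
  crs = st_crs(32618)
)

# Query the raster value at each point.
at_points <- st_extract(x, pts)

# Average the raster within a 500 m circle around each point.
in_circles <- st_as_sf(aggregate(x, st_buffer(pts, 500), FUN = mean))

at_points
in_circles
```

Because every cell holds the same value, both results should simply return that value. With real data, the buffer distance is interpreted in the units of the CRS, which is why a projected CRS in metres is assumed here.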
2.4 Conclusion
This chapter introduced key tools in QGIS and R for geospatial data analysis, focusing on both vector and raster data. We explored basic data manipulation and visualization techniques, emphasizing reproducibility, transparency, and integration. The importance of understanding coordinate reference systems and projections was highlighted, alongside methods for transforming data between systems. The next chapter will delve into advanced spatial operations, spatial analysis, and the creation of interactive maps.
Data Repository
1. GeoPlatform
URL: geoplatform.gov
Provides a massive repository of geospatial data across various governmental levels (federal, state, county, local, tribal), with an emphasis on climate change, environmental, and policy datasets. Its broad scope and government backing make it one of the most valuable resources for spatial data.
2. EROS Center (U.S. Geological Survey)
URL: eros.usgs.gov
One of the most respected global repositories of aerial photography, satellite imagery, elevation data, and land cover datasets, the USGS EROS Center is a crucial institution for global geospatial research.
3. National Map Small-Scale Data Downloads
URL: nationalmap.gov/small_scale
Originally developed for the National Atlas project, this collection of datasets covers a broad range of topics, including transportation, land cover, boundaries, and water resources. These datasets provide excellent foundational information for U.S.-based projects.
4. TIGER Shapefile Downloads (U.S. Census Bureau)
URL: census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html
TIGER provides essential data for U.S. infrastructure and boundary-based projects, with shapefiles covering districts, roads, rivers, and more. The broad scope of national and state-level data makes it a must-have for projects based in the U.S.
5. Geospatial Data Gateway (U.S. Department of Agriculture)
URL: datagateway.nrcs.usda.gov
A useful source for county-based raster map files, especially orthoimagery. The U.S. Department of Agriculture’s gateway provides easily accessible geospatial datasets that are valuable for environmental and agricultural research.
6. DIVA-GIS
URL: diva-gis.org
Offering free geographic data for any country in the world, DIVA-GIS is widely used for environmental and biological projects. It provides a broad array of administrative and ecological data.
7. Global Administrative Areas (GADM)
URL: gadm.org
A significant resource for administrative boundary data worldwide, GADM supports multiple formats and coordinate systems, offering flexibility for diverse GIS applications.
8. Natural Earth
URL: naturalearthdata.com
Natural Earth provides global-scale vector and raster datasets for cartography and geospatial analysis. It is commonly used for creating visually appealing maps due to its focus on data simplicity and high quality.
9. MapCruzin
URL: mapcruzin.com
Known for providing free GIS maps and geospatial data, MapCruzin is a good resource for users interested in digital cartography, with a focus on U.S.-based data sets.
10. National Historical GIS
URL: nhgis.org
An essential resource for historical geospatial studies in the United States, providing census data and boundary files dating back to 1790.
11. GeoCommunicator (Bureau of Land Management)
URL: www.geocommunicator.gov
This platform focuses on the Public Land Survey System (PLSS) data and supports the mapping of federal land parcels. It’s particularly useful for land management and cadastral surveys in the U.S.
12. GeoCommunity – GIS Data Depot
URL: data.geocomm.com
GeoCommunity offers various types of raster data, including U.S. Geological Survey DRGs, DEMs, DOQQs, and FEMA Flood Data, making it a useful, though more specialized, resource.
13. GEOFABRIK
URL: download.geofabrik.de
GEOFABRIK provides OpenStreetMap shapefiles for streets and other geographic features worldwide. It is a highly specialized tool, particularly valuable for urban studies and infrastructure projects.