3 Advanced Techniques for Geospatial Data Manipulation
Geospatial data analysis has become a cornerstone of decision-making in diverse domains. From urban planning and environmental management to public health, agriculture, and disaster response, understanding where things happen is as important as what happens. In fact, it is often cited that roughly 80% of the world’s data includes some kind of spatial aspect, underscoring the ubiquitous importance of location in modern data. The proliferation of GPS-equipped devices, remote sensing satellites, and location-based services means that spatial data are being generated at unprecedented scales. The geospatial analytics industry is rapidly growing – by one estimate it will nearly double in size between 2021 and 2026 – and its applications are expanding across both commercial and public sectors.
In tandem with this data boom, tools for spatial analysis have advanced significantly. The R programming language, long favored for data analysis, now boasts a rich ecosystem for geospatial work. In particular, the advent of packages like sf (simple features) and stars has revolutionized how spatial data are handled in R. The sf package implements the Open Geospatial Consortium (OGC) Simple Features standard for vector data, providing a modern, user-friendly alternative to earlier R spatial libraries (e.g. sp for vectors and raster for rasters). By treating spatial objects (points, lines, polygons) as first-class data frame elements, sf makes manipulating geospatial data as intuitive as manipulating tables. Likewise, the stars package extends R's capabilities to raster and spatiotemporal data "cubes," enabling the handling of multi-dimensional environmental data (e.g. satellite images, climate model outputs) within the R ecosystem.
This chapter provides a comprehensive introduction to advanced techniques for geospatial data manipulation using R. We will cover the essential workflows for reading, exploring, transforming, and visualizing spatial data. Emphasis is placed on using the sf package for vector data and stars for raster data, while leveraging tidyverse tools for efficient data handling. Through examples, we will illustrate how to create both static and interactive maps to effectively communicate spatial insights. By the end of the chapter, you should be equipped with the fundamental skills to tackle complex spatial analysis tasks – from cleaning and subsetting spatial datasets, to performing geographic transformations and producing publication-quality maps.
3.1 Objectives
The primary objectives of this chapter are:
- Introduce essential geospatial packages in R: We will become familiar with R's core spatial libraries, particularly sf for vector data and stars for raster and spatiotemporal data. Understanding these packages' capabilities and data structures lays the groundwork for advanced spatial analysis.
- Demonstrate basic spatial data handling: Using real-world examples, we will walk through loading spatial datasets, examining their properties, performing data cleaning and transformations (both attribute and geometric operations), and combining spatial data with traditional data frames.
- Explore visualization techniques: We will create maps as a key analytical and communication tool. This includes basic plotting of spatial objects, thematic mapping with ggplot2, and building interactive maps with leaflet. By learning to visualize spatial data effectively, you can better interpret patterns and convey findings to stakeholders.
These objectives align with building a robust skill set in geospatial data manipulation. Mastery of these techniques will enable you to confidently integrate spatial thinking into data analyses and support informed decision-making across applications.
3.2 Working with Vector Data
Vector data represents geographic features as discrete geometries (points, lines, and polygons) with associated attributes. Common examples include city locations as points, roads or rivers as lines, and administrative boundaries as polygons. In R, the sf package is the de facto standard for working with vector data, providing a powerful and consistent framework. In this section, we cover the fundamental tasks of reading vector data, examining its structure, basic visualization, and subsetting/filtering spatial features.
Reading Vector Data
Spatial vector datasets are often stored in specialized file formats or databases. One of the most prevalent formats is the ESRI Shapefile, which actually consists of multiple files (.shp, .dbf, .shx, etc.) representing geometry and attributes. Other popular formats include GeoJSON (for web-compatible data exchange), KML (Google Earth format), GeoPackage (GPKG), and spatial databases like PostGIS. The good news is that sf can read and write a wide array of these formats thanks to its use of the GDAL library. This means that a single function, st_read(), can ingest virtually any spatial data source – whether it's a shapefile on your disk, a GeoJSON file from a web URL, or a table in a spatial database – and load it as an R spatial object.
R Example – Reading a Shapefile: Suppose we have a shapefile of municipal regions named mrc.shp stored in a local “data” directory. We can read it into R as follows:
library(sf)
mrc <- st_read("data/mrc.shp")
This command will import the vector data into an sf object called mrc. Under the hood, st_read uses GDAL to detect the file format and parse the data. If the dataset contains multiple layers (as in a GeoPackage or a database), you can specify the layer name or use st_layers() to list available layers. In many cases, st_read will print a brief summary of what was read – for example, the number of features, number of attributes, geometry type (point/line/polygon), and the Coordinate Reference System (CRS). The sf object mrc is essentially a data frame (specifically, a tibble) with an attached geometry column, and it can be treated much like a regular data frame for non-spatial operations.
Supported Data Formats: The flexibility of st_read cannot be overstated. Thanks to GDAL support, sf handles formats including shapefiles, GeoJSON, KML, GeoPackage, and more. For example, to read a GeoJSON file or URL, you would simply provide the .geojson path to st_read. Similarly, connecting to a PostGIS database table can be done via a database connection (using packages like DBI and RPostgres) and passing a query or table name to st_read. This uniform interface abstracts away format-specific complexities, allowing you to work with spatial data from various sources without worrying about conversion – a major advantage in real-world workflows where data comes in many forms.
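To make that flexibility concrete, here is a minimal sketch of reading from a few other sources; the file names, layer name, and database connection are hypothetical:
# GeoJSON file (a URL ending in .geojson works the same way)
parks <- st_read("data/parks.geojson")
# GeoPackage: list the available layers, then read one by name
st_layers("data/regions.gpkg")
roads <- st_read("data/regions.gpkg", layer = "roads")
# PostGIS (with DBI + RPostgres): pass a connection and a query
# con    <- DBI::dbConnect(RPostgres::Postgres(), dbname = "gisdb")
# rivers <- st_read(con, query = "SELECT * FROM rivers")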
Note: When reading spatial data, it's good practice to handle possible errors or warnings. If st_read reports issues (e.g., unreadable geometries or a missing CRS), you may need to ensure the data source is not corrupted and that auxiliary files (like .prj for shapefiles, which stores projection info) are present. Additionally, leaving quiet = FALSE (the default) makes st_read print its summary of what was read, which helps when diagnosing problems. For large files, reading might take time; sf will attempt to use efficient methods, but be mindful of memory limits.
Examining Spatial Data Structure
After loading a spatial dataset, it’s important to examine its structure and metadata to understand what we’re working with. Key things to inspect include the data’s attributes, its geometry type, the bounding box (spatial extent), and the coordinate reference system (CRS).
Basic data frame functions like head(mrc) or summary(mrc) will show attribute columns (just as for a regular data frame) and some high-level info. For spatial specifics, sf provides utility functions:
- Geometry type and feature count: Simply printing the sf object (mrc) will report how many features (rows) it contains and the geometry type (e.g., "MULTIPOLYGON" or "POINT"). It will also list the CRS and bounding box.
- Bounding Box: The bounding box is the minimal rectangle that contains all features. Use st_bbox(mrc) to retrieve it. This returns the xmin, ymin, xmax, ymax coordinates of the extent. For example, st_bbox(mrc) might output something like xmin: -73.9, ymin: 45.3, xmax: -72.5, ymax: 46.1 (coordinates in whatever CRS the data is in). Knowing the extent is useful for sanity checks (does it cover the area you expect?) and for setting up map viewports.
- Coordinate Reference System: The CRS defines how the coordinates in the data map to real locations on Earth. Use st_crs(mrc) to get the CRS of the dataset. This will typically return an EPSG code or PROJ4 string. For instance, a common CRS for global data is WGS 84 longitude/latitude (EPSG:4326), while regional data might use a local projection (the example dataset mrc.shp might use the NAD83 / Quebec Lambert projection, EPSG:6622). Confirming the CRS is crucial because if you overlay or compare two spatial datasets, they must be in the same CRS or explicitly transformed to a common one. A CRS description from st_crs will include the coordinate system type (geographic or projected), datum, and projection parameters if applicable (see the short example after this list).
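Putting these checks together, a quick inspection right after loading might look like this (the exact output depends on your data):
print(mrc)             # feature count, geometry type, CRS, bounding box
st_bbox(mrc)           # spatial extent: xmin, ymin, xmax, ymax
st_crs(mrc)            # coordinate reference system details
st_geometry_type(mrc)  # geometry type of each feature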
Why CRS matters: A coordinate reference system defines how a two-dimensional map corresponds to real places on Earth. Geographic CRSs (like WGS 84) use latitude/longitude on an ellipsoidal model of Earth, while projected CRSs (like UTM or state plane systems) flatten the earth for use on a plane, with coordinates in linear units (meters, feet, etc.). Using the wrong or inconsistent CRS can lead to misaligned layers (e.g., your points might not fall inside the polygons they should) and measurement errors. Always check the CRS for each dataset and ensure you know what it is. If st_crs() returns NA, it means the data is missing CRS metadata – you may need to set it manually with st_set_crs() if you know the correct one.
Basic Visualization of Vector Data
One immediate way to understand spatial data is to plot it. Visualization can reveal spatial patterns, highlight potential errors (such as outlier points in the wrong place), and give a sense of the geographic context.
The sf package provides a simple plot method for sf objects. If you call plot(mrc), by default it will produce a multi-panel plot of all attributes (for numeric or factor attributes) along with the geometries. However, for a quick look at just the shapes, it's common to plot only the geometry. You can do this by either subsetting just the geometry column or using the special function st_geometry().
R Example – Quick Plot:
# Plot just the geometry of the municipal regions
plot(st_geometry(mrc), col = NA, border = 'blue', axes = TRUE)
This will draw the outlines of the polygons in blue with no fill color (col = NA means transparent fill) and include axes with coordinate ticks (axes = TRUE). We could also do plot(mrc["geometry"]), which is equivalent in this case. The result is a basic map of the features. If mrc contains multiple polygon features (regions), you'll see their boundaries drawn.
This kind of base R plot is great for an exploratory view. It’s fast and useful for verifying that the data loaded correctly (for instance, the shapes appear in the expected location). You might use this to notice if there’s a stray polygon far off (indicating maybe an erroneous coordinate), or just to familiarize yourself with the region’s shape.
For more refined visual analysis, one might use interactive viewing tools. The mapview package (not covered in detail here) can display sf objects in an interactive viewer with minimal code (e.g., mapview(mrc)), which is very handy during data exploration. We'll cover interactive maps later with leaflet, but mapview is a quick one-liner alternative for exploring data.
Data Selection and Filtering
Just as with non-spatial data frames, we often need to subset spatial data to focus on relevant features. With sf objects, we can perform attribute-based filtering using typical R or dplyr syntax, and we can also leverage spatial relationships to filter data (e.g., selecting all points within a certain area).
Attribute filtering: You can index an sf object like a data frame. For example, suppose mrc has an attribute column pop2016 for population in 2016 and mrc_name for the region name. We can filter regions by population:
# Select regions with population > 200,000, keeping only name and population columns
mrc_subset <- mrc[mrc$pop2016 > 200000, c("mrc_name", "pop2016")]
This uses base R indexing: the row condition mrc$pop2016 > 200000 selects only those rows (features) meeting the criterion, and the column index c("mrc_name", "pop2016") keeps only those two attributes (plus, implicitly, the geometry column – which is always kept unless dropped explicitly). The result mrc_subset is a new sf object containing, say, the subset of municipal regions that have over 200k population, with just their name and population fields. Because sf extends the data frame class, subsetting preserves spatial attributes by default (the geometry column stays with the data as long as drop = FALSE, which is the default in the bracket operator for sf objects).
After such filtering, you might plot mrc_subset to see where these high-population regions are, or do further analysis on them.
Spatial filtering: A powerful feature of sf is the ability to filter based on spatial relationships. The sf package supports spatial predicates – functions like st_intersects, st_contains, st_within, etc., that compare geometries. Moreover, you can use the bracket notation directly with an sf object as the filter to select by intersection. For example, if we had a layer of points (say plots_sf, representing sample plot locations) and we wanted to select only the municipalities that contain at least one sample plot, we could do:
# Assuming plots_sf is an sf of points with the same CRS as mrc
selected_mrc <- mrc[plots_sf, ]
Here, using mrc[sf_object, ] triggers a spatial selection: it will return only those features of mrc that spatially intersect with any feature in plots_sf. By default, the predicate is intersection; you can specify a different one with the op argument (e.g., mrc[plots_sf, , op = st_contains] to select polygons that fully contain any point). This spatial subsetting syntax is very succinct for common tasks like clipping points by a boundary or finding polygons that overlap with others.
Additionally, the dplyr interface provides a filter() method for sf objects that works similarly – you can mix attribute conditions and even call spatial predicate functions inside filter(), as sketched below. But often using base subsetting or explicit spatial joins (next topic) is clearer for complex operations.
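As a rough illustration of that dplyr style (this assumes plots_sf shares the CRS of mrc and that the geometry column is named geometry):
library(dplyr)
big_with_plots <- mrc %>%
  filter(pop2016 > 200000,                                # attribute condition
         lengths(st_intersects(geometry, plots_sf)) > 0)  # spatial condition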
Whether by attribute or geometry, subsetting helps focus the analysis. It’s analogous to making SQL queries on your spatial data: you can select “WHERE population > X” or “WHERE geometry intersects some area.” Both are essential tools.
3.3 Integration with Data Manipulation Tools
One of the strengths of the sf package is its seamless integration with the tidyverse data manipulation verbs. Since an sf object is essentially a data frame with geometry, you can directly use dplyr functions like filter(), select(), mutate(), group_by(), and summarize() on it. This allows powerful combinations of attribute and spatial operations in a clear pipeline style.
Let's explore a common task: aggregating or summarizing spatial data by a region attribute. Suppose our mrc polygons belong to larger regions (for instance, say mrc has a column reg_name indicating a broader region each municipality falls into). If we want to get total population per region, we can use dplyr:
library(dplyr)
regions <- mrc %>%
  group_by(reg_name) %>%
  summarize(pop2016 = sum(pop2016))
This pipeline does what it would for a normal data frame: group the data by reg_name and sum the pop2016 values within each group. The result, regions, will have one row per unique reg_name with the summed population. But what about the geometry? By default, summarizing an sf object will also perform a spatial union of the geometries in each group. In other words, the multiple municipal polygons that shared the same reg_name are dissolved into a single combined polygon for that region. The sf package uses the GEOS library to carry out this union operation behind the scenes, ensuring that we get a proper merged geometry for each group. This behavior is controlled by the do_union argument of the summarize() method for sf (which is TRUE by default). The benefit is that the resulting regions object remains an sf with correct geometries representing, say, larger administrative regions. Had we not unioned the geometries, we would end up with a multipolygon object containing disjoint pieces (or an invalid geometry with overlapping boundaries). The automatic union usually aligns with what an analyst expects when grouping spatial features.
It's worth noting that if for some reason you wanted to keep the individual geometries and not dissolve them, you could use summarize(..., do_union = FALSE), but then you might get a geometry list-column or invalid overlaps – typically not desired unless you have a specific reason.
Beyond summarization, other dplyr verbs work naturally:
- mutate() can add or transform attributes. For example, mrc <- mrc %>% mutate(pop_density = pop2016 / area_km2) (assuming there's an area_km2 attribute or we computed area via st_area()).
- select() chooses a subset of columns but, by default, it keeps the geometry column even if you don't list it (so you don't accidentally drop the spatial component). If you do want to drop geometry and get a plain data frame, you can explicitly do st_drop_geometry(mrc) or convert to a data frame first.
- arrange() will sort the rows; it doesn't affect geometries except reordering them.
- Joins: You can do attribute joins between an sf object and a regular tibble using left_join(), inner_join(), etc., as long as you join by attribute keys (e.g., attaching a demographic table to a spatial layer by region code). The join will simply carry over the new columns into the sf object while retaining the geometry. This is extremely useful for enriching spatial data with additional attributes (for example, joining census data to a spatial boundary file by an ID field).
As an example, if we had a data frame df with columns mrc_name and some statistic, and we wanted to attach that to mrc, we'd do mrc <- left_join(mrc, df, by = "mrc_name"). The result is still an sf object but now with extra attributes from df.
Finally, it's important to mention spatial joins as a form of integration between datasets. A spatial join attaches attributes from one spatial layer to another based on location (e.g., assign each point the attributes of the polygon it falls in). The sf package has a function st_join(x, y, join = st_intersects) for this purpose. For instance, if we have a point layer plots_sf and a polygon layer mrc, st_join(plots_sf, mrc) will produce a new set of points that carry the attributes of the municipality each point lies in. By default it uses intersection (point-in-polygon), but you can specify other relations or a distance buffer. We will discuss some of these spatial operations more in the next section. Conceptually, though, st_join is analogous to a database join but where the key is a spatial relationship rather than an exact matching ID.
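A minimal sketch of such a spatial join, assuming plots_sf has already been transformed to the CRS of mrc:
# Each plot point inherits the attributes of the municipality it falls in;
# points outside every polygon get NA for the joined columns
plots_with_region <- st_join(plots_sf, mrc, join = st_intersects)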
sf objects are fully compatible with the tidyverse workflow, allowing you to filter, aggregate, and join data with the same code you'd use for non-spatial tables. This integration accelerates spatial data cleaning and preparation tasks. Moreover, under the hood sf leverages high-performance libraries for geometry operations (it uses GDAL for I/O, GEOS for geometric ops, and PROJ for transformations), so many of these operations are efficient even on large datasets. Embracing these tools means we can write clear, concise code to manipulate geospatial data and trust that the heavy lifting (geometry crunching) is handled by battle-tested libraries.
3.4 Creating Spatial Data from Data Frames
Not all data comes pre-packaged in a shapefile or spatial format. Often, you may have ordinary data frames or CSV files that contain location information (e.g., latitude and longitude columns, or address fields). R’s spatial toolkit allows you to convert such tabular data into spatial objects, enabling mapping and spatial analysis on them.
A common scenario is a CSV file with coordinates. For example, suppose plots.csv contains ecological survey plots with columns for longitude and latitude, along with other attributes like vegetation type. We can turn this into an sf object with the function st_as_sf():
R Example – Data Frame to sf Point:
plots <- read.csv("data/plots.csv")  # assume it has columns "long" and "lat"
plots_sf <- st_as_sf(plots, coords = c("long", "lat"), crs = 4326)
In this snippet, coords = c("long", "lat") tells st_as_sf which columns contain the X and Y coordinates (note: the order is X then Y, so long = X, lat = Y). We also specify a CRS; here we assume these are longitude/latitude in WGS 84 (EPSG:4326), which is a common scenario for GPS coordinates. If we know the points were collected in the same CRS as another layer (say the mrc data uses a specific projection and the CSV coordinates are actually in that same projection system), we could use crs = st_crs(mrc) to assign the same CRS as mrc. In our example, we used 4326 for global lat/long.
After this conversion, plots_sf becomes an sf object of point geometry. Each row of plots now has a corresponding point (with coordinates from the given columns) stored in plots_sf$geometry. All the other columns (cover_type, plot_id, etc.) remain as attributes in plots_sf. Now that it's spatial, we can do things like plot these points on a map, or perform spatial joins/operations with other spatial data.
A couple of considerations when creating spatial data from scratch:
- CRS correctness: The st_as_sf() function will not transform coordinates; it only labels them. So it's crucial to specify the correct CRS that the coordinates are in. If you set the wrong CRS, your points will appear in the wrong location when mapped. For instance, if the CSV lat/long are WGS84 and you mistakenly assign a local projection, the points could be interpreted entirely incorrectly. Always confirm what coordinate system your source data uses. If the data came from a GPS or an online source, WGS84 (EPSG:4326) is likely. If it's from a local survey, it might be a local projection (you might be given a PROJ4 string or EPSG code in metadata).
- Other coordinate data: Sometimes coordinates are given in separate fields such as UTM northing/easting, or as addresses. For addresses, geocoding (using an API or package) would be needed to get coordinates. For projected coordinates like UTM, you would still use st_as_sf() but you need the appropriate CRS (for UTM, specify the correct zone's EPSG code, e.g., 32633 for UTM zone 33N WGS84).
- Data validation: After creating plots_sf, it's wise to do a quick plot or check a few coordinates to ensure they line up with expectations (e.g., plot them on top of a known map or see if coordinates fall within expected min/max ranges).
By converting raw data frames into spatial objects, we unlock all the spatial operations for those data. For example, we could count how many plots fall in each municipality by doing a spatial join between plots_sf (points) and mrc (polygons), as sketched below. We could measure distances between plot locations, or buffer them to analyze neighboring areas (topics we will touch on soon). Essentially, any dataset with location info can become part of our geospatial analysis workflow after this conversion step.
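A minimal sketch of that count, using the objects created above (the points are reprojected to match the polygons first):
library(dplyr)
plots_in_mrc <- plots_sf %>%
  st_transform(st_crs(mrc)) %>%   # match the polygons' CRS
  st_join(mrc["mrc_name"])        # attach the municipality name to each point
count(st_drop_geometry(plots_in_mrc), mrc_name, sort = TRUE)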
3.5 Working with Raster Data
While vector data deals with discrete features, raster data represents continuous spatial information and imagery using grids of cells (pixels). Common examples of raster data include satellite images, elevation models, land cover classification grids, and climate data layers. In R, the new-generation package stars is designed for handling raster and multi-dimensional array data. (The older raster package and its successor terra are also widely used; terra is very powerful for raster analyses, but here we'll focus on stars as it was mentioned in our objectives.)
Raster data is essentially a matrix (or a stack of matrices for multi-band data) where each cell has a value. This could be a color (in remote sensing imagery), a temperature, an elevation, etc. stars treats rasters as spatiotemporal arrays, which can have dimensions like x, y, time, and band. One strength of stars is that it can handle multi-band images and time series data (data cubes) in a tidy way, and it interoperates with sf for combined vector/raster operations.
Reading and Exploring Raster Data
Reading a raster file in R with stars is straightforward using read_stars(). For example:
library(stars)
# Read a GeoTIFF file (e.g., a satellite image)
sat_image <- read_stars("data/landsat_scene.tif")
print(sat_image)
If landsat_scene.tif has multiple bands (say, Red, Green, Blue, Infrared), read_stars will load them all, and the print output will tell us the dimensions. A typical print might show something like "stars object with 2 dimensions and 4 attributes" or "dimensions: x, y, band; resolution …; CRS: …". stars uses GDAL under the hood for reading common raster formats (GeoTIFF, NetCDF, etc.), so it supports many formats, similar to sf.
Just as we checked vector metadata, for rasters we should check: the dimension sizes (number of rows, columns, layers), the resolution (cell size), the extent (which st_bbox(sat_image) would give), and the CRS (st_crs(sat_image)). These are accessible through st_dimensions(sat_image), which lists each dimension's values or range. For example, st_dimensions(sat_image)$x might show the sequence of x coordinates for cell centers, etc.
If the raster has a defined CRS (most GeoTIFFs do), read_stars will attach it. If not, you might need to set it with st_set_crs(), similarly to vectors.
Plotting rasters: The stars object can be plotted directly. If it's a single-band raster, plot(sat_image) will display it (with a legend of values). If it's multi-band, plot(sat_image) might plot each band unless it recognizes a 3-band set as RGB. You can also specify plot(sat_image, rgb = c(3,2,1)) to plot a color image using bands 3, 2, 1 for RGB (for instance). Plotting is useful to quickly visualize the raster and check for any obvious issues (like whether the data aligns with known features, or if there are missing data areas).
Raster operations basics: Once loaded, raster data can be manipulated much like any array. For example, you can do arithmetic on the object: sat_image * 2 would multiply all cell values by 2 (if it's numeric data). With multiple bands, you can index the band dimension to select a subset or do calculations across bands. stars provides functionality such as st_apply() to apply functions across dimensions (like calculating a mean across time or bands).
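For example, a per-pixel summary across the band dimension might be sketched as:
# Apply a function at each (x, y) cell, collapsing the remaining "band" dimension
band_mean <- st_apply(sat_image, c("x", "y"), mean)
band_max  <- st_apply(sat_image, c("x", "y"), max)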
However, many advanced raster analyses (like filtering, zonal stats, etc.) are beyond our scope in this intro. We will focus on a few key tasks:
Cropping / masking: You can crop a raster to a region of interest or mask it by a polygon. For example, to crop to our mrc extent: sat_image_cropped <- sat_image[st_bbox(mrc)] (stars allows using an sf object or a bbox to index the raster by spatial extent). For masking by a polygon (setting values outside the polygon to NA), one might use st_crop() followed by manual NA assignment, or use terra::mask if mixing with terra.
Raster-vector integration: A common need is to extract raster values at point locations or summarize raster values within polygons. The st_extract() function in stars can retrieve raster values for given point geometries. E.g., st_extract(sat_image, plots_sf) would give the pixel values for each point in plots_sf. To get a summary within polygons (like the average value per polygon), you could use aggregate() (the stars method): e.g., aggregate(sat_image, by = mrc, FUN = mean), which will compute the mean of raster cells within each mrc polygon. Under the hood, stars uses GDAL routines for some of these tasks and can convert between raster and vector (e.g., st_as_sf on a stars object can polygonize raster regions, though for large rasters this can be heavy). A short sketch of these two operations follows the thresholding example below.
Example – Thresholding a raster: To illustrate a simple raster operation, imagine our satellite image has a band for a vegetation index and we want to find areas above a certain threshold. We could do:
veg_index <- sat_image[ , , "NDVI"]  # assuming one band named "NDVI"
high_veg <- veg_index > 0.8          # this yields a logical stars object
high_veg <- st_as_stars(high_veg)    # convert logical to 0/1 raster (TRUE/FALSE)
high_veg_polygons <- st_as_sf(high_veg, as_points = FALSE, merge = TRUE)
In this pseudo-code, high_veg would be a raster mask of TRUE/FALSE where NDVI > 0.8. Converting that stars object to sf polygons (with merge = TRUE to combine adjacent TRUE cells) would yield polygon features outlining high vegetation areas. The example shows how one can go from raster analysis to vector results, which is often done in land cover mapping, etc. (This process relies on GDAL's polygonize under the hood.)
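And a hedged sketch of the raster–vector operations mentioned above, with the vector layers first brought into the raster's CRS:
pts  <- st_transform(plots_sf, st_crs(sat_image))
poly <- st_transform(mrc, st_crs(sat_image))
plot_vals <- st_extract(sat_image, pts)                   # pixel values at each point
mrc_means <- aggregate(sat_image, by = poly, FUN = mean)  # mean cell value per polygon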
Handling large rasters: Raster datasets can be very large (think hundreds of millions of pixels). stars has the concept of lazy loading and proxy objects. If a file is big, read_stars() may create a stars_proxy object that doesn't immediately load all data into memory, but rather waits until you actually need it (for plotting or computation), possibly downsampling if appropriate. This allows working with large rasters without running out of memory, as long as you don't force it to read everything. If you encounter performance issues, consider reading only what you need (spatially or by band). Alternatively, the terra package is optimized in C++ for many raster operations and can handle large data and on-disk processing efficiently. It's fine to use a mix of tools: for instance, you might use terra for heavy raster crunching and then convert the result to stars or sf for integration with vector data.
stars extends our geospatial toolkit to continuous data. The ability to bring in a raster layer (e.g., a digital elevation model or a climate surface) and combine it with vector data (e.g., administrative boundaries, sample locations) greatly enriches the analysis. Many spatial analyses involve both vector and raster data – for example, extracting elevation at point locations, or computing the population within areas of high flood risk (overlaying population rasters with flood extent polygons). With sf and stars together, R provides a unified environment to conduct such multi-dimensional analyses.
3.6 Coordinate Reference Systems (CRS)
As introduced earlier, coordinate reference systems are the backbone that ties spatial data to real-world locations. A CRS includes a coordinate system (how coordinates are measured) and a geodetic datum (a model of Earth’s shape). Working with CRSs correctly is vital for accurate spatial analysis. This section delves a bit deeper into CRSs and demonstrates how to reproject data in R.
A key point to remember: All spatial operations and comparisons assume the coordinates are in a single consistent CRS. If not, you must transform one dataset to the CRS of the other (or both to a common CRS) before analysis. R will generally not stop you from overlaying data with different CRSs, but the results will be nonsense if the CRSs differ.
Understanding and Inspecting CRS
When you load a spatial dataset, always note its CRS (using st_crs). CRSs can be geographic (lat-long) or projected.
- Geographic CRS (unprojected): This uses latitude and longitude on an ellipsoid. Coordinates are in degrees. Example: WGS 84 (EPSG:4326) – the most common global CRS, used by GPS and many datasets. In a geographic CRS, a degree of longitude represents a different distance on the ground depending on latitude (roughly 111 km at equator, 0 at poles), so distance/area calculations in degrees are not uniform.
- Projected CRS: A mathematical projection of Earth onto a flat surface, using linear units (meters, feet). There are hundreds of projections, each suited for different regions or purposes (preserving areas, shapes, distances, etc.). Examples: Web Mercator (EPSG:3857) is used in many web maps (like Google or OSM tiles) – it’s a Mercator projection in meters; Universal Transverse Mercator (UTM) which divides the world into zones for detailed mapping; national grid systems like NAD83 / State Plane or the Lambert Conformal Conic used for Canadian provinces. Each projection introduces some distortion (no flat map can represent the 3D earth perfectly), but is often chosen to minimize distortion in the area of interest.
It’s important to choose an appropriate projection for your analysis needs. If you’re measuring areas or distances, an equal-area or equidistant projection might be preferred. If you just need a pleasing visual map for a local area, a conformal projection (preserving shape) might be used. Fortunately, reprojecting data in R is easy, which encourages experimenting or switching CRS as needed.
Reprojecting Data with st_transform
To convert spatial data from one CRS to another, we use the function st_transform(x, crs = <target CRS>). The crs argument can be specified by an EPSG code, a PROJ4 string, or an st_crs() object copied from another layer.
R Example – Reprojecting: Continuing with our mrc example, suppose it was originally in a geographic CRS (lat/long). For certain analyses (say we want to calculate land area or distances between centroids), it's better to have a projected CRS in meters. We could do:
mrc_proj <- st_transform(mrc, crs = 6622)
Here EPSG:6622 (NAD83(CSRS) / Quebec Lambert) is a projected CRS suitable for the Quebec area (with meters as units). After this, mrc_proj coordinates are in a planar system. We could verify by checking st_crs(mrc_proj), which should show EPSG:6622 and units in meters, and st_bbox(mrc_proj), which would now give coordinates in perhaps hundreds of thousands of meters (e.g., easting/northing values) rather than degrees.
If you don't know an appropriate EPSG code offhand, you can often find one by searching (sf::st_crs("WGS84") gives 4326; st_crs(3857) gives Web Mercator; for local systems, sometimes the data provider will specify the EPSG or you might search on spatialreference.org).
st_transform will utilize the PROJ library to accurately convert coordinates. This includes datum transformations if needed (for example, converting from the NAD27 datum to WGS84 will apply a shift – PROJ handles that internally if it has the grid parameters). If R outputs messages about missing datum shift files, you might need to install those if precision is required, but often small datum differences are negligible for many uses.
Practical tip: Always ensure the data's CRS is correctly set before transforming. If st_crs(mrc) was NA (unknown), st_transform won't know how to interpret the numbers and will likely fail or give wrong results. You would first do st_set_crs(mrc, <EPSG or proj string>) to assign the correct original CRS (without changing the data, just labeling it), and then transform, as in the sketch below.
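A minimal sketch of that two-step pattern (the EPSG codes are illustrative):
# The file arrived without CRS metadata, but we know it is NAD83(CSRS) / Quebec Lambert
mrc <- st_set_crs(mrc, 6622)          # label only; coordinate values are unchanged
mrc_wgs84 <- st_transform(mrc, 4326)  # now an actual reprojection is safe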
When to reproject: You'll reproject data whenever layers need to align. For instance, if you have points in WGS84 but polygons in a local projection, decide on one CRS and transform the other. Also, if you plan to compute lengths or areas, using an appropriate projection (especially an equal-area projection for area) is recommended. Measuring area in degrees is meaningless; a degree of longitude does not correspond to a fixed ground distance. Instead, transform to a CRS where a unit is a real length (a meter) in your region, then use st_area or st_length.
One example: if calculating the area of each mrc polygon, do st_area(mrc_proj) after projecting to a suitable meter-based CRS. If you used st_area(mrc) while it was in lat/long, sf (by default, as of recent versions) might actually use spherical geometry to approximate the area on the ellipsoid, which can yield results in square meters even from degrees (thanks to the s2 geometry engine integration). This can be handy, but it's sometimes confusing. Many prefer to explicitly project to a planar CRS and then calculate area.
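A short sketch of the two approaches (assuming mrc is in lat/long with s2 enabled, and mrc_proj in the meter-based projection):
area_sphere <- st_area(mrc)        # spherical area computed via s2, returned in m^2
area_planar <- st_area(mrc_proj)   # planar area in the projected CRS, in m^2
area_km2 <- as.numeric(area_planar) / 1e6   # drop units and convert m^2 to km^2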
Multiple CRSs in a project: It's common to switch CRS depending on the task. You might keep data in WGS84 for ease of sharing and because it's the common denominator, but project to local coordinate systems for analysis or mapping. For example, you could have your raw data in 4326, then create a transformed version in a local projection for distance calculations, and perhaps another transformed to Web Mercator if you want to overlay on a Google map in a web context. The key is to keep track of which object is in which CRS (naming like mrc_ll vs. mrc_proj can help). The st_crs metadata will guide you, so always check it if unsure.
Coordinate reference systems ensure spatial data from different sources align correctly on the Earth's surface. Proper use of CRSs allows us to integrate maps seamlessly and perform accurate spatial analysis (for instance, when combining layers, a CRS acts like a common language of location). With R's st_transform function, we can easily convert CRSs, thus overcoming one of the historically tricky parts of GIS work. This removes a major barrier to combining data globally and lets us focus on analysis, confident that "a meter means a meter" and "a degree means a degree" as appropriate in our computations.
3.7 Customizing Maps with Visualization Libraries
So far we have discussed plotting for exploratory purposes. When it comes to presenting results or creating publication-quality maps, we often need more control over the appearance. This is where R's powerful ggplot2 package comes in. Thanks to the integration of sf with ggplot2, making complex maps with legends, scales, and multiple layers is straightforward.
Additionally, specialized mapping packages like tmap (thematic maps) exist, but here we will illustrate with ggplot2 since it’s widely used and very flexible.
Using ggplot2 for Static Maps
Recall that sf objects can be used directly in ggplot. ggplot2 has a geom specifically for spatial data: geom_sf(). This geom will automatically handle the projection (using the coordinate system of the data) and can plot points, lines, or polygons appropriately based on the geometry type.
R Example – Layered Map with ggplot2:
library(ggplot2)
ggplot() +
geom_sf(data = mrc_proj, fill = "gray90", color = "white") +
geom_sf(data = plots_sf, aes(color = cover_type), size = 1.5) +
labs(title = "Sample Plots and Municipal Regions",
subtitle = "Cover types of plots overlaid on region boundaries",
color = "Cover Type") +
theme_minimal()
Let’s break down what this does:
- We start with ggplot() and add layers with +. The first layer, geom_sf(data = mrc_proj, ...), draws the municipal regions (polygons). We set fill = "gray90" to fill polygons light gray, and color = "white" to draw white borders (this gives a subtle boundary effect). If we had an attribute to color by (say population density), we could do aes(fill = pop_density) inside geom_sf to create a choropleth. Here we chose a uniform fill.
- The second layer, geom_sf(data = plots_sf, aes(color = cover_type), size = 1.5), adds point locations for the plots, using the attribute cover_type to color them. Each unique cover type (e.g., forest, grassland, etc.) will get a different color automatically. The size = 1.5 makes the points a bit larger for visibility.
- We add labs to set a title, a subtitle, and rename the legend for color to "Cover Type". By mapping color = cover_type in aes(), ggplot knows to produce a legend for that.
- We apply a theme (the minimal theme here) to give a clean look. We could further customize, e.g., adding theme(legend.position = "bottom") or theme(axis.text = element_blank(), axis.ticks = element_blank()) if we want to hide coordinate axes for a cleaner map (often maps don't need lat-long labels if it's not important).
The output is a layered map: a base of polygons and points on top, with a legend indicating the cover type colors. This demonstrates how ggplot’s layering system works naturally with spatial data, just like with other data types.
Coordinate systems in ggplot: By default, when using geom_sf, ggplot will use the CRS of the data for plotting (and include an appropriate coordinate grid if axes are drawn). If you want to change the map projection for display, you can use coord_sf(crs = <desired CRS>). For example, you might have data in a projected CRS but want to show it in a geographic (unprojected) form, or vice versa. coord_sf will project the geometries on the fly for display. However, if your data is already in a suitable projection for the region shown, it's often fine to let ggplot use it directly.
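For instance, the same projected layer could be displayed in unprojected WGS 84 with a one-line change:
ggplot() +
  geom_sf(data = mrc_proj, fill = "gray90", color = "white") +
  coord_sf(crs = 4326)  # reproject on the fly for display only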
Further customization: You have full ggplot2 capabilities, which include:
- Adjusting color scales (e.g., scale_color_brewer(palette = "Set2") to use a colorblind-friendly palette for categories).
- If using fill for polygons, scale_fill_viridis_c() might be nice for a numeric choropleth, or scale_fill_manual() for custom discrete colors.
- Adding annotations: ggplot makes it easy to add titles, captions, or even map elements like scale bars or north arrows (not built-in, but packages like ggspatial provide those).
- Faceting: You could facet maps by an attribute. For example, if plots_sf had surveys from different years, you could facet by year to compare spatial patterns over time side by side.
- Combining with non-spatial layers: Because ggplot is general, you can overlay non-map data too. For instance, adding geom_text() at specific coordinates for labels, or even using annotation_raster() to put an image under the map, etc.
An important consideration for map presentation is the coordinate aspect ratio. coord_sf by default ensures the x and y scales are in correct proportion (1 unit in x equals 1 unit in y), which prevents distortion of shapes. If you ever override that (not usually advisable for maps), you could stretch the map, but normally you keep the aspect ratio fixed to reality.
Example – Choropleth Map: As another scenario, say we want to color municipalities by population density. We could do:
mrc_proj <- mrc_proj %>% mutate(pop_density = pop2016 / area_km2)
ggplot(mrc_proj) +
geom_sf(aes(fill = pop_density)) +
scale_fill_viridis_c(trans = "log", name="People per km^2") +
theme_bw()
This single-layer plot will color each polygon by pop_density using a viridis color scale (log-transformed to handle a skewed distribution). The legend is labeled accordingly. geom_sf knows to take the data from mrc_proj and map fill to that attribute. The result is a classic choropleth map. We used theme_bw() for a simple black-and-white base theme; one could further tweak it.
Through these examples, you can see that advanced maps in R benefit from combining sf for data and ggplot2 for aesthetics. The result is high-quality visualizations that can be output to PNG/PDF for reports. They are also reproducible: the code fully specifies how the map is generated, so you can rerun it with updated data or share it with collaborators for transparency.
3.8 Interactive Maps
Static maps are great for reports and printed media, but interactive maps provide a dynamic way to explore spatial data. By panning, zooming, and clicking on features, users can glean insights that might be hard to convey in a static image. R makes it easy to create interactive web maps through packages like leaflet (an R interface to the popular Leaflet.js library) and mapview. In this section, we'll focus on leaflet for creating customizable interactive maps.
Leaflet in R works by creating a widget (an HTML map) that you can view in RStudio's viewer or in a web browser. The basic workflow is to initialize a map with leaflet() and then add layers to it (tiles, markers, polygons, etc.) with add* functions.
Using Leaflet in R
To start, ensure you have leaflet installed (install.packages("leaflet")). Then:
R Example – Basic Interactive Map:
library(leaflet)
leaflet(mrc_proj) %>%
addTiles() %>% # add default OpenStreetMap tile background
addPolygons()
In this snippet, leaflet(mrc_proj) initializes a leaflet map with the mrc_proj polygons as the default data source. The addTiles() function adds a base map (the default is OpenStreetMap tiles, which are free and widely used). addPolygons() without any arguments will add the polygons from mrc_proj (because we provided that as the data in the leaflet() call). The result is an interactive map where you can zoom and pan, and the municipal regions are overlaid on top of OSM street maps. By default, leaflet will give the polygons a blue fill and white outline with some transparency.
This minimal example shows the core idea. But leaflet allows a lot of customization:
Popups and tooltips: You can specify what happens when a user clicks or hovers on a feature. For example, addPolygons(popup = ~mrc_name) would display the region name when a polygon is clicked. The ~mrc_name syntax tells leaflet to use the mrc_name column of the data for each feature's popup. Tooltips (appearing on hover) can be added with label = ~some_column.
Theming polygons: You can style the polygons via arguments in addPolygons, such as color, fillColor, weight (border thickness), opacity, fillOpacity, etc., and these can be set dynamically using attribute values. For example, you could color by population density by providing a vector of colors or a palette function. The leaflet package has functions like colorNumeric() to generate color palettes. E.g.,
pal <- colorNumeric("YlOrRd", domain = mrc_proj$pop_density)
leaflet(mrc_proj) %>%
  addTiles() %>%
  addPolygons(fillColor = ~pal(pop_density), color = "black", weight = 1,
              popup = ~paste(mrc_name, "<br> Density:", pop_density))
This would color each polygon using a yellow-to-red palette based on density, with black borders, and a popup listing name and density.
Adding markers or circles for point data: If you have plots_sf points, you could do addCircleMarkers(data = plots_sf, radius = 3, color = ~cover_pal(cover_type), popup = ~cover_type). This would put interactive circle markers for each plot, colored by cover type (using some palette function cover_pal you define).
Layers control: Leaflet allows multiple layers and a control to toggle them. For instance, you can add multiple tile layers (addProviderTiles("CartoDB.Positron") for a different basemap) and use addLayersControl to choose between base maps or toggle overlays (like turning on/off the plots layer).
Integration with Shiny: Leaflet maps can be made part of Shiny web applications, where user interactions on the map can trigger R code (for example, selecting points, dynamic filtering). This is beyond this chapter, but worth noting as a capability.
One thing to remember is that leaflet expects data in WGS84 geographic coordinates (longitude/latitude) under the hood. If your data isn't in lat/long, leaflet will still try to display it but the results may be distorted or offset. Many spatial packages automatically transform data for leaflet if needed. It's good practice to ensure the data given to leaflet() is in EPSG:4326. In our example, mrc_proj was in a projected CRS, so technically we should transform it: leaflet(st_transform(mrc_proj, 4326)) %>% .... The example might have worked if mrc_proj had a CRS attribute that leaflet recognized and transformed, but to be safe, supply lat/long, as in the sketch below.
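Pulling several of these pieces together, a small multi-layer map could be sketched as follows; cover_pal and the layer group names are illustrative choices, not fixed leaflet names:
mrc_ll    <- st_transform(mrc_proj, 4326)   # leaflet wants lat/long
plots_ll  <- st_transform(plots_sf, 4326)
cover_pal <- colorFactor("Set2", domain = plots_ll$cover_type)
leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(data = mrc_ll, color = "black", weight = 1, fillOpacity = 0.2,
              popup = ~mrc_name, group = "Regions") %>%
  addCircleMarkers(data = plots_ll, radius = 3, color = ~cover_pal(cover_type),
                   label = ~cover_type, group = "Plots") %>%
  addLayersControl(overlayGroups = c("Regions", "Plots"))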
Interactive maps are especially useful during exploratory analysis and for stakeholder engagement. For example, if analyzing health outcomes by region, an interactive map allows a user to click on their region and see details, or to zoom into neighborhoods. The ability to overlay multiple layers and toggle them helps in understanding relationships (e.g., toggling a flood risk raster vs. population density layer).
In R, other packages like mapview provide even quicker interactive plotting (one line to view an sf object on a map). Mapview is great for quick looks, but leaflet gives finer control for making a polished interactive map. There's also tmap, which can produce static and interactive maps with the same code (switching to tmap_mode("view") for interactive), which is very convenient.
For our purposes, the main takeaway is that creating an interactive map in R is relatively simple: use leaflet(), add some base tiles, and add spatial data layers. The result can be saved as an HTML file or viewed in RStudio. Such maps are a powerful way to communicate spatial findings, as users can explore the data themselves.
3.9 Conclusion
In this chapter, we covered advanced techniques for geospatial data manipulation in R, focusing on practical skills with the sf and stars packages. We began by highlighting the importance of spatial analysis across fields and noting the explosive growth in spatial data availability. The introduction of modern spatial packages in R – notably sf for vector data (a package that "has revolutionized spatial handling in R" by adopting the standardized simple features approach) and stars for raster/array data – has made it much easier to integrate geospatial analysis into data science workflows.
Key takeaways from this chapter include:
- Reading and exploring spatial data: We learned how to load vector data from various formats with st_read() and inspect its structure (attributes, CRS, bounding box). Understanding the data's coordinate reference system is critical to avoid misalignment; a CRS defines how spatial coordinates map to real locations, and we practiced checking and setting CRSs appropriately.
- Data manipulation and subsetting: Treating sf objects like data frames allows leveraging of powerful dplyr verbs. We filtered spatial features by attribute (e.g., extracting municipalities above a population threshold) and by spatial relationship (using intersections for selection). We performed group-wise summarization to aggregate smaller units into larger regions, during which sf automatically handled the geometric union of features in each group – simplifying the creation of dissolved boundaries for aggregated data.
- Raster data handling with stars: We introduced the stars package for raster and multi-dimensional data. Reading rasters with read_stars() and plotting them gave us a way to incorporate continuous spatial data (like satellite imagery or climate surfaces) into analysis. We discussed basic raster operations and how stars and sf can be combined, for example extracting raster values at vector point locations or converting raster data to vector features (and vice versa). This opens the door to rich spatial analyses, such as overlaying population rasters with administrative boundaries to calculate population statistics per region.
- Coordinate transformations: We emphasized the need for and ease of reprojecting data with st_transform(). Whether to prepare data for distance calculations or to match the projection of a map background, transforming CRSs ensures apples-to-apples comparisons in space. The examples illustrated transforming lat/long data to a local projected CRS for accurate area measurements. We cited how sf leverages the PROJ library for these transformations, ensuring high accuracy.
- Visualization techniques: Creating maps is often the end goal of spatial analysis. We saw how base plotting can give a quick view of data, but for polished outputs ggplot2 with geom_sf is extremely powerful. With ggplot we made layered maps, combining polygons and points with custom colors and legends. The integration of sf with ggplot2 means we can directly map attributes to aesthetics (fill, color, size) and produce thematic maps with minimal effort. We also touched on adding titles, scales, and choosing appropriate color palettes – crucial for making the map interpretable and visually appealing.
- Interactive mapping: Finally, we explored making interactive maps using leaflet. This capability transforms static results into engaging tools where users can navigate the data spatially. By adding tiles and data layers to a leaflet widget, we created an interactive map of regions and sample plots. Features like popups demonstrated how additional information can be revealed on demand, making the map a richer experience. Interactive maps are invaluable for presentations and web dashboards, as they invite exploration and can help convey complex spatial stories (for instance, seeing how different layers overlap by toggling them).
Throughout the chapter, we have reinforced that geospatial analysis in R is not an isolated specialty – it connects with general data manipulation and visualization workflows. You can join tables, aggregate data, and pipe through transformations in a manner consistent with non-spatial analyses, thanks to the design of packages like sf. Moreover, critical computational geometry tasks are handled by robust libraries (GDAL, GEOS, PROJ), giving you reliable results without needing GIS-specific software.
With these advanced techniques, you can confidently tackle a wide range of spatial data tasks. For example, you could: merge disparate spatial datasets (such as linking health data points to census tract polygons), transform coordinates to analyze distances (such as computing how far each school is from the nearest earthquake fault line), rasterize vector data or vice versa (for instance, converting land use polygons into a raster grid for modeling), and visualize the outputs in compelling ways (static reports or interactive web apps).
As you apply these skills, keep in mind best practices like checking data for errors (invalid geometries can be fixed with st_make_valid if needed), using appropriate projections for the task (distortion matters!), and documenting your workflow (perhaps using R Markdown to integrate code, analysis, and maps in one reproducible document). Geospatial analysis can be complex, but the structured approach presented in this chapter should provide a strong foundation. With practice, you'll be able to extend these techniques – for example, into spatial modeling, point pattern analysis, network analysis, and more advanced topics covered in later chapters.
Geospatial data manipulation in R empowers us to add the “where” dimension to our analyses. By mastering vector and raster handling, coordinate systems, and mapping techniques, we unlock new insights that purely non-spatial analysis might miss. The techniques learned here will enable you to build maps that inform policy, identify patterns like clusters or hotspots, allocate resources efficiently by location, and generally make better decisions grounded in spatial context. These are highly valuable capabilities in our increasingly location-aware world. Happy mapping, and may your future analyses always find the geo in data!