3 Advanced Techniques for Geospatial Data Manipulation
Geospatial data analysis has become a cornerstone of decision-making in diverse domains. From urban planning and environmental management to public health, agriculture, and disaster response, understanding where things happen is as important as what happens. In fact, it is often cited that roughly 80% of the world’s data includes some kind of spatial aspect, underscoring the ubiquitous importance of location in modern data. The proliferation of GPS-equipped devices, remote sensing satellites, and location-based services means that spatial data are being generated at unprecedented scales. The geospatial analytics industry is rapidly growing – by one estimate it will nearly double in size between 2021 and 2026 – and its applications are expanding across both commercial and public sectors.
In tandem with this data boom, tools for spatial analysis have advanced significantly. The R programming language, long favored for data analysis, now boasts a rich ecosystem for geospatial work. In particular, the advent of packages like sf (simple features) and stars has revolutionized how spatial data are handled in R. The sf package implements the Open Geospatial Consortium (OGC) Simple Features standard for vector data, providing a modern, user-friendly alternative to earlier R spatial libraries (e.g. sp for vectors and raster for rasters). By treating spatial objects (points, lines, polygons) as first-class data frame elements, sf makes manipulating geospatial data as intuitive as manipulating tables. Likewise, the stars package extends R's capabilities to raster and spatiotemporal data "cubes," enabling the handling of multi-dimensional environmental data (e.g. satellite images, climate model outputs) within the R ecosystem.
This chapter provides a comprehensive introduction to advanced techniques for geospatial data manipulation using R. We will cover the essential workflows for reading, exploring, transforming, and visualizing spatial data. Emphasis is placed on using the sf package for vector data and stars for raster data, while leveraging tidyverse tools for efficient data handling. Through examples, we will illustrate how to create both static and interactive maps to effectively communicate spatial insights. By the end of the chapter, you should be equipped with the fundamental skills to tackle complex spatial analysis tasks – from cleaning and subsetting spatial datasets, to performing geographic transformations and producing publication-quality maps.
3.1 Objectives
The primary objectives of this chapter are:
- Introduce essential geospatial packages in R: We will become familiar with R's core spatial libraries, particularly sf for vector data and stars for raster and spatiotemporal data. Understanding these packages' capabilities and data structures lays the groundwork for advanced spatial analysis.
- Demonstrate basic spatial data handling: Using real-world examples, we will walk through loading spatial datasets, examining their properties, performing data cleaning and transformations (both attribute and geometric operations), and combining spatial data with traditional data frames.
- Explore visualization techniques: We will create maps as a key analytical and communication tool. This includes basic plotting of spatial objects, thematic mapping with ggplot2, and building interactive maps with leaflet. By learning to visualize spatial data effectively, you can better interpret patterns and convey findings to stakeholders.
These objectives align with building a robust skill set in geospatial data manipulation. Mastery of these techniques will enable you to confidently integrate spatial thinking into data analyses and support informed decision-making across applications.
3.2 Working with Vector Data
Vector data represents geographic features as discrete geometries (points, lines, and polygons) with associated attributes. Common examples include city locations as points, roads or rivers as lines, and administrative boundaries as polygons. In R, the sf package is the de facto standard for working with vector data, providing a powerful and consistent framework. In this section, we cover the fundamental tasks of reading vector data, examining its structure, basic visualization, and subsetting/filtering spatial features.
Reading Vector Data
Spatial vector datasets are often stored in specialized file formats or databases. One of the most prevalent formats is the ESRI Shapefile, which actually consists of multiple files (.shp, .dbf, .shx, etc.) representing geometry and attributes. Other popular formats include GeoJSON (for web-compatible data exchange), KML (Google Earth format), GeoPackage (GPKG), and spatial databases like PostGIS. The good news is that sf can read and write a wide array of these formats thanks to its use of the GDAL library. This means that a single function, st_read(), can ingest virtually any spatial data source – whether it's a shapefile on your disk, a GeoJSON file from a web URL, or a table in a spatial database – and load it as an R spatial object.
R Example – Reading a Shapefile: Suppose we have a shapefile of municipal regions named mrc.shp stored in a local “data” directory. We can read it into R as follows:
library(sf)
mrc <- st_read("data/mrc.shp")
This command will import the vector data into an sf object called mrc. Under the hood, st_read uses GDAL to detect the file format and parse the data. If the dataset contains multiple layers (as in a GeoPackage or a database), you can specify the layer name or use st_layers() to list available layers. In many cases, st_read will print a brief summary of what was read – for example, the number of features, number of attributes, geometry type (point/line/polygon), and the Coordinate Reference System (CRS). The sf object mrc is essentially a data frame (specifically, a tibble) with an attached geometry column, and it can be treated much like a regular data frame for non-spatial operations.
Supported Data Formats: The flexibility of st_read cannot be overstated. Thanks to GDAL support, sf handles formats including shapefiles, GeoJSON, KML, GeoPackage, and more. For example, to read a GeoJSON file or URL, you would simply provide the .geojson path to st_read. Similarly, connecting to a PostGIS database table can be done via a database connection (using packages like DBI and RPostgres) and passing a query or table name to st_read. This uniform interface abstracts away format-specific complexities, allowing you to work with spatial data from various sources without worrying about conversion – a major advantage in real-world workflows where data comes in many forms.
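To make that flexibility concrete, here is a minimal sketch of reading from a few other sources; the file names, layer name, and database connection are hypothetical:
# GeoJSON file (a URL ending in .geojson works the same way)
parks <- st_read("data/parks.geojson")
# GeoPackage: list the available layers, then read one by name
st_layers("data/regions.gpkg")
roads <- st_read("data/regions.gpkg", layer = "roads")
# PostGIS (with DBI + RPostgres): pass a connection and a query
# con    <- DBI::dbConnect(RPostgres::Postgres(), dbname = "gisdb")
# rivers <- st_read(con, query = "SELECT * FROM rivers")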
Note: When reading spatial data, it's good practice to handle possible errors or warnings. If st_read reports issues (e.g., unreadable geometries or a missing CRS), you may need to ensure the data source is not corrupted and that auxiliary files (like .prj for shapefiles, which stores projection info) are present. Additionally, leaving quiet = FALSE (the default) makes st_read print its summary of what was read, which helps when diagnosing problems. For large files, reading might take time; sf will attempt to use efficient methods, but be mindful of memory limits.
Examining Spatial Data Structure
After loading a spatial dataset, it’s important to examine its structure and metadata to understand what we’re working with. Key things to inspect include the data’s attributes, its geometry type, the bounding box (spatial extent), and the coordinate reference system (CRS).
Basic data frame functions like head(mrc) or summary(mrc) will show attribute columns (just as for a regular data frame) and some high-level info. For spatial specifics, sf provides utility functions:
- Geometry type and feature count: Simply printing the sf object (mrc) will report how many features (rows) it contains and the geometry type (e.g., "MULTIPOLYGON" or "POINT"). It will also list the CRS and bounding box.
- Bounding Box: The bounding box is the minimal rectangle that contains all features. Use st_bbox(mrc) to retrieve it. This returns the xmin, ymin, xmax, ymax coordinates of the extent. For example, st_bbox(mrc) might output something like xmin: -73.9, ymin: 45.3, xmax: -72.5, ymax: 46.1 (coordinates in whatever CRS the data is in). Knowing the extent is useful for sanity checks (does it cover the area you expect?) and for setting up map viewports.
- Coordinate Reference System: The CRS defines how the coordinates in the data map to real locations on Earth. Use st_crs(mrc) to get the CRS of the dataset. This will typically return an EPSG code or PROJ4 string. For instance, a common CRS for global data is WGS 84 longitude/latitude (EPSG:4326), while regional data might use a local projection (the example dataset mrc.shp might use the NAD83 / Quebec Lambert projection, EPSG:6622). Confirming the CRS is crucial because if you overlay or compare two spatial datasets, they must be in the same CRS or explicitly transformed to a common one. A CRS description from st_crs will include the coordinate system type (geographic or projected), datum, and projection parameters if applicable (see the short example after this list).
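Putting these checks together, a quick inspection right after loading might look like this (the exact output depends on your data):
print(mrc)             # feature count, geometry type, CRS, bounding box
st_bbox(mrc)           # spatial extent: xmin, ymin, xmax, ymax
st_crs(mrc)            # coordinate reference system details
st_geometry_type(mrc)  # geometry type of each feature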
Why CRS matters: A coordinate reference system defines how a two-dimensional map corresponds to real places on Earth. Geographic CRSs (like WGS 84) use latitude/longitude on an ellipsoidal model of Earth, while projected CRSs (like UTM or state plane systems) flatten the earth for use on a plane, with coordinates in linear units (meters, feet, etc.). Using the wrong or inconsistent CRS can lead to misaligned layers (e.g., your points might not fall inside the polygons they should) and measurement errors. Always check the CRS for each dataset and ensure you know what it is. If st_crs() returns NA, it means the data is missing CRS metadata – you may need to set it manually with st_set_crs() if you know the correct one.
Basic Visualization of Vector Data
One immediate way to understand spatial data is to plot it. Visualization can reveal spatial patterns, highlight potential errors (such as outlier points in the wrong place), and give a sense of the geographic context.
The sf package provides a simple plot method for sf objects. If you call plot(mrc), by default it will produce a multi-panel plot of all attributes (for numeric or factor attributes) along with the geometries. However, for a quick look at just the shapes, it's common to plot only the geometry. You can do this by either subsetting just the geometry column or using the special function st_geometry().
R Example – Quick Plot:
# Plot just the geometry of the municipal regions
plot(st_geometry(mrc), col = NA, border = 'blue', axes = TRUE)
This will draw the outlines of the polygons in blue with no fill color (col = NA means transparent fill) and include axes with coordinate ticks (axes = TRUE). We could also do plot(mrc["geometry"]), which is equivalent in this case. The result is a basic map of the features. If mrc contains multiple polygon features (regions), you'll see their boundaries drawn.
This kind of base R plot is great for an exploratory view. It’s fast and useful for verifying that the data loaded correctly (for instance, the shapes appear in the expected location). You might use this to notice if there’s a stray polygon far off (indicating maybe an erroneous coordinate), or just to familiarize yourself with the region’s shape.
For more refined visual analysis, one might use interactive viewing tools. The mapview package (not covered in detail here) can display sf objects in an interactive viewer with minimal code (e.g., mapview(mrc)), which is very handy during data exploration. We'll cover interactive maps later with leaflet, but mapview is a quick one-liner alternative for exploring data.
Data Selection and Filtering
Just as with non-spatial data frames, we often need to subset spatial data to focus on relevant features. With sf objects, we can perform attribute-based filtering using typical R or dplyr syntax, and we can also leverage spatial relationships to filter data (e.g., selecting all points within a certain area).
Attribute filtering: You can index an sf object like a data frame. For example, suppose mrc has an attribute column pop2016 for population in 2016 and mrc_name for the region name. We can filter regions by population:
# Select regions with population > 200,000, keeping only name and population columns
mrc_subset <- mrc[mrc$pop2016 > 200000, c("mrc_name", "pop2016")]
This uses base R indexing: the row condition mrc$pop2016 > 200000 selects only those rows (features) meeting the criterion, and the column index c("mrc_name", "pop2016") keeps only those two attributes (plus, implicitly, the geometry column – which is always kept unless dropped explicitly). The result mrc_subset is a new sf object containing, say, the subset of municipal regions that have over 200k population, with just their name and population fields. Because sf extends the data frame class, subsetting preserves spatial attributes by default (the geometry column stays with the data as long as drop = FALSE, which is the default in the bracket operator for sf objects).
After such filtering, you might plot mrc_subset to see where these high-population regions are, or do further analysis on them.
Spatial filtering: A powerful feature of sf is the ability to filter based on spatial relationships. The sf package supports spatial predicates – functions like st_intersects, st_contains, st_within, etc., that compare geometries. Moreover, you can use the bracket notation directly with an sf object as the filter to select by intersection. For example, if we had a layer of points (say plots_sf, representing sample plot locations) and we wanted to select only the municipalities that contain at least one sample plot, we could do:
# Assuming plots_sf is an sf of points with the same CRS as mrc
selected_mrc <- mrc[plots_sf, ]
Here, using mrc[sf_object, ] triggers a spatial selection: it will return only those features of mrc that spatially intersect with any feature in plots_sf. By default, the predicate is intersection; you can specify a different one with the op argument (e.g., mrc[plots_sf, , op = st_contains] to select polygons that fully contain any point). This spatial subsetting syntax is very succinct for common tasks like clipping points by a boundary or finding polygons that overlap with others.
Additionally, the dplyr interface provides a filter() method for sf objects that works similarly – you can mix attribute conditions and even call spatial predicate functions inside filter(), as sketched below. But often using base subsetting or explicit spatial joins (next topic) is clearer for complex operations.
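As a rough illustration of that dplyr style (this assumes plots_sf shares the CRS of mrc and that the geometry column is named geometry):
library(dplyr)
big_with_plots <- mrc %>%
  filter(pop2016 > 200000,                                # attribute condition
         lengths(st_intersects(geometry, plots_sf)) > 0)  # spatial condition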
Whether by attribute or geometry, subsetting helps focus the analysis. It’s analogous to making SQL queries on your spatial data: you can select “WHERE population > X” or “WHERE geometry intersects some area.” Both are essential tools.
3.3 Integration with Data Manipulation Tools
One of the strengths of the sf package is its seamless integration with the tidyverse data manipulation verbs. Since an sf object is essentially a data frame with geometry, you can directly use dplyr functions like filter(), select(), mutate(), group_by(), and summarize() on it. This allows powerful combinations of attribute and spatial operations in a clear pipeline style.
Let's explore a common task: aggregating or summarizing spatial data by a region attribute. Suppose our mrc polygons belong to larger regions (for instance, say mrc has a column reg_name indicating a broader region each municipality falls into). If we want to get total population per region, we can use dplyr:
library(dplyr)
regions <- mrc %>%
  group_by(reg_name) %>%
  summarize(pop2016 = sum(pop2016))
This pipeline does what it would for a normal data frame: group the data by reg_name and sum the pop2016 values within each group. The result, regions, will have one row per unique reg_name with the summed population. But what about the geometry? By default, summarizing an sf object will also perform a spatial union of the geometries in each group. In other words, the multiple municipal polygons that shared the same reg_name are dissolved into a single combined polygon for that region. The sf package uses the GEOS library to carry out this union operation behind the scenes, ensuring that we get a proper merged geometry for each group. This behavior is controlled by the do_union argument of the summarize() method for sf (which is TRUE by default). The benefit is that the resulting regions object remains an sf with correct geometries representing, say, larger administrative regions. Had we not unioned the geometries, we would end up with a multipolygon object containing disjoint pieces (or an invalid geometry with overlapping boundaries). The automatic union usually aligns with what an analyst expects when grouping spatial features.
It's worth noting that if for some reason you wanted to keep the individual geometries and not dissolve them, you could use summarize(..., do_union = FALSE), but then you might get a geometry list-column or invalid overlaps – typically not desired unless you have a specific reason.
Beyond summarization, other dplyr verbs work naturally:
- mutate() can add or transform attributes. For example, mrc <- mrc %>% mutate(pop_density = pop2016 / area_km2) (assuming there's an area_km2 attribute or we computed area via st_area()).
- select() chooses a subset of columns but, by default, it keeps the geometry column even if you don't list it (so you don't accidentally drop the spatial component). If you do want to drop geometry and get a plain data frame, you can explicitly do st_drop_geometry(mrc) or convert to a data frame first.
- arrange() will sort the rows; it doesn't affect geometries except reordering them.
- Joins: You can do attribute joins between an sf object and a regular tibble using left_join(), inner_join(), etc., as long as you join by attribute keys (e.g., attaching a demographic table to a spatial layer by region code). The join will simply carry over the new columns into the sf object while retaining the geometry. This is extremely useful for enriching spatial data with additional attributes (for example, joining census data to a spatial boundary file by an ID field).
As an example, if we had a data frame df with columns mrc_name and some statistic, and we wanted to attach that to mrc, we'd do mrc <- left_join(mrc, df, by = "mrc_name"). The result is still an sf object but now with extra attributes from df.
Finally, it's important to mention spatial joins as a form of integration between datasets. A spatial join attaches attributes from one spatial layer to another based on location (e.g., assign each point the attributes of the polygon it falls in). The sf package has a function st_join(x, y, join = st_intersects) for this purpose. For instance, if we have a point layer plots_sf and a polygon layer mrc, st_join(plots_sf, mrc) will produce a new set of points that carry the attributes of the municipality each point lies in. By default it uses intersection (point-in-polygon), but you can specify other relations or a distance buffer. We will discuss some of these spatial operations more in the next section. Conceptually, though, st_join is analogous to a database join but where the key is a spatial relationship rather than an exact matching ID.
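A minimal sketch of such a spatial join, assuming plots_sf has already been transformed to the CRS of mrc:
# Each plot point inherits the attributes of the municipality it falls in;
# points outside every polygon get NA for the joined columns
plots_with_region <- st_join(plots_sf, mrc, join = st_intersects)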
sf objects are fully compatible with the tidyverse workflow, allowing you to filter, aggregate, and join data with the same code you'd use for non-spatial tables. This integration accelerates spatial data cleaning and preparation tasks. Moreover, under the hood sf leverages high-performance libraries for geometry operations (it uses GDAL for I/O, GEOS for geometric ops, and PROJ for transformations), so many of these operations are efficient even on large datasets. Embracing these tools means we can write clear, concise code to manipulate geospatial data and trust that the heavy lifting (geometry crunching) is handled by battle-tested libraries.
3.4 Creating Spatial Data from Data Frames
Not all data comes pre-packaged in a shapefile or spatial format. Often, you may have ordinary data frames or CSV files that contain location information (e.g., latitude and longitude columns, or address fields). R’s spatial toolkit allows you to convert such tabular data into spatial objects, enabling mapping and spatial analysis on them.
A common scenario is a CSV file with coordinates. For example, suppose plots.csv contains ecological survey plots with columns for longitude and latitude, along with other attributes like vegetation type. We can turn this into an sf object with the function st_as_sf():
R Example – Data Frame to sf Point:
plots <- read.csv("data/plots.csv")  # assume it has columns "long" and "lat"
plots_sf <- st_as_sf(plots, coords = c("long", "lat"), crs = 4326)
In this snippet, coords = c("long", "lat") tells st_as_sf which columns contain the X and Y coordinates (note: the order is X then Y, so long = X, lat = Y). We also specify a CRS; here we assume these are longitude/latitude in WGS 84 (EPSG:4326), which is a common scenario for GPS coordinates. If we know the points were collected in the same CRS as another layer (say the mrc data uses a specific projection and the CSV coordinates are actually in that same projection system), we could use crs = st_crs(mrc) to assign the same CRS as mrc. In our example, we used 4326 for global lat/long.
After this conversion, plots_sf becomes an sf object of point geometry. Each row of plots now has a corresponding point (with coordinates from the given columns) stored in plots_sf$geometry. All the other columns (cover_type, plot_id, etc.) remain as attributes in plots_sf. Now that it's spatial, we can do things like plot these points on a map, or perform spatial joins/operations with other spatial data.
A couple of considerations when creating spatial data from scratch:
- CRS correctness: The st_as_sf() function will not transform coordinates; it only labels them. So it's crucial to specify the correct CRS that the coordinates are in. If you set the wrong CRS, your points will appear in the wrong location when mapped. For instance, if the CSV lat/long are WGS84 and you mistakenly assign a local projection, the points could be interpreted entirely incorrectly. Always confirm what coordinate system your source data uses. If the data came from a GPS or an online source, WGS84 (EPSG:4326) is likely. If it's from a local survey, it might be a local projection (you might be given a PROJ4 string or EPSG code in metadata).
- Other coordinate data: Sometimes coordinates are given in separate fields such as UTM northing/easting, or as addresses. For addresses, geocoding (using an API or package) would be needed to get coordinates. For projected coordinates like UTM, you would still use st_as_sf() but you need the appropriate CRS (for UTM, specify the correct zone's EPSG code, e.g., 32633 for UTM zone 33N WGS84).
- Data validation: After creating plots_sf, it's wise to do a quick plot or check a few coordinates to ensure they line up with expectations (e.g., plot them on top of a known map or see if coordinates fall within expected min/max ranges).
By converting raw data frames into spatial objects, we unlock all the spatial operations for those data. For example, we could count how many plots fall in each municipality by doing a spatial join between plots_sf (points) and mrc (polygons), as sketched below. We could measure distances between plot locations, or buffer them to analyze neighboring areas (topics we will touch on soon). Essentially, any dataset with location info can become part of our geospatial analysis workflow after this conversion step.
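A minimal sketch of that count, using the objects created above (the points are reprojected to match the polygons first):
library(dplyr)
plots_in_mrc <- plots_sf %>%
  st_transform(st_crs(mrc)) %>%   # match the polygons' CRS
  st_join(mrc["mrc_name"])        # attach the municipality name to each point
count(st_drop_geometry(plots_in_mrc), mrc_name, sort = TRUE)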
3.5 Working with Raster Data
While vector data deals with discrete features, raster data represents continuous spatial information and imagery using grids of cells (pixels). Common examples of raster data include satellite images, elevation models, land cover classification grids, and climate data layers. In R, the new-generation package stars is designed for handling raster and multi-dimensional array data. (The older raster package and its successor terra are also widely used; terra is very powerful for raster analyses, but here we'll focus on stars as it was mentioned in our objectives.)
Raster data is essentially a matrix (or a stack of matrices for multi-band data) where each cell has a value. This could be a color (in remote sensing imagery), a temperature, an elevation, etc. stars treats rasters as spatiotemporal arrays, which can have dimensions like x, y, time, and band. One strength of stars is that it can handle multi-band images and time series data (data cubes) in a tidy way, and it interoperates with sf for combined vector/raster operations.
Reading and Exploring Raster Data
Reading a raster file in R with stars is straightforward using read_stars(). For example:
library(stars)
# Read a GeoTIFF file (e.g., a satellite image)
sat_image <- read_stars("data/landsat_scene.tif")
print(sat_image)
If landsat_scene.tif has multiple bands (say, Red, Green, Blue, Infrared), read_stars will load them all, and the print output will tell us the dimensions. A typical print might show something like "stars object with 2 dimensions and 4 attributes" or "dimensions: x, y, band; resolution …; CRS: …". stars uses GDAL under the hood for reading common raster formats (GeoTIFF, NetCDF, etc.), so it supports many formats, similar to sf.
Just as we checked vector metadata, for rasters we should check: the dimension sizes (number of rows, columns, layers), the resolution (cell size), the extent (which st_bbox(sat_image) would give), and the CRS (st_crs(sat_image)). These are accessible through st_dimensions(sat_image), which lists each dimension's values or range. For example, st_dimensions(sat_image)$x might show the sequence of x coordinates for cell centers, etc.
If the raster has a defined CRS (most GeoTIFFs do), read_stars will attach it. If not, you might need to set it with st_set_crs(), similarly to vectors.
Plotting rasters: The stars object can be plotted directly. If it's a single-band raster, plot(sat_image) will display it (with a legend of values). If it's multi-band, plot(sat_image) might plot each band unless it recognizes a 3-band set as RGB. You can also specify plot(sat_image, rgb = c(3,2,1)) to plot a color image using bands 3, 2, 1 for RGB (for instance). Plotting is useful to quickly visualize the raster and check for any obvious issues (like whether the data aligns with known features, or if there are missing data areas).
Raster operations basics: Once loaded, raster data can be manipulated much like any array. For example, you can do arithmetic on the object: sat_image * 2 would multiply all cell values by 2 (if it's numeric data). With multiple bands, you can index the band dimension to select a subset or do calculations across bands. stars provides functionality such as st_apply() to apply functions across dimensions (like calculating a mean across time or bands).
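For example, a per-pixel summary across the band dimension might be sketched as:
# Apply a function at each (x, y) cell, collapsing the remaining "band" dimension
band_mean <- st_apply(sat_image, c("x", "y"), mean)
band_max  <- st_apply(sat_image, c("x", "y"), max)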
However, many advanced raster analyses (like filtering, zonal stats, etc.) are beyond our scope in this intro. We will focus on a few key tasks:
Cropping / masking: You can crop a raster to a region of interest or mask it by a polygon. For example, to crop to our mrc extent: sat_image_cropped <- sat_image[st_bbox(mrc)] (stars allows using an sf object or a bbox to index the raster by spatial extent). For masking by a polygon (setting values outside the polygon to NA), one might use st_crop() followed by manual NA assignment, or use terra::mask if mixing with terra.
Raster-vector integration: A common need is to extract raster values at point locations or summarize raster values within polygons. The st_extract() function in stars can retrieve raster values for given point geometries. E.g., st_extract(sat_image, plots_sf) would give the pixel values for each point in plots_sf. To get a summary within polygons (like the average value per polygon), you could use aggregate() (the stars method): e.g., aggregate(sat_image, by = mrc, FUN = mean), which will compute the mean of raster cells within each mrc polygon. Under the hood, stars uses GDAL routines for some of these tasks and can convert between raster and vector (e.g., st_as_sf on a stars object can polygonize raster regions, though for large rasters this can be heavy). A short sketch of these two operations follows the thresholding example below.
Example – Thresholding a raster: To illustrate a simple raster operation, imagine our satellite image has a band for a vegetation index and we want to find areas above a certain threshold. We could do:
veg_index <- sat_image[ , , "NDVI"]  # assuming one band named "NDVI"
high_veg <- veg_index > 0.8          # this yields a logical stars object
high_veg <- st_as_stars(high_veg)    # convert logical to 0/1 raster (TRUE/FALSE)
high_veg_polygons <- st_as_sf(high_veg, as_points = FALSE, merge = TRUE)
In this pseudo-code, high_veg would be a raster mask of TRUE/FALSE where NDVI > 0.8. Converting that stars object to sf polygons (with merge = TRUE to combine adjacent TRUE cells) would yield polygon features outlining high vegetation areas. The example shows how one can go from raster analysis to vector results, which is often done in land cover mapping, etc. (This process relies on GDAL's polygonize under the hood.)
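And a hedged sketch of the raster–vector operations mentioned above, with the vector layers first brought into the raster's CRS:
pts  <- st_transform(plots_sf, st_crs(sat_image))
poly <- st_transform(mrc, st_crs(sat_image))
plot_vals <- st_extract(sat_image, pts)                   # pixel values at each point
mrc_means <- aggregate(sat_image, by = poly, FUN = mean)  # mean cell value per polygon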
Handling large rasters: Raster datasets can be very large (think hundreds of millions of pixels). stars has the concept of lazy loading and proxy objects. If a file is big, read_stars() may create a stars_proxy object that doesn't immediately load all data into memory, but rather waits until you actually need it (for plotting or computation), possibly downsampling if appropriate. This allows working with large rasters without running out of memory, as long as you don't force it to read everything. If you encounter performance issues, consider reading only what you need (spatially or by band). Alternatively, the terra package is optimized in C++ for many raster operations and can handle large data and on-disk processing efficiently. It's fine to use a mix of tools: for instance, you might use terra for heavy raster crunching and then convert the result to stars or sf for integration with vector data.
stars extends our geospatial toolkit to continuous data. The ability to bring in a raster layer (e.g., a digital elevation model or a climate surface) and combine it with vector data (e.g., administrative boundaries, sample locations) greatly enriches the analysis. Many spatial analyses involve both vector and raster data – for example, extracting elevation at point locations, or computing the population within areas of high flood risk (overlaying population rasters with flood extent polygons). With sf and stars together, R provides a unified environment to conduct such multi-dimensional analyses.
3.6 Coordinate Reference Systems (CRS)
As introduced earlier, coordinate reference systems are the backbone that ties spatial data to real-world locations. A CRS includes a coordinate system (how coordinates are measured) and a geodetic datum (a model of Earth’s shape). Working with CRSs correctly is vital for accurate spatial analysis. This section delves a bit deeper into CRSs and demonstrates how to reproject data in R.
A key point to remember: All spatial operations and comparisons assume the coordinates are in a single consistent CRS. If not, you must transform one dataset to the CRS of the other (or both to a common CRS) before analysis. R will generally not stop you from overlaying data with different CRSs, but the results will be nonsense if the CRSs differ.
Understanding and Inspecting CRS
When you load a spatial dataset, always note its CRS (using st_crs). CRSs can be geographic (lat-long) or projected.
- Geographic CRS (unprojected): This uses latitude and longitude on an ellipsoid. Coordinates are in degrees. Example: WGS 84 (EPSG:4326) – the most common global CRS, used by GPS and many datasets. In a geographic CRS, a degree of longitude represents a different distance on the ground depending on latitude (roughly 111 km at equator, 0 at poles), so distance/area calculations in degrees are not uniform.
- Projected CRS: A mathematical projection of Earth onto a flat surface, using linear units (meters, feet). There are hundreds of projections, each suited for different regions or purposes (preserving areas, shapes, distances, etc.). Examples: Web Mercator (EPSG:3857) is used in many web maps (like Google or OSM tiles) – it’s a Mercator projection in meters; Universal Transverse Mercator (UTM) which divides the world into zones for detailed mapping; national grid systems like NAD83 / State Plane or the Lambert Conformal Conic used for Canadian provinces. Each projection introduces some distortion (no flat map can represent the 3D earth perfectly), but is often chosen to minimize distortion in the area of interest.
It’s important to choose an appropriate projection for your analysis needs. If you’re measuring areas or distances, an equal-area or equidistant projection might be preferred. If you just need a pleasing visual map for a local area, a conformal projection (preserving shape) might be used. Fortunately, reprojecting data in R is easy, which encourages experimenting or switching CRS as needed.
Reprojecting Data with st_transform
To convert spatial data from one CRS to another, we use the function st_transform(x, crs = <target CRS>). The crs argument can be specified by an EPSG code, a PROJ4 string, or an st_crs() object copied from another layer.
R Example – Reprojecting: Continuing with our mrc example, suppose it was originally in a geographic CRS (lat/long). For certain analyses (say we want to calculate land area or distances between centroids), it's better to have a projected CRS in meters. We could do:
mrc_proj <- st_transform(mrc, crs = 6622)
Here EPSG:6622 (NAD83(CSRS) / Quebec Lambert) is a projected CRS suitable for the Quebec area (with meters as units). After this, mrc_proj coordinates are in a planar system. We could verify by checking st_crs(mrc_proj), which should show EPSG:6622 and units in meters, and st_bbox(mrc_proj), which would now give coordinates in perhaps hundreds of thousands of meters (e.g., easting/northing values) rather than degrees.
If you don't know an appropriate EPSG code offhand, you can often find one by searching (sf::st_crs("WGS84") gives 4326; st_crs(3857) gives Web Mercator; for local systems, sometimes the data provider will specify the EPSG or you might search on spatialreference.org).
st_transform will utilize the PROJ library to accurately convert coordinates. This includes datum transformations if needed (for example, converting from the NAD27 datum to WGS84 will apply a shift – PROJ handles that internally if it has the grid parameters). If R outputs messages about missing datum shift files, you might need to install those if precision is required, but often small datum differences are negligible for many uses.
Practical tip: Always ensure the data's CRS is correctly set before transforming. If st_crs(mrc) was NA (unknown), st_transform won't know how to interpret the numbers and will likely fail or give wrong results. You would first do st_set_crs(mrc, <EPSG or proj string>) to assign the correct original CRS (without changing the data, just labeling it), and then transform, as in the sketch below.
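A minimal sketch of that two-step pattern (the EPSG codes are illustrative):
# The file arrived without CRS metadata, but we know it is NAD83(CSRS) / Quebec Lambert
mrc <- st_set_crs(mrc, 6622)          # label only; coordinate values are unchanged
mrc_wgs84 <- st_transform(mrc, 4326)  # now an actual reprojection is safe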
When to reproject: You'll reproject data whenever layers need to align. For instance, if you have points in WGS84 but polygons in a local projection, decide on one CRS and transform the other. Also, if you plan to compute lengths or areas, using an appropriate projection (especially an equal-area projection for area) is recommended. Measuring area in degrees is meaningless; a degree of longitude does not correspond to a fixed ground distance. Instead, transform to a CRS where a unit is a real length (a meter) in your region, then use st_area or st_length.
One example: if calculating the area of each mrc polygon, do st_area(mrc_proj) after projecting to a suitable meter-based CRS. If you used st_area(mrc) while it was in lat/long, sf (by default, as of recent versions) might actually use spherical geometry to approximate the area on the ellipsoid, which can yield results in square meters even from degrees (thanks to the s2 geometry engine integration). This can be handy, but it's sometimes confusing. Many prefer to explicitly project to a planar CRS and then calculate area.
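A short sketch of the two approaches (assuming mrc is in lat/long with s2 enabled, and mrc_proj in the meter-based projection):
area_sphere <- st_area(mrc)        # spherical area computed via s2, returned in m^2
area_planar <- st_area(mrc_proj)   # planar area in the projected CRS, in m^2
area_km2 <- as.numeric(area_planar) / 1e6   # drop units and convert m^2 to km^2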
Multiple CRSs in a project: It's common to switch CRS depending on the task. You might keep data in WGS84 for ease of sharing and because it's the common denominator, but project to local coordinate systems for analysis or mapping. For example, you could have your raw data in 4326, then create a transformed version in a local projection for distance calculations, and perhaps another transformed to Web Mercator if you want to overlay on a Google map in a web context. The key is to keep track of which object is in which CRS (naming like mrc_ll vs. mrc_proj can help). The st_crs metadata will guide you, so always check it if unsure.
Coordinate reference systems ensure spatial data from different sources align correctly on the Earth's surface. Proper use of CRSs allows us to integrate maps seamlessly and perform accurate spatial analysis (for instance, when combining layers, a CRS acts like a common language of location). With R's st_transform function, we can easily convert CRSs, thus overcoming one of the historically tricky parts of GIS work. This removes a major barrier to combining data globally and lets us focus on analysis, confident that "a meter means a meter" and "a degree means a degree" as appropriate in our computations.
3.7 Customizing Maps with Visualization Libraries
So far we have discussed plotting for exploratory purposes. When it comes to presenting results or creating publication-quality maps, we often need more control over the appearance. This is where R's powerful ggplot2 package comes in. Thanks to the integration of sf with ggplot2, making complex maps with legends, scales, and multiple layers is straightforward.
Additionally, specialized mapping packages like tmap (thematic maps) exist, but here we will illustrate with ggplot2 since it’s widely used and very flexible.
Using ggplot2 for Static Maps
Recall that sf objects can be used directly in ggplot. ggplot2 has a geom specifically for spatial data: geom_sf(). This geom will automatically handle the projection (using the coordinate system of the data) and can plot points, lines, or polygons appropriately based on the geometry type.
R Example – Layered Map with ggplot2:
library(ggplot2)
ggplot() +
geom_sf(data = mrc_proj, fill = "gray90", color = "white") +
geom_sf(data = plots_sf, aes(color = cover_type), size = 1.5) +
labs(title = "Sample Plots and Municipal Regions",
subtitle = "Cover types of plots overlaid on region boundaries",
color = "Cover Type") +
theme_minimal()
Let’s break down what this does:
- We start with ggplot() and add layers with +. The first layer, geom_sf(data = mrc_proj, ...), draws the municipal regions (polygons). We set fill = "gray90" to fill polygons light gray, and color = "white" to draw white borders (this gives a subtle boundary effect). If we had an attribute to color by (say population density), we could do aes(fill = pop_density) inside geom_sf to create a choropleth. Here we chose a uniform fill.
- The second layer, geom_sf(data = plots_sf, aes(color = cover_type), size = 1.5), adds point locations for the plots, using the attribute cover_type to color them. Each unique cover type (e.g., forest, grassland, etc.) will get a different color automatically. The size = 1.5 makes the points a bit larger for visibility.
- We add labs to set a title, a subtitle, and rename the legend for color to "Cover Type". By mapping color = cover_type in aes(), ggplot knows to produce a legend for that.
- We apply a theme (the minimal theme here) to give a clean look. We could further customize, e.g., adding theme(legend.position = "bottom") or theme(axis.text = element_blank(), axis.ticks = element_blank()) if we want to hide coordinate axes for a cleaner map (often maps don't need lat-long labels if it's not important).
The output is a layered map: a base of polygons and points on top, with a legend indicating the cover type colors. This demonstrates how ggplot’s layering system works naturally with spatial data, just like with other data types.
Coordinate systems in ggplot: By default, when using geom_sf, ggplot will use the CRS of the data for plotting (and include an appropriate coordinate grid if axes are drawn). If you want to change the map projection for display, you can use coord_sf(crs = <desired CRS>). For example, you might have data in a projected CRS but want to show it in a geographic (unprojected) form, or vice versa. coord_sf will project the geometries on the fly for display. However, if your data is already in a suitable projection for the region shown, it's often fine to let ggplot use it directly.
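For instance, the same projected layer could be displayed in unprojected WGS 84 with a one-line change:
ggplot() +
  geom_sf(data = mrc_proj, fill = "gray90", color = "white") +
  coord_sf(crs = 4326)  # reproject on the fly for display only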
Further customization: You have full ggplot2 capabilities, which include:
- Adjusting color scales (e.g., scale_color_brewer(palette = "Set2") to use a colorblind-friendly palette for categories).
- If using fill for polygons, scale_fill_viridis_c() might be nice for a numeric choropleth, or scale_fill_manual() for custom discrete colors.
- Adding annotations: ggplot makes it easy to add titles, captions, or even map elements like scale bars or north arrows (not built-in, but packages like ggspatial provide those).
- Faceting: You could facet maps by an attribute. For example, if plots_sf had surveys from different years, you could facet by year to compare spatial patterns over time side by side.
- Combining with non-spatial layers: Because ggplot is general, you can overlay non-map data too. For instance, adding geom_text() at specific coordinates for labels, or even using annotation_raster() to put an image under the map, etc.
An important consideration for map presentation is the coordinate aspect ratio. coord_sf by default ensures the x and y scales are in correct proportion (1 unit in x equals 1 unit in y), which prevents distortion of shapes. If you ever override that (not usually advisable for maps), you could stretch the map, but normally you keep the aspect ratio fixed to reality.
Example – Choropleth Map: As another scenario, say we want to color municipalities by population density. We could do:
mrc_proj <- mrc_proj %>% mutate(pop_density = pop2016 / area_km2)
ggplot(mrc_proj) +
geom_sf(aes(fill = pop_density)) +
scale_fill_viridis_c(trans = "log", name="People per km^2") +
theme_bw()
This single-layer plot will color each polygon by pop_density using a viridis color scale (log-transformed to handle a skewed distribution). The legend is labeled accordingly. geom_sf knows to take the data from mrc_proj and map fill to that attribute. The result is a classic choropleth map. We used theme_bw() for a simple black-and-white base theme; one could further tweak it.
Through these examples, you can see that advanced maps in R benefit from combining sf for data and ggplot2 for aesthetics. The result is high-quality visualizations that can be output to PNG/PDF for reports. They are also reproducible: the code fully specifies how the map is generated, so you can rerun it with updated data or share it with collaborators for transparency.
3.8 Interactive Maps
Static maps are great for reports and printed media, but interactive maps provide a dynamic way to explore spatial data. By panning, zooming, and clicking on features, users can glean insights that might be hard to convey in a static image. R makes it easy to create interactive web maps through packages like leaflet (an R interface to the popular Leaflet.js library) and mapview. In this section, we'll focus on leaflet for creating customizable interactive maps.
Leaflet in R works by creating a widget (an HTML map) that you can view in RStudio's viewer or in a web browser. The basic workflow is to initialize a map with leaflet() and then add layers to it (tiles, markers, polygons, etc.) with add* functions.
Using Leaflet in R
To start, ensure you have leaflet installed (install.packages("leaflet")). Then:
R Example – Basic Interactive Map:
library(leaflet)
leaflet(mrc_proj) %>%
addTiles() %>% # add default OpenStreetMap tile background
addPolygons()
In this snippet, leaflet(mrc_proj) initializes a leaflet map with the mrc_proj polygons as the default data source. The addTiles() function adds a base map (the default is OpenStreetMap tiles, which are free and widely used). addPolygons() without any arguments will add the polygons from mrc_proj (because we provided that as the data in the leaflet() call). The result is an interactive map where you can zoom and pan, and the municipal regions are overlaid on top of OSM street maps. By default, leaflet will give the polygons a blue fill and white outline with some transparency.
This minimal example shows the core idea. But leaflet allows a lot of customization:
Popups and tooltips: You can specify what happens when a user clicks or hovers on a feature. For example, addPolygons(popup = ~mrc_name) would display the region name when a polygon is clicked. The ~mrc_name syntax tells leaflet to use the mrc_name column of the data for each feature's popup. Tooltips (appearing on hover) can be added with label = ~some_column.
Theming polygons: You can style the polygons via arguments in addPolygons, such as color, fillColor, weight (border thickness), opacity, fillOpacity, etc., and these can be set dynamically using attribute values. For example, you could color by population density by providing a vector of colors or a palette function. The leaflet package has functions like colorNumeric() to generate color palettes. E.g.,
pal <- colorNumeric("YlOrRd", domain = mrc_proj$pop_density)
leaflet(mrc_proj) %>%
  addTiles() %>%
  addPolygons(fillColor = ~pal(pop_density), color = "black", weight = 1,
              popup = ~paste(mrc_name, "<br> Density:", pop_density))
This would color each polygon using a yellow-to-red palette based on density, with black borders, and a popup listing name and density.
Adding markers or circles for point data: If you have plots_sf points, you could do addCircleMarkers(data = plots_sf, radius = 3, color = ~cover_pal(cover_type), popup = ~cover_type). This would put interactive circle markers for each plot, colored by cover type (using some palette function cover_pal you define).
Layers control: Leaflet allows multiple layers and a control to toggle them. For instance, you can add multiple tile layers (addProviderTiles("CartoDB.Positron") for a different basemap) and use addLayersControl to choose between base maps or toggle overlays (like turning on/off the plots layer).
Integration with Shiny: Leaflet maps can be made part of Shiny web applications, where user interactions on the map can trigger R code (for example, selecting points, dynamic filtering). This is beyond this chapter, but worth noting as a capability.
One thing to remember is that leaflet expects data in WGS84 geographic coordinates (longitude/latitude) under the hood. If your data isn't in lat/long, leaflet will still try to display it but the results may be distorted or offset. Many spatial packages automatically transform data for leaflet if needed. It's good practice to ensure the data given to leaflet() is in EPSG:4326. In our example, mrc_proj was in a projected CRS, so technically we should transform it: leaflet(st_transform(mrc_proj, 4326)) %>% .... The example might have worked if mrc_proj had a CRS attribute that leaflet recognized and transformed, but to be safe, supply lat/long, as in the sketch below.
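Pulling several of these pieces together, a small multi-layer map could be sketched as follows; cover_pal and the layer group names are illustrative choices, not fixed leaflet names:
mrc_ll    <- st_transform(mrc_proj, 4326)   # leaflet wants lat/long
plots_ll  <- st_transform(plots_sf, 4326)
cover_pal <- colorFactor("Set2", domain = plots_ll$cover_type)
leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(data = mrc_ll, color = "black", weight = 1, fillOpacity = 0.2,
              popup = ~mrc_name, group = "Regions") %>%
  addCircleMarkers(data = plots_ll, radius = 3, color = ~cover_pal(cover_type),
                   label = ~cover_type, group = "Plots") %>%
  addLayersControl(overlayGroups = c("Regions", "Plots"))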
Interactive maps are especially useful during exploratory analysis and for stakeholder engagement. For example, if analyzing health outcomes by region, an interactive map allows a user to click on their region and see details, or to zoom into neighborhoods. The ability to overlay multiple layers and toggle them helps in understanding relationships (e.g., toggling a flood risk raster vs. population density layer).
In R, other packages like mapview provide even quicker interactive plotting (one line to view an sf object on a map). Mapview is great for quick looks, but leaflet gives finer control for making a polished interactive map. There's also tmap, which can produce static and interactive maps with the same code (switching to tmap_mode("view") for interactive), which is very convenient.
For our purposes, the main takeaway is that creating an interactive map in R is relatively simple: use leaflet(), add some base tiles, and add spatial data layers. The result can be saved as an HTML file or viewed in RStudio. Such maps are a powerful way to communicate spatial findings, as users can explore the data themselves.
3.9 Conclusion
In this chapter, we covered advanced techniques for geospatial data manipulation in R, focusing on practical skills with the sf and stars packages. We began by highlighting the importance of spatial analysis across fields and noting the explosive growth in spatial data availability. The introduction of modern spatial packages in R – notably sf for vector data (a package that "has revolutionized spatial handling in R" by adopting the standardized simple features approach) and stars for raster/array data – has made it much easier to integrate geospatial analysis into data science workflows.
Key takeaways from this chapter include:
- Reading and exploring spatial data: We learned how to load vector data from various formats with st_read() and inspect its structure (attributes, CRS, bounding box). Understanding the data's coordinate reference system is critical to avoid misalignment; a CRS defines how spatial coordinates map to real locations, and we practiced checking and setting CRSs appropriately.
- Data manipulation and subsetting: Treating sf objects like data frames allows leveraging of powerful dplyr verbs. We filtered spatial features by attribute (e.g., extracting municipalities above a population threshold) and by spatial relationship (using intersections for selection). We performed group-wise summarization to aggregate smaller units into larger regions, during which sf automatically handled the geometric union of features in each group – simplifying the creation of dissolved boundaries for aggregated data.
- Raster data handling with stars: We introduced the stars package for raster and multi-dimensional data. Reading rasters with read_stars() and plotting them gave us a way to incorporate continuous spatial data (like satellite imagery or climate surfaces) into analysis. We discussed basic raster operations and how stars and sf can be combined, for example extracting raster values at vector point locations or converting raster data to vector features (and vice versa). This opens the door to rich spatial analyses, such as overlaying population rasters with administrative boundaries to calculate population statistics per region.
- Coordinate transformations: We emphasized the need for and ease of reprojecting data with st_transform(). Whether to prepare data for distance calculations or to match the projection of a map background, transforming CRSs ensures apples-to-apples comparisons in space. The examples illustrated transforming lat/long data to a local projected CRS for accurate area measurements. We cited how sf leverages the PROJ library for these transformations, ensuring high accuracy.
- Visualization techniques: Creating maps is often the end goal of spatial analysis. We saw how base plotting can give a quick view of data, but for polished outputs ggplot2 with geom_sf is extremely powerful. With ggplot we made layered maps, combining polygons and points with custom colors and legends. The integration of sf with ggplot2 means we can directly map attributes to aesthetics (fill, color, size) and produce thematic maps with minimal effort. We also touched on adding titles, scales, and choosing appropriate color palettes – crucial for making the map interpretable and visually appealing.
- Interactive mapping: Finally, we explored making interactive maps using leaflet. This capability transforms static results into engaging tools where users can navigate the data spatially. By adding tiles and data layers to a leaflet widget, we created an interactive map of regions and sample plots. Features like popups demonstrated how additional information can be revealed on demand, making the map a richer experience. Interactive maps are invaluable for presentations and web dashboards, as they invite exploration and can help convey complex spatial stories (for instance, seeing how different layers overlap by toggling them).
Throughout the chapter, we have reinforced that geospatial analysis in R is not an isolated specialty – it connects with general data manipulation and visualization workflows. You can join tables, aggregate data, and pipe through transformations in a manner consistent with non-spatial analyses, thanks to the design of packages like sf. Moreover, critical computational geometry tasks are handled by robust libraries (GDAL, GEOS, PROJ), giving you reliable results without needing GIS-specific software.
With these advanced techniques, you can confidently tackle a wide range of spatial data tasks. For example, you could: merge disparate spatial datasets (such as linking health data points to census tract polygons), transform coordinates to analyze distances (such as computing how far each school is from the nearest earthquake fault line), rasterize vector data or vice versa (for instance, converting land use polygons into a raster grid for modeling), and visualize the outputs in compelling ways (static reports or interactive web apps).
As you apply these skills, keep in mind best practices like checking data for errors (invalid geometries can be fixed with st_make_valid if needed), using appropriate projections for the task (distortion matters!), and documenting your workflow (perhaps using R Markdown to integrate code, analysis, and maps in one reproducible document). Geospatial analysis can be complex, but the structured approach presented in this chapter should provide a strong foundation. With practice, you'll be able to extend these techniques – for example, into spatial modeling, point pattern analysis, network analysis, and more advanced topics covered in later chapters.
Geospatial data manipulation in R empowers us to add the “where” dimension to our analyses. By mastering vector and raster handling, coordinate systems, and mapping techniques, we unlock new insights that purely non-spatial analysis might miss. The techniques learned here will enable you to build maps that inform policy, identify patterns like clusters or hotspots, allocate resources efficiently by location, and generally make better decisions grounded in spatial context. These are highly valuable capabilities in our increasingly location-aware world. Happy mapping, and may your future analyses always find the geo in data!