1 Introduction to Geospatial Data Science with R
In an era of ubiquitous data and rapid technological advancement, geospatial data science has emerged as a critical interdisciplinary field. Geospatial data science combines principles from geography, statistics, and computer science to collect, analyze, and visualize data that has a spatial or geographic component (Anselin, 2020). What sets geospatial data apart is that it includes location information — data points are tied to specific places on Earth’s surface. This spatial context enriches data analysis by revealing patterns and relationships that are not apparent in non-spatial data, leading to deeper insights in fields as diverse as public health, environmental sustainability, urban planning, and economics.
Over the past decade, the growth of location-based data has been explosive. We have shifted from a historically data-scarce environment to today’s data-rich world, where vast streams of georeferenced information flow continuously from smartphones, satellites, sensors, and social media (Miller & Goodchild, 2015). These data exhibit not only high volume, but also great variety (e.g. maps, GPS traces, satellite imagery, location-tagged text) and velocity (real-time feeds), characteristics often associated with “Big Data” (Miller & Goodchild, 2015). The emergence of this geo-big data ecosystem has prompted what some researchers call a “data-driven geography,” in which computational methods and data science techniques are increasingly central to geographic research (Miller & Goodchild, 2015; Goodchild, 2020). Crucially, the longstanding geographic principle that “location matters” has gained renewed importance: spatial context fundamentally influences social, economic, and environmental processes (Goodchild, 2020). This idea echoes the famous First Law of Geography – “everything is related to everything else, but near things are more related than distant things” (Tobler, 1970) – underscoring that understanding spatial relationships is vital to making sense of the world’s data.
1.1 Defining Geospatial Data Science
Geospatial data science can be defined as the science of analyzing and extracting knowledge from data that has a geographic or spatial aspect. It is often viewed as a modern evolution of geographic information science (GIScience) into the age of big data and artificial intelligence (Goodchild, 2020). Luc Anselin, a pioneer in spatial analysis, describes spatial data science as a subset of data science that specifically accounts for the unique properties of spatial data in analytical methods and tools (Anselin, 2020). These unique properties include spatial autocorrelation (the interdependence of observations based on location), spatial heterogeneity (relationships that vary over space), scale and projection issues, and the often high dimensionality of spatial-temporal data.
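To make the idea of spatial autocorrelation concrete, the sketch below computes Moran's I, one common measure of it, in base R. It is an illustration under stated assumptions rather than a recommended implementation: the helper names `rook_weights()` and `morans_i()` are hypothetical, the data are a toy 4 x 4 grid, and neighbours are defined as cells that share an edge (rook contiguity) with unstandardized binary weights.

```r
# Binary rook-contiguity weights for cells on a regular grid.
# Two cells are neighbours when their Manhattan distance is exactly 1.
# (Hypothetical helper, written for this illustration only.)
rook_weights <- function(coords) {
  d <- as.matrix(dist(coords, method = "manhattan"))
  (d == 1) * 1
}

# Moran's I: (n / S0) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i^2),
# where z are deviations from the mean and S0 is the sum of all weights.
morans_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Toy 4 x 4 grid of cells
grid <- expand.grid(row = 1:4, col = 1:4)
w <- rook_weights(grid)

# A smooth gradient: neighbouring cells have similar values
x_gradient <- grid$row + grid$col
i_gradient <- morans_i(x_gradient, w)

# A checkerboard: every neighbouring pair has opposite values
x_checker <- (grid$row + grid$col) %% 2
i_checker <- morans_i(x_checker, w)

i_gradient  # positive: similar values cluster together
i_checker   # -1: perfect negative autocorrelation on this grid
```

The gradient yields a positive value because nearby cells resemble each other, while the checkerboard yields -1 because every neighbouring pair differs. In practice, dedicated packages handle weights construction and significance testing; this sketch only shows what the statistic measures.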
Geospatial data science is inherently interdisciplinary. It bridges traditional GIS and cartography with modern computational and statistical techniques (Singleton & Arribas-Bel, 2021). In practical terms, this field draws upon methods from exploratory spatial data analysis, spatial statistics, machine learning, optimization, remote sensing, and database management. The goal is not only to map where things happen, but to understand why they happen where they do, and to use that understanding for prediction and decision-making. As Singleton and Arribas-Bel (2021) note, geographic (or spatial) data science integrates the rich theoretical foundation of geography with the powerful algorithms of data science, yielding a new perspective that neither domain could achieve alone. In essence, geospatial data science allows us to analyze “space and place” quantitatively, discovering spatial patterns and relationships that inform scientific inquiry and policy.
Importantly, geospatial data science extends beyond static maps. It involves spatial modeling and simulation, spatio-temporal analysis (looking at how processes unfold over space and time), and, increasingly, geospatial artificial intelligence (GeoAI). For example, recent advances in GeoAI combine GIS data with machine learning and deep learning techniques to classify land cover from satellite images, detect patterns like traffic congestion or deforestation, and even generate predictive models for urban growth (Li et al., 2020; Goodchild, 2020). By incorporating AI, geospatial data science can leverage unstructured data (such as high-resolution imagery or geo-tagged text) and handle the complexity of large-scale, real-time spatial data. This evolving GeoAI frontier exemplifies how geospatial data science continues to push methodological boundaries, enabling more automation and intelligence in spatial analysis than traditional GIS techniques allow.
1.2 Why Use R for Geospatial Analysis?
R is one of the principal programming languages for data science and statistics, and it has become a powerhouse for geospatial analysis. There are several reasons why R is especially well-suited for geospatial data science:
Statistical Strength and Libraries: R was built for statistical computing and has a vast ecosystem of packages. For spatial analysis, R provides mature libraries developed over decades by the research community. Classic packages like sp and rgdal (Pebesma & Bivand, 2005) laid the foundation for spatial data handling in R, and more recent packages like sf (simple features), raster/terra (for raster data), and tmap or ggplot2 (for mapping and visualization) offer state-of-the-art functionality. The introduction of the sf package marked a significant improvement by adopting modern data standards for vector data, making spatial operations in R both faster and more in line with GIS standards (Pebesma, 2018). Because R’s spatial packages are often developed by leading experts, they implement cutting-edge methods from academic research (Bivand et al., 2013). This means analysts have access to validated, peer-reviewed techniques for tasks like spatial interpolation, geostatistics, or spatial regression right out of the box.
Reproducibility and Scripting: R encourages a scripted, reproducible workflow. Unlike some traditional GIS software which might rely on manual steps, R allows you to write scripts that document every step of data processing, analysis, and visualization. This is crucial for academic research and any analytic work that demands transparency and reproducibility (Bivand et al., 2013). By using R, one can integrate data cleaning, analysis, and map-making in a single reproducible script or report (e.g., using R Markdown or Quarto for dynamic documents). This fosters better collaboration and trust in the results.
Integration of Spatial and Non-Spatial Analysis: Geospatial problems often require combining spatial data with other types of data (tables, time-series, text) and applying both geographic and non-geographic analyses. R excels at data integration and has libraries for virtually any analytical task — from classical statistics and econometrics to machine learning and text mining. Within one R environment, an analyst can perform a regression analysis that includes spatial features, run a clustering algorithm on geographic coordinates, or apply time-series models to environmental sensor data. This seamless integration prevents the siloing of spatial analysis and allows truly holistic data science workflows (Longley et al., 2015).
Community and Support: R has a robust user community, including a specialized r-spatial community of practitioners and developers. There are extensive tutorials, forums, and active development on spatial packages. Many academic textbooks and papers provide code examples in R (e.g., Applied Spatial Data Analysis with R by Bivand, Pebesma & Gómez-Rubio (2013) is a standard reference). This means learners and professionals can readily find support and patterns for implementing new methods. Additionally, R’s open-source nature encourages sharing of data and code, which aligns with the open science ethos in geospatial research.
While R is the focus of this book, it is worth noting that geospatial data science is a broad field with tools also available in other languages and platforms (Python, Julia, GIS software, etc.). Each has its strengths, but R’s combination of statistical power, rich package ecosystem, and emphasis on reproducible analysis makes it a compelling choice for both beginners and advanced spatial data scientists. By mastering geospatial techniques in R, you gain access to a flexible toolset that can tackle problems ranging from small-scale local studies to large-scale global analyses.
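As a small taste of the scripted workflow described in this section, the sketch below reads spatial data, derives a spatial variable, and feeds it into an ordinary regression, all in one script. It assumes the sf package is installed and uses the North Carolina counties dataset that ships with it. The choice of EPSG:32119 as the projected coordinate system and the regression specification are assumptions made purely for illustration, not a substantive model.

```r
library(sf)

# Read the North Carolina counties shapefile bundled with the sf package
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Project to a planar CRS measured in metres
# (EPSG:32119, NAD83 / North Carolina; an assumed choice for this example)
nc_proj <- st_transform(nc, 32119)

# Spatially derived variable: county area in square kilometres
nc_proj$area_km2 <- as.numeric(st_area(nc_proj)) / 1e6

# Attribute-derived variables: SIDS cases per 1,000 births (1974-78)
# and the share of non-white births in the same period
nc_proj$sids_rate <- 1000 * nc_proj$SID74 / nc_proj$BIR74
nc_proj$nwb_share <- nc_proj$NWBIR74 / nc_proj$BIR74

# A plain (non-spatial) linear model that mixes both kinds of variable;
# illustrative only, it ignores spatial dependence between counties
fit <- lm(sids_rate ~ nwb_share + area_km2, data = st_drop_geometry(nc_proj))
summary(fit)

# Quick choropleth of the derived rate
plot(nc_proj["sids_rate"], main = "SIDS cases per 1,000 births, 1974-78")
```

Because every step lives in the script, rerunning it reproduces the same table, model, and map, which is the reproducibility benefit noted above.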
1.3 Applications and Importance of Geospatial Data Science
Geospatial data science has far-reaching applications across numerous domains. Its importance stems from the fundamental role that location plays in natural and human phenomena. Here we highlight a few key areas where geospatial analysis is making transformative contributions:
Public Health and Epidemiology: Location is crucial for understanding the spread of diseases and the distribution of health resources. A classic historical example is John Snow’s 1854 cholera map of London, which used spatial analysis (albeit manually) to identify a contaminated water pump as the outbreak source, an early triumph of geospatial reasoning. Today, with modern data, health officials use geospatial methods to track and predict disease outbreaks, map the spread of epidemics, and optimize the placement of healthcare facilities. For instance, during the COVID-19 pandemic, spatial dashboards and models were used worldwide to monitor case hotspots, assess travel-related risks, and guide lockdown policies. A review by Franch-Pardo et al. (2020) found a surge of studies in early 2020 that leveraged GIS and spatial statistics to understand COVID-19 patterns and correlations with demographics, illustrating how essential geospatial analysis has become for public health decision-making.
Environmental Monitoring and Climate Science: Environmental processes are inherently spatial. Geospatial data science allows scientists to monitor deforestation, biodiversity loss, urban expansion, and climate change effects with unprecedented detail. Remote sensing (satellite and aerial imagery) combined with spatial analytics enables global-scale environmental assessments. A landmark study by Hansen et al. (2013), for example, used satellite data (Landsat imagery) and big-data processing to produce high-resolution global maps of forest cover change. Their analysis quantified forest losses and gains across the entire planet at 30-meter resolution, highlighting deforestation hotspots in the tropics. Such work guides international policy on conservation and climate change by pinpointing where changes are happening. Similarly, climate scientists use spatial models to downscale global climate predictions to regional impacts, helping local communities prepare for sea-level rise, heatwaves, or water shortages. Geospatial techniques are also pivotal in disaster management – from flood risk mapping to real-time wildfire monitoring – allowing more effective and timely responses to environmental hazards.
Urban Planning and Transportation: Cities are fundamentally spatial systems. Urban planners and geographers employ spatial data science to design smarter, more livable cities. This might involve analyzing urban sprawl through time-series of land use maps, optimizing public transit routes using commuter origin-destination data, or locating the best sites for new infrastructure like schools and parks by examining demographic and geographic criteria. Modern cities generate vast amounts of spatial data (e.g., traffic speeds from GPS in vehicles, pedestrian movement from mobile phones, urban environment data from IoT sensors). By analyzing these data, cities can improve traffic flow, reduce accidents by identifying high-risk locations, and plan development that minimizes commute times and environmental impact. Spatial data science also supports the concept of smart cities, where real-time data and spatial models improve urban management (Batty, 2018). For example, analyzing where traffic congestion or air pollution is worst at different times can inform dynamic policies or interventions.
Economic Geography and Business: Businesses increasingly use geospatial analysis for market research, logistics, and site selection. Location intelligence can reveal where customer bases are concentrated, how sales vary by region, or where to place a new store or warehouse for maximum efficiency. In economic geography and regional science, researchers analyze spatial distributions of economic activity, trade flows, and regional development. Techniques like spatial econometrics measure how one region’s economy is impacted by neighboring regions (LeSage & Pace, 2009). Geospatial data science thus helps both private and public sectors make location-informed decisions – from choosing the next retail location to allocating funds to regions most in need of economic stimulus.
Humanitarian and Social Applications: Geospatial data science plays a crucial role in humanitarian efforts such as crisis mapping and resource allocation. During natural disasters or conflicts, spatial analysis of satellite imagery and crowd-sourced data (e.g., through platforms like OpenStreetMap) can identify affected areas, guide relief distribution, and assist search-and-rescue operations. Spatial data is also key in studying social issues: researchers examine patterns of inequality, segregation, or crime by mapping incidents and socio-economic data (Longley et al., 2015). For instance, mapping the availability of public services across a city can reveal underserved neighborhoods, informing more equitable urban policies.
Across all these examples, a common thread is that geospatial data science helps uncover the “where” factor – where things are happening and how different places connect or differ. This spatial perspective is increasingly recognized as essential for solving complex problems. Whether one is managing a public health crisis, planning sustainable cities, or conserving ecosystems, ignoring spatial relationships can lead to incomplete or flawed conclusions. By incorporating location into data analysis, geospatial data science provides a more holistic understanding of problems and often yields insights that purely aspatial analysis would miss (Goodchild, 2020).
1.4 Structure of this Book
This book is designed to take you on a comprehensive journey through geospatial data science using R, from foundational concepts to advanced techniques. It is organized into three main parts:
Part 1 – Foundations: We begin with an introduction to geospatial data science and a brief history of the field’s development. You will learn about the types and formats of spatial data (such as vector and raster data), the history of GIS and spatial analysis, and the key concepts that underpin modern geospatial science. We also cover some advanced data manipulation techniques in R that will be useful for handling geospatial data, ensuring you are comfortable with R basics and data wrangling before diving into spatial specifics.
Part 2 – Geospatial Data Handling and Analysis: This central portion of the book covers the complete workflow of geospatial analysis. Chapters will guide you through geospatial data acquisition, where you’ll learn how to obtain spatial datasets from various sources including open data repositories, web APIs, and remote sensing archives. Next, we address geospatial data processing, teaching you how to clean, transform, and prepare spatial data (e.g., projecting data to appropriate coordinate systems, handling missing data, and merging different datasets). We then focus on geospatial data visualization, demonstrating how to create informative maps and interactive visuals that communicate spatial data effectively. Following that, we delve into geospatial data analysis techniques, including exploratory spatial data analysis (ESDA) to detect patterns, and introduce methods for spatial interpolation, density estimation, and cluster detection. Finally, we discuss geospatial data communication, highlighting ways to present results through dashboards, web maps, and reports to inform decision-makers and stakeholders.
Part 3 – Advanced Topics: The final part of the book explores advanced and emerging topics in geospatial data science. We examine specialized spatial techniques such as spatial autocorrelation measures (e.g., Moran’s I, Geary’s C) and spatial regression models that account for dependency among locations. A chapter on spatial autocorrelation delves deeper into why and how nearby locations can exhibit similar values and how to quantify these relationships. We then cover data integration, for instance combining spatial data with other data types (like temporal data for space-time analysis or integrating socio-economic data with environmental data for comprehensive models). Subsequent chapters introduce geospatial models, including both traditional models (like geographically weighted regression or spatial econometric models) and machine learning approaches tailored to spatial data (such as spatial clustering and classification with spatial features). Finally, we look at the frontier of generative AI in geospatial data science, discussing how recent advances in artificial intelligence and deep learning (for example, generative models and large language models) are being applied to create new spatial data (like simulated satellite images) or to enhance analysis (such as automated map interpretation and geo-annotation via AI). This part equips you with knowledge of cutting-edge developments and prepares you for future trends in the field.
Throughout the book, each chapter provides hands-on examples and code in R, ensuring that you not only read about concepts but also implement them. We use real-world datasets from various domains to illustrate each technique, so you gain practical experience. By the end, you will have developed a robust toolkit for geospatial data science in R and the confidence to apply spatial thinking to complex problems.
Geospatial data science is a dynamic and fast-growing field. By learning it through the lens of R, you are tapping into a rich ecosystem of tools that academic researchers, government analysts, and industry professionals worldwide use for spatial analysis. We hope this book empowers you to harness the power of location in data, unlocking new insights and solutions in your own area of interest. Now, let’s begin this journey into the world of geospatial data science with R!
References
- Anselin, L. (2020). Spatial Data Science. In D. Richardson et al. (Eds.), The International Encyclopedia of Geography. Wiley.
- Bivand, R. S., Pebesma, E., & Gómez-Rubio, V. (2013). Applied Spatial Data Analysis with R (2nd ed.). Springer.
- Franch-Pardo, I., Napoletano, B. M., Rosete-Verges, F., & Billa, L. (2020). Spatial analysis and GIS in the study of COVID-19: A review. Science of the Total Environment, 739, 140033.
- Goodchild, M. F. (2020). Geospatial data science: Future prospects. International Journal of Geographical Information Science, 34(6), 1041–1051.
- Hansen, M. C., et al. (2013). High-resolution global maps of 21st-century forest cover change. Science, 342(6160), 850–853.
- Longley, P. A., Goodchild, M. F., Maguire, D. J., & Rhind, D. W. (2015). Geographic Information Science and Systems (4th ed.). Wiley.
- Miller, H. J., & Goodchild, M. F. (2015). Data-driven geography. GeoJournal, 80(4), 449–461.
- Pebesma, E. (2018). Simple Features for R: Standardized support for spatial vector data. The R Journal, 10(1), 439–446.
- Singleton, A. D., & Arribas-Bel, D. (2021). Geographic data science. Geographical Analysis, 53(1), 61–75.
- Tobler, W. R. (1970). A computer movie simulating urban growth in the Detroit region. Economic Geography, 46(Supplement), 234–240.