7 Automating Data Collection with APIs
In data science projects, having access to interesting datasets is essential fuel for analysis. While one approach is to manually download CSV files from various websites, a more dynamic and powerful method is to use APIs (Application Programming Interfaces). An API is essentially a set of rules or protocols that enables different software applications to communicate and exchange data and functionality. In practical terms, many data providers offer web APIs that allow direct retrieval of up-to-date data via code, rather than manually downloading files. Using APIs can ensure you are working with the latest data and can automate the data acquisition process.
In this chapter, we will focus on how to import data using APIs in R through specialized packages. Each package acts as a wrapper around an external data source’s API, handling the communication details so that you can simply call R functions to fetch data. Before diving into specific examples, it’s important to clarify what we mean by an argument in this context. In programming, an argument is a value or input that you pass to a function when you call it, which influences how the function operates. For instance, if a function is defined to take a country code as an argument, you would provide a specific country code value when calling that function. Understanding arguments is crucial because all the API-related functions we’ll use require certain arguments (like indicator codes, country codes, years, etc.) to specify what data you want.
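To make the idea of an argument concrete, here is a toy R function (not part of any API package, purely for illustration) that takes two arguments:

```r
# A toy function with two arguments: a country code and a year
describeRequest <- function(country, year) {
  paste("Requesting data for", country, "in", year)
}

# "CA" and 2016 are the arguments supplied in this call
describeRequest(country = "CA", year = 2016)
```

The values "CA" and 2016 are the arguments; changing them changes what the function does, which is exactly how the API functions in this chapter work.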
At the end of the chapter, you should be able to:
- Know what an argument is. You will understand how functions use arguments (inputs) to modify their behavior, and how to supply the correct arguments to get the data you need.
- Import data using an API in R. You will learn to use various R packages that interface with web APIs to retrieve data from online sources, avoiding manual downloads and making your data acquisition process reproducible and up-to-date.
We will explore several real-world examples of R packages that provide easy access to data via APIs. These include:
- WDI: Access World Bank’s World Development Indicators and other datasets.
- OECD: Retrieve indicators and datasets from the OECD (Organisation for Economic Co-operation and Development).
- spiR: Access the Social Progress Index data.
- statcanR: Download data from Statistics Canada’s open data portal.
- EpiBibR: Obtain bibliographic references (especially COVID-19 related literature data).
- coronavirus: Get daily COVID-19 statistics (cases, deaths, etc.) from the Johns Hopkins University dataset.
For each of these, we will discuss the data source, the key functions provided by the R package, and walk through examples of how to use them. By the end, you will see how APIs combined with R packages allow you to seamlessly pull in data from a variety of domains with just a few lines of code.
7.1 WDI
Database Description
The World Development Indicators (WDI) database is a flagship dataset of the World Bank containing a wide range of global development statistics. It is a compilation of relevant, high-quality, and internationally comparable statistics about global development and the fight against poverty. The database includes over 1,600 time-series indicators for 217 economies and more than 40 country groups, with data for many indicators going back over 50 years. These indicators cover topics such as economic growth, education, health, poverty, environmental factors, and much more.
What makes the WDI particularly powerful is that it’s part of a larger collection of data sources provided by the World Bank. In fact, the R package WDI allows users not only to access the main World Development Indicators but also dozens of other datasets hosted by the World Bank (e.g., International Debt Statistics, Doing Business indicators, Human Capital Index, etc.). This means that through one package, you can tap into a rich variety of development data.
Using the World Bank’s API through the WDI package has several advantages: the data are always up-to-date (as of the last World Bank update), you can easily retrieve long time series for multiple countries, and you can programmatically search for indicators by keywords. The data returned by the API is typically in a tidy country-year format – each row corresponds to a country and year, with columns for the indicator values (and possibly country codes and other metadata). This format is convenient for analysis and plotting in R once you have the data.
Functions
The WDI R package provides two key functions to interact with the World Bank data:
- WDIsearch() – Search for indicators by keyword.
- WDI() – Download data for specified indicators, countries, and time ranges.
Additionally, the package includes some utility functions (such as WDIcache and WDIbulk) for advanced usage, but our focus will be on the two main functions above, which cover most typical needs. We will go through each of these functions with examples to illustrate how they work and how to use the correct arguments to get the data you want.
WDIsearch()
The function WDIsearch() allows you to find indicators in the World Bank datasets by searching for keywords in the indicator name or description. This is extremely useful when you know the topic you’re interested in (e.g., GDP, life expectancy, CO2 emissions) but need to find the exact indicator code that the World Bank uses for that data. The WDIsearch() function takes a character string as an argument – this string is the keyword or pattern you want to search for. The function returns a data frame of indicators whose names or descriptions match the search term.
For example, suppose we want to find all indicators related to GDP. We can use WDIsearch("GDP") to retrieve a list of such indicators:
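```r
# Loading the WDI library
library(WDI)

# Search all indicators whose name or description contains "GDP"
listOfIndicators <- WDIsearch("GDP")

# List the first 5 indicators found
listOfIndicators[1:5, ]
```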
In the code above, WDIsearch("GDP") searches the World Bank’s catalog of series for the substring “GDP”. The result is assigned to listOfIndicators, which will be a data frame. The data frame typically has columns like “indicator” (the official indicator code used by the API) and “name” (a human-readable description of the indicator). By printing the first 5 rows (listOfIndicators[1:5, ]), we might see something like:
indicator name
[1,] "NY.GDP.MKTP.CD" "GDP (current US$)"
[2,] "NY.GDP.MKTP.KD.ZG" "GDP growth (annual %)"
[3,] "NY.GDP.PCAP.CD" "GDP per capita (current US$)"
[4,] "NY.GDP.PCAP.KD.ZG" "GDP per capita growth (annual %)"
[5,] "NE.GDI.TOTL.ZS" "Gross capital formation (% of GDP)"
Example output: The above is an illustrative example of what the search results might contain (actual results may differ or appear in a different order). Each row shows an indicator code and its description. For instance, "NY.GDP.MKTP.CD" is the code for total GDP in current US dollars, and "NY.GDP.PCAP.CD" is GDP per capita in current US dollars. Using this list, you can identify the specific indicator code you need for your analysis.
The ability to search by keyword (case-insensitive, and even supporting regular expressions in WDIsearch()) makes it much easier to find the right data without having to manually browse the World Bank website. Once you have the indicator code(s) you need, you can use the WDI() function to download the data.
WDI()
The WDI() function is used to actually retrieve data for one or more indicators and one or more countries over a specified time range. The main arguments for WDI() are:
- indicator: The indicator code or codes you want to download (as a string or a vector of strings).
- country: The country code or codes for which you want the data. Typically these are ISO-2 or ISO-3 country codes (the World Bank uses ISO-2 country codes by default). You can also use the special code "all" to retrieve all countries, or codes for aggregates such as the OECD countries as a group.
- start: The starting year for the data (a numeric year, or in some cases a string if using quarterly/monthly data).
- end: The ending year for the data.
There are additional optional arguments as well. For example, extra = TRUE can be used to fetch additional columns (like region, income level, etc., for each country), and cache can be used to supply or update the locally cached list of indicators. But in many cases you can ignore these extras and just specify the four main arguments above.
Let’s consider a concrete example. Suppose we are interested in the indicator “Stocks traded, total value (% of GDP)”, which has the code CM.MKT.TRAD.GD.ZS in the WDI database. We want to gather this data for four countries – France, Canada, the United States, and China – over the period 2000 to 2016. Using WDI(), we can do this as follows:
library(WDI)
# Access and store data for "Stocks traded, total value (% of GDP)"
# for France (FR), Canada (CA), USA (US), and China (CN) from 2000 to 2016
stockTraded <- WDI(indicator = "CM.MKT.TRAD.GD.ZS",
country = c("FR", "CA", "US", "CN"),
start = 2000,
end = 2016)
# Peek at the first few rows of the retrieved data
head(stockTraded)
When this code runs, the WDI() function contacts the World Bank API behind the scenes and downloads the requested data. The result, stored in stockTraded, is a data frame. If we inspect it (using head(stockTraded) to see the first several rows), we might see something like:
iso2c country year CM.MKT.TRAD.GD.ZS
1 CA Canada 2016 147.92
2 CA Canada 2015 126.85
3 CA Canada 2014 123.45
4 CA Canada 2013 115.67
5 CA Canada 2012 105.23
6 CA Canada 2011 98.10
...
Example explanation: Each row of the data frame represents a country-year observation. In this example, the columns include: an ISO 2-letter country code (iso2c), the country name, the year, and a column named after the indicator code (CM.MKT.TRAD.GD.ZS) which contains the value of stocks traded (% of GDP) for that country in that year. The data frame is sorted by country and year (typically countries in alphabetical order by code and years in descending order, but the ordering may vary). For instance, above we see data for Canada (CA) for 2016, 2015, etc. If we scroll further down the data frame, we would find the entries for China (CN), France (FR), and the USA (US), each with values from 2000 up to 2016.
With this data frame in hand, you could proceed to analyze it or visualize it (e.g., compare the trends of that indicator across the four countries). The key point is that a single function call gave us a structured dataset ready for use, which is much more convenient than manually finding each country’s data from a website.
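For instance, a quick comparison plot can be sketched with ggplot2 (a sketch assuming the ggplot2 package is installed and the column names are as shown above):

```r
library(ggplot2)

# Compare stocks traded (% of GDP) across the four countries over time
ggplot(stockTraded,
       aes(x = year, y = CM.MKT.TRAD.GD.ZS, color = country)) +
  geom_line() +
  labs(x = "Year",
       y = "Stocks traded, total value (% of GDP)",
       color = "Country")
```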
Note: If you request a range of years where some countries don’t have data for the entire range, the resulting data frame will contain NA (missing) values for those country-years where data is absent. This is normal, since not all indicators have values for every country and year.
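If you prefer to work only with complete rows, one simple option (a base R sketch, not something the package requires) is na.omit():

```r
# Drop rows where any value, including the indicator, is missing
stockTradedComplete <- na.omit(stockTraded)
```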
As a final tip, you can request multiple indicators in one go by providing a vector of indicator codes to the indicator argument. In that case, WDI() will return one column per indicator (plus the country, code, and year columns). There is even a convenient feature: if you name the elements of the indicator vector in R, those names will be used as column names in the result. For example, WDI(indicator = c(gdp = "NY.GDP.MKTP.CD", pop = "SP.POP.TOTL"), country = "US", start = 2010, end = 2020) would fetch GDP and population for the U.S. and name the columns gdp and pop respectively in the output data frame.
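Written out as a code block, that multi-indicator call looks like this (the object name usData is just an illustrative choice):

```r
library(WDI)

# Fetch two indicators at once; the vector names become column names
usData <- WDI(indicator = c(gdp = "NY.GDP.MKTP.CD",
                            pop = "SP.POP.TOTL"),
              country = "US",
              start = 2010,
              end = 2020)

head(usData)
```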
If you want to explore more about the WDI package and see another detailed example of its usage, you can refer to the World Bank data application example provided by the package authors, which demonstrates a case study of retrieving and analyzing data from the WDI.
tl;dr
# Loading the WDI library
library(WDI)
# Search all indicators with the term "GDP"
listOfIndicators <- WDIsearch("GDP")
# List the first 5 indicators found
listOfIndicators[1:5, ]
# Retrieve data for a specific indicator and set of countries over a time range
stockTraded <- WDI(indicator = "CM.MKT.TRAD.GD.ZS",
country = c("FR", "CA", "US", "CN"),
start = 2000,
end = 2016)
head(stockTraded)
(The above code summary shows how to search for indicators containing “GDP” and how to download one example indicator for multiple countries. In practice, replace the indicator code and country codes with those relevant to your needs.)
7.2 OECD
Database Description
The Organisation for Economic Co-operation and Development (OECD) maintains a rich database of economic and social indicators for its member countries (and in some cases, non-members). The OECD data spans numerous categories such as agriculture, finance, health, education, labor, and more. The public-facing OECD data portal (often accessed via data.oecd.org) features around 300 key indicators organized into about a dozen categories. These include high-level indicators like unemployment rates, GDP for OECD countries, population statistics, etc., which are curated for easy browsing.
However, the actual breadth of data available through the OECD’s API is much larger. The OECD provides a flexible API that allows access to a wide array of datasets, each identified by a unique dataset code. Each dataset can contain many series and dimensions (for example, a dataset might contain data broken down by country, by sex, by age group, by year, etc.). The R package OECD is designed to help users discover and download data from this API without needing to manually construct complex query URLs.
The OECD data service gives you access to up-to-date statistics across various domains for many countries (mostly OECD members). This can include indicators like employment rates, economic outlook figures, education enrollment, health outcomes, and so on. The data is often structured by multiple dimensions (country, year, and often other classifications like gender, age, or industry sector, depending on the dataset). Using the R OECD package, we can search for datasets and then retrieve the specific slices of data we need by specifying filters.
Functions
The OECD R package provides several functions to interact with the OECD API. The main functions we will highlight are:
- get_datasets() – Retrieve a list of all available datasets (by ID and description).
- search_dataset() – Search within the dataset list for keywords, to find relevant dataset IDs by topic.
- get_data_structure() – Given a specific dataset ID, get the structure of that dataset (i.e., what dimensions it has and the valid codes for each dimension).
- get_dataset() – Download data from a specific dataset, optionally filtering by dimension codes (such as specific countries, years, etc.).
Each of these functions is useful at a different stage: first discovering what data exists, then understanding the structure of a specific dataset, and finally retrieving the data of interest. We will go through examples of their usage.
(Note: the function names in the package are easy to confuse because of the singular/plural distinction. get_datasets() (plural) returns the list of dataset identifiers and descriptions, while get_dataset() (singular) downloads the data for one dataset. Be careful to use the correct function.)
get_datasets() – Listing available datasets
Before searching or fetching data, you may want to know what datasets are available via the OECD API. The function get_datasets() returns a data frame of all dataset identifiers along with their descriptions. This is typically a large list (since the OECD has many datasets). You can store this list and then use it for searching. For example:
# Loading OECD package
library(OECD)
# List all available datasets and store in a data frame
dataset_list <- get_datasets()
# Check the first few entries in the dataset list
head(dataset_list)
After running get_datasets(), dataset_list might contain entries like:
id description
1 AFLPMEAN Average length of parental leave, measured in weeks
2 AGRI_INV Agricultural Innovation Indicators...
3 AEO African Economic Outlook...
4 BLI Better Life Index...
...
This is just illustrative. Each row has a dataset id (a short code like “AEO”, “BLI”, etc.) and a longer description explaining what that dataset is. Given that the list can be long, it’s often more practical to search for a keyword rather than scrolling through it manually. That’s where search_dataset() comes in.
search_dataset()
The function search_dataset() helps you find which dataset(s) might contain the data you need by searching within the dataset names and descriptions. You provide a keyword (or regex pattern) and a data frame of datasets to search (typically the result of get_datasets()).
For example, if we are interested in data about unemployment, we can search the dataset list for the term “unemployment”:
# Assuming dataset_list has been obtained via get_datasets()
search_results <- search_dataset("unemployment", data = dataset_list)
This call will return a data frame of datasets whose descriptions contain “unemployment”. For instance, one likely result is a dataset that deals with unemployment by duration. The output might look like this (for illustration):
id description
1 DUR_D Unemployment duration by age group and gender
2 ... ... (other related datasets if any)
From this search result, suppose we identify that DUR_D is the dataset we want (as an example, DUR_D might stand for “Duration of unemployment by demographic breakdown”). Now that we have a specific dataset ID, the next step is to see what the structure of that dataset is (what dimensions and codes it uses) so we can query it properly.
get_data_structure()
Every OECD dataset can have multiple dimensions. For example, a dataset might be broken down by country, by year, by gender, by age group, etc. To know how to formulate our query and what filters to use, we need to know what dimensions a dataset has and what the valid codes for those dimensions are. The function get_data_structure(dataset_id) returns an object (often a list of data frames) describing the dataset’s structure.
Continuing our example with dataset "DUR_D" (unemployment duration), let’s retrieve its structure:
# Get the structure of the "DUR_D" dataset
dstruc <- get_data_structure("DUR_D")
# Examine the structure object
str(dstruc, max.level = 1)
When you call get_data_structure("DUR_D"), behind the scenes R fetches metadata about that dataset from the OECD API. The result dstruc is typically a list where each element corresponds to a dimension of the dataset. For instance, dstruc might contain elements like $COUNTRY, $AGE, $SEX, and $TIME (these are hypothetical), each of which is a data frame listing the codes and meanings for that dimension.
Using str(dstruc, max.level = 1) will print an overview of the list structure. You might see something like:
List of 4
$ COUNTRY: 'data.frame': ... (country codes and names)
$ AGE : 'data.frame': ... (age group codes and descriptions)
$ SEX : 'data.frame': ... (sex codes and descriptions)
$ TIME : 'data.frame': ... (time period info, possibly years available)
This tells us that the dataset DUR_D is broken down by Country, Age, Sex, and Time (year). To find the actual codes, we can inspect each of those elements. For example, dstruc$COUNTRY might show entries like USA = "United States", FRA = "France", etc. dstruc$SEX might show codes like M = "Male", W = "Female", MW = "Total (Male+Female)". dstruc$AGE might list codes like TOTAL = "All ages", Y15-24 = "15-24 years", etc. (The actual codes can vary; these are just plausible examples.)
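In practice, you would simply print the relevant list elements to see the valid codes (a sketch assuming the hypothetical dimension names used above):

```r
# Inspect the valid codes for each dimension of "DUR_D"
dstruc$COUNTRY   # country codes and country names
dstruc$SEX       # sex codes (e.g., male, female, total)
dstruc$AGE       # age group codes
```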
Armed with this information, we can now decide what subset of the data we want. Let’s say we want to get data on unemployment duration for a few specific countries, for both males and females combined, focusing on young adults age 20–24, for the most recent year(s) available. We will need to assemble a filter that specifies those choices for each dimension.
get_dataset()
The function get_dataset() is used to download the actual data from a specified OECD dataset. You must provide the dataset ID, and you can provide a filter argument to narrow down which slices of the data you want. The filter argument expects a list of vectors, where each element of the list corresponds to one dimension of the dataset, in the order that the dimensions are defined.
If no filter is provided at all, get_dataset("XYZ") would attempt to download the entire dataset “XYZ” (which could be huge, so usually you do want to filter it). If you provide a partial filter, you typically need to specify something for each dimension you want to restrict; the function documentation suggests that leaving filters empty retrieves everything, and that you can use NULL or an empty string for dimensions you don’t want to filter (depending on how the function is implemented).
In our example with DUR_D, suppose its dimensions are, in order: Country, Sex, Age, Time. We want: Country = {Germany, France, Canada, USA}, Sex = {Total (both sexes)}, Age = {20-24}, and Time left open (we could filter a range of years or leave it to get all years). Let’s assume we want all available years for those filters.
From the structure, we can identify the codes we need:
- Germany = “DEU”, France = “FRA”, Canada = “CAN”, USA = “USA” (standard ISO 3-letter country codes, which the OECD likely uses here).
- Sex total = “MW”, standing for male and female combined.
- Age 20-24 = “2024”, the code for the 20-24 years age group.
- Time is left out of the filter, which likely means no filtering on that dimension (all available years are returned).
This gives the filter list filter_list <- list(c("DEU", "FRA", "CAN", "USA"), "MW", "2024"): the first element is the vector of country codes, the second selects both sexes combined, and the third selects the 20-24 age group.
Now we use get_dataset() with these filters:
# Define filters: Countries = DEU, FRA, CAN, USA; Sex = MW (both male & female); Age group = 20-24
filter_list <- list(c("DEU", "FRA", "CAN", "USA"),
"MW",
"2024")
# Retrieve the filtered data from dataset "DUR_D"
unemploymentOECD <- get_dataset(dataset = "DUR_D", filter = filter_list)
# Inspect the first 6 rows of the result
unemploymentOECD[1:6, ]
After running the above, unemploymentOECD will contain a data frame of the requested data. Each row will correspond to one combination of the dimensions we specified (country, sex, age, and time). Since we restricted sex to “MW” and age to “2024”, each country will have data for those categories over a series of years. The columns of the data frame typically include the dimensions and the measured value; for example, it might have columns like COUNTRY, SEX, AGE, Time, and Value (or similar naming). The first 6 rows might look like:
COUNTRY SEX AGE Time Value
1 CAN MW 2024 2005 12.3
2 CAN MW 2024 2010 14.1
3 CAN MW 2024 2015 10.7
4 DEU MW 2024 2005 8.5
5 DEU MW 2024 2010 7.9
6 DEU MW 2024 2015 6.4
...
Note: The above is illustrative; actual values and formatting may differ. The idea is that we get unemployment duration (perhaps measured in months or some index) for each country at age 20-24 for each year. We see Canada (CAN) and Germany (DEU) for the years 2005, 2010, 2015 in this snippet, with some values. The data likely goes through all years available up to the latest.
Using these functions, you can mix and match as needed: first find a dataset of interest (search_dataset()), then get its structure (get_data_structure()), then fetch data (get_dataset()). If you already know the dataset ID and the codes required, you can go straight to get_dataset() and supply the filters.
One thing to be aware of is that the OECD API might treat time as a separate dimension (like “TIME” or “Year”), or the year may simply appear as a column in the returned data frame. In the R package output, year/time is typically given explicitly as a column, as in the examples above (Time or year). You can usually filter time by adding an element to the filter list or by using the start_time and end_time arguments of get_dataset() (e.g., start_time = 2000, end_time = 2020 to restrict to years 2000–2020).
In our example, because we left time unspecified in the filter list, get_dataset() likely returned all years available for those parameters. We could instead have restricted the years with the start_time/end_time arguments.
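For instance, the same query restricted to 2000–2020 could be sketched as follows (using the start_time/end_time arguments and the filter list defined earlier):

```r
# Same query as before, but restricted to the years 2000-2020
unemploymentRecent <- get_dataset(dataset = "DUR_D",
                                  filter = filter_list,
                                  start_time = 2000,
                                  end_time = 2020)
```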
tl;dr
# Loading OECD package
library(OECD)
# List all available datasets
dataset_list <- get_datasets()
# Search all datasets with the term "unemployment" in their description
search_dataset("unemployment", data = dataset_list)
# Examine the structure of a specific dataset (e.g., "DUR_D")
dstruc <- get_data_structure("DUR_D")
str(dstruc, max.level = 1)
# Define a filter to narrow the data (e.g., countries=DEU,FRA,CAN,USA; Sex=MW; Age=2024)
filter_list <- list(c("DEU", "FRA", "CAN", "USA"), "MW", "2024")
# Retrieve the dataset "DUR_D" with the specified filter
unemploymentOECD <- get_dataset(dataset = "DUR_D", filter = filter_list)
unemploymentOECD[1:6, ]
(The above code shows the steps to search for a dataset related to “unemployment”, inspect its structure, and download a subset of that data. In practice, replace the search term and dataset ID with those relevant to your needs. Always adjust the filter list according to the actual dimensions of the dataset you are querying.)
7.3 spiR
Database Description
The Social Progress Index (SPI) is a comprehensive measure of a country’s social and environmental performance, independent of economic metrics. It was developed between 2009 and 2013 by the nonprofit organization Social Progress Imperative, with input from scholars and experts (including Michael Porter and others) to better capture human well-being and societal progress. The index is composed of 52 indicators that collectively measure how well countries provide for the essential needs of their citizens, establish the building blocks that allow citizens to improve their lives, and create conditions for all individuals to reach their full potential.
The 52 indicators of the SPI are grouped into three broad dimensions:
- Basic Human Needs: This includes indicators related to nutrition and basic medical care, water and sanitation, shelter, personal safety, etc. (Things like access to food, clean water, safe housing, and security are fundamental needs).
- Foundations of Well-being: This covers education, access to information, health and wellness, and environmental quality (e.g., literacy rates, school enrollments, access to technology and information, life expectancy, pollution levels, etc.).
- Opportunity: This dimension looks at personal rights, personal freedom and choice, inclusiveness, and access to advanced education (for example, indicators on political rights, freedom from discrimination, access to higher education, corruption, etc.).
Each of these three dimensions is further broken down into components, and each component is measured by several specific indicators. All 52 indicators together roll up into an overall Social Progress Index score for each country. The SPI is usually reported on an annual basis (in early years, not all countries were covered, but it has expanded over time). The goal of SPI is to provide a more holistic measure beyond GDP to evaluate how well societies are doing in converting economic gains into improved social outcomes.
The spiR package provides an interface to the Social Progress Imperative’s data, allowing R users to retrieve SPI data and related information. It essentially wraps the SPI API (or data source) to make it easy to get data on various countries and indicators. This includes retrieving the overall SPI scores as well as specific component indicators if needed.
Functions
The spiR package offers a few key functions to work with the Social Progress Index data:
- spir_country() – Search for countries and retrieve their ISO codes as used by the SPI database.
- spir_indicator() – Search for indicators in the SPI database and retrieve their codes (such as the code for a specific indicator or for the overall SPI).
- spir_data() – Download the actual data for specified countries, years, and indicators from the SPI dataset.
These functions are designed to help you find the correct arguments (country codes, indicator codes) and then pull the data. Let’s go through them one by one with examples.
spir_country()
To request data from the SPI API, you will need to use country codes (standardized, likely ISO 3-letter country codes). The function spir_country() helps you find the correct country code for a given country name. It takes a country name (or partial name) as an argument and returns the matching country code(s) used in the SPI system.
For example, suppose we want to get data for Canada. We should confirm what code the SPI uses for Canada. We can do:
# Loading the spiR package
library(spiR)
# Get the ISO country code for "Canada"
myCountry <- spir_country("Canada")
myCountry
If we run this, myCountry will contain the result of the search. Since “Canada” is a unique match, it will return a data frame or vector with Canada’s code. In many datasets, Canada’s code is “CAN” (the ISO 3-letter code). The output might look like:
country_name iso3
1 Canada CAN
So, spir_country("Canada") yields “CAN” as the code. If you search a more ambiguous string, like spir_country("United"), you might get multiple matches (e.g., United States, United Kingdom, United Arab Emirates, etc., each with its code). If you call spir_country() with no argument, it may list all available countries and their codes.
Knowing the country codes is helpful because the main data function spir_data() expects country codes as input.
spir_indicator()
Similarly, we need to know the indicator codes for the data we want. The SPI has one overall index (often code “SPI”) and many sub-indicators (each likely with its own code or name). The function spir_indicator() allows you to search the indicators by a keyword.
For instance, if we wanted to find indicators related to mortality (perhaps there’s an indicator about child mortality or something in the Basic Needs dimension), we could search:
# Search for an indicator containing "mortality"
myIndicator <- spir_indicator("mortality")
myIndicator
This will return any indicators whose name or description contains “mortality”. Suppose one of the SPI indicators is “Maternal Mortality Rate” or “Under-5 Mortality Rate”; the search might find it. The output could be something like:
indicator_name indicator_code
1 "Under 5 Mortality Rate (per 1,000 live births)" "mortality_u5"
2 "Maternal Mortality Rate (per 100,000 live births)" "mortality_maternal"
...
(Exact codes and names are hypothetical here, for illustration.) The idea is that spir_indicator() will provide you the code string that you need to use to request that indicator’s data.
If you call spir_indicator() without any argument, it may list all 52 indicators and their codes. This can be useful for seeing the whole catalog of what’s available, including the overall “SPI” score and all components.
For the purposes of this chapter, let’s assume we are interested in the overall Social Progress Index itself, which likely has the indicator code "SPI" for the aggregate index. That is what we’ll use in the example for data retrieval.
spir_data()
The function spir_data() is the main function to get actual data from the Social Progress Index database. It requires a few arguments:
- country: one or more country codes (using the codes we found via spir_country()). This is usually a character vector, e.g., c("USA", "FRA", "BRA").
- years: one or more years of interest (as character strings). You can specify a range or specific years, e.g., c("2014", "2015", "2016"). The SPI data in early years might not cover every year, but from 2014 onward there are annual releases (SPI 2014, SPI 2015, etc.).
- indicators: one or more indicator codes that you want to retrieve. For example, "SPI" for the overall index, or a specific code for a sub-index or component if desired.
Given these arguments, spir_data() fetches the data and returns it as a data frame in which each row corresponds to a country-year combination, with columns for the country, year, and indicator values.
Let’s retrieve the overall Social Progress Index from 2014 through 2019 for the USA, France, Brazil, China, South Africa, and Canada. The indicator code for the main index is "SPI" (we could confirm this via spir_indicator("Social Progress Index") if needed), and the country codes are USA, FRA, BRA, CHN, ZAF, and CAN.
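The call itself (also shown in the tl;dr at the end of this section) is:

```r
# Overall SPI for six countries, 2014-2019; years passed as strings
myData <- spir_data(country = c("USA", "FRA", "BRA", "CHN", "ZAF", "CAN"),
                    years = c("2014", "2015", "2016", "2017", "2018", "2019"),
                    indicators = "SPI")
head(myData)
```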
Once this runs, myData will contain the SPI values for those countries and years. Note that we passed the years as quoted character strings: years could often be numeric, but this API expects them as text, and the quotes ensure they are treated that way. The output of head(myData) might look like:
country iso3 indicator year value
1 Brazil BRA SPI 2014 67.91
2 Brazil BRA SPI 2015 69.58
3 Brazil BRA SPI 2016 70.40
4 Brazil BRA SPI 2017 71.00
5 Brazil BRA SPI 2018 72.29
6 Brazil BRA SPI 2019 72.89
This excerpt shows one country (Brazil) across 2014–2019, with the SPI score (on a 0–100 scale) for each year. In the actual myData, all requested countries would be present, so there would also be rows for the USA, France, and so on. The columns include the country name, the ISO3 country code, the indicator (SPI), the year, and the value (the score).
You could then use this data frame to compare how different countries’ social progress scores have changed over time. For instance, you might plot each country’s SPI over the years.
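As a sketch of such a plot, assuming the country/year/value column layout shown above (Brazil’s numbers come from the printout; Canada’s are invented for illustration):

```r
# Toy data shaped like the head(myData) output above; Brazil's values
# are from the printout, Canada's are made up for this demo.
spi <- data.frame(
  country = rep(c("Brazil", "Canada"), each = 3),
  year    = rep(2014:2016, times = 2),
  value   = c(67.91, 69.58, 70.40, 86.1, 86.5, 86.9)
)

# Base R: draw one line per country on shared axes
plot(value ~ year, data = spi, type = "n",
     xlab = "Year", ylab = "Social Progress Index")
for (cty in unique(spi$country)) {
  lines(value ~ year, data = subset(spi, country == cty))
}
```

With the real myData you would drop the toy data frame and plot the retrieved values directly.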
The spir_data() function can also retrieve multiple indicators at once if you provide a vector of indicator codes. API wrappers like this typically return a long format: each row holds a single indicator for a single country-year, with an indicator-code column and a value column, as shown above. Since we requested only "SPI", we see one row per country-year; had we requested additional indicators, we would likely see multiple rows per country-year (one per indicator), unless the function pivots them into columns. Check the documentation for the exact behavior; in any case, you can always reshape the data after retrieval.
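For instance, a long table can be pivoted to one column per indicator with base R’s reshape(); the second indicator code here ("basic_needs") is invented for illustration:

```r
# Toy long-format table: one row per country-year-indicator;
# "basic_needs" is a hypothetical indicator code for the demo
long <- data.frame(
  iso3      = c("BRA", "BRA", "CAN", "CAN"),
  year      = c(2019, 2019, 2019, 2019),
  indicator = c("SPI", "basic_needs", "SPI", "basic_needs"),
  value     = c(72.89, 80.1, 89.0, 95.2)
)

# Pivot to wide: the values of 'indicator' become column suffixes
wide <- reshape(long, idvar = c("iso3", "year"),
                timevar = "indicator", direction = "wide")
wide  # columns: iso3, year, value.SPI, value.basic_needs
```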
Finally, if you want to explore more of what you can do with SPI data, the spiR package documentation and the in-depth application referenced on warin.ca provide further examples, such as building dashboards or visualizations similar to those on the Social Progress Imperative’s own site.
tl;dr
# Loading the spiR package
library(spiR)
# Find the ISO code for a specific country (e.g., Canada)
myCountry <- spir_country("Canada")
myCountry # should return "CAN" for Canada
# Search for an indicator by keyword (e.g., "mortality")
myIndicator <- spir_indicator("mortality")
myIndicator # returns any indicator codes that include "mortality"
# Retrieve data for the Social Progress Index (SPI) for selected countries and years
myData <- spir_data(country = c("USA", "FRA", "BRA", "CHN", "ZAF", "CAN"),
years = c("2014","2015","2016","2017","2018","2019"),
indicators = "SPI")
head(myData)
(The above code demonstrates how to look up country and indicator codes and how to fetch the overall SPI data for a set of countries over a range of years. In practice, use spir_indicator() to find other indicators and replace "SPI" with the code of any sub-indicator you wish to retrieve.)
7.4 statcanR
Database Description
Statistics Canada (often abbreviated as StatCan) is the national statistical agency of Canada. It produces a vast amount of data on Canada’s economy, society, and environment. The Statistics Canada open data portal includes data on about 30 broad subjects, including agriculture, energy, environment, education, health, economics, demographics, and more. These data are available at various geographic levels, such as national (Canada), provincial/territorial, metropolitan areas, etc., depending on the dataset.
Historically, much of StatCan’s data was accessible via something called CANSIM tables (Canadian Socio-economic Information Management system). In recent years, they modernized their platform and now refer to data tables by a Product ID (PID) code like “27-10-0014-01”. The StatCan Open Data API (also known as the Web Data Service) allows programmatic access to these tables. Each table’s data can be retrieved by referencing its table number or product ID.
The statcanR package provides a user-friendly way for R users to access Statistics Canada data. It wraps the web service API so that, given a table ID, it fetches the data and returns it as an R data frame or tibble. This saves you from manually downloading CSV files or writing your own HTTP requests.
However, one challenge is that you need to know the table’s ID before you can fetch it (the API requires a specific table number). The statcanR workflow typically involves two steps:
- Find the table ID for the data you want. This is usually done with the Statistics Canada website’s search, since statcanR itself does not provide a search function (as of the information we have). Use the StatCan data portal or an online search to identify the table number for your topic of interest.
- Use statcan_data() to retrieve that table’s data. Once you have the table ID, call the function and it returns the data.
Let’s go through these steps with an example.
Functions
The main function in statcanR we will use is statcan_data(). Before calling it, we usually need to find the table ID (unless we already know it). So we can think in terms of:
- Searching for data (the table ID) outside the package, via the StatCan website or an index of table IDs.
- statcan_data(), the function that fetches data given a table ID and language.
(There is no dedicated search function like statcan_search() in this package, as far as the provided material suggests; instead, the guidance is to use the website to find the ID.)
Search for data (Finding the table ID)
To find a Statistics Canada table ID, use the official StatCan data portal search at https://www150.statcan.gc.ca/n1/en/type/data?MM=1. For example, say we are interested in “federal expenditures on science and technology by socio-economic objectives.” Typing a few keywords such as “federal expenditures science technology socio-economic objective” into that search page should bring up relevant results.
Doing that search turns up a table with exactly that description: “Federal expenditures on science and technology by socio-economic objectives,” table number 27-10-0014-01. This is the unique identifier for that dataset.
StatCan table IDs follow the format two digits, two digits, four digits, two digits, separated by hyphens. For 27-10-0014-01:
- The first two digits (27) likely represent a subject category.
- The next two (10) are probably a sub-category, or simply part of the coding.
- 0014 is the specific table number, and 01 likely indicates the version (tables whose structure is updated over time get a new final pair of digits).
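That pattern is easy to check mechanically. Here is a small, self-contained illustration (a helper we are writing ourselves, not part of statcanR):

```r
# Validate the StatCan table ID pattern:
# 2 digits - 2 digits - 4 digits - 2 digits
is_statcan_id <- function(id) {
  grepl("^[0-9]{2}-[0-9]{2}-[0-9]{4}-[0-9]{2}$", id)
}

is_statcan_id("27-10-0014-01")  # TRUE
is_statcan_id("27-10-0014")     # FALSE (missing the trailing version)
```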
Once we have this ID, we are ready to fetch the data via the API.
statcan_data()
The function statcan_data(table, lang) fetches the data for a given table number. It has two main arguments:
- The first argument is the table number as a string, e.g., "27-10-0014-01".
- The second argument is the language of the data: "eng" for English or "fra" for French. StatCan publishes data in both official languages, so the table content (column names and category labels) can be fetched in either.
In our example, we’ll use the table ID we found, "27-10-0014-01", and request the data in English.
# Loading the statcanR package
library(statcanR)
# Fetch data from Statistics Canada table 27-10-0014-01 in English
mydata <- statcan_data("27-10-0014-01", lang = "eng")
# Examine the first few rows of the data
head(mydata)
After running this, mydata contains the data frame for that table, and head(mydata) shows the first several rows. The structure depends on the table, but StatCan tables are typically in long format: each row is one combination of the classification dimensions together with a value. Columns usually include REF_DATE (the time period, e.g., the year), dimension columns such as geography and category, and a VALUE column holding the numeric value.
For instance, since this table covers federal expenditures on S&T by socio-economic objective, the dimensions are likely the year and the objective category. We might see columns like:
REF_DATE GEO Objective VALUE
2018 Canada Defence 500.0
2018 Canada Economic development 300.0
2018 Canada ... ...
2019 Canada Defence 520.0
...
(This is hypothetical data to illustrate the format.) Essentially, each row is one category of expenditure in a given year, with the value being the amount spent (likely in millions of dollars). The actual table has its own specific breakdown.
One nice thing is that statcan_data() likely returns a tibble (a modern type of data frame in R) with proper column names and labels in English, since we requested lang = "eng". Had we requested French (lang = "fra"), the labels for the objectives, and possibly the column names, would appear in French.
At this point, we have the data needed and can proceed to analyze or visualize it. For example, we could sum up various objectives or see trends over time.
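As a sketch of such an aggregation, using the hypothetical column names from the printout above (REF_DATE, Objective, VALUE) and made-up numbers:

```r
# Toy stand-in for the retrieved table; the real table's column
# names and categories may differ from these assumed ones
mydata_demo <- data.frame(
  REF_DATE  = c(2018, 2018, 2019, 2019),
  Objective = c("Defence", "Economic development",
                "Defence", "Economic development"),
  VALUE     = c(500, 300, 520, 310)
)

# Total S&T expenditure per year, summed across objectives
totals <- aggregate(VALUE ~ REF_DATE, data = mydata_demo, FUN = sum)
totals
#   REF_DATE VALUE
# 1     2018   800
# 2     2019   830
```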
The statcanR package makes it straightforward to get the latest data for a table without manually downloading a CSV from the website. Moreover, if the table is updated with new data (for a new year, say), running statcan_data() again later will fetch the updated data, assuming the table ID remains the same.
One should note: to use the StatCan API without this package, you would normally have to know the API endpoint and handle the CSV or JSON yourself. statcanR abstracts that away; you just provide the table ID.
If you are curious for more, the statcanR documentation and the referenced blog post have further examples (such as dealing with very large tables or manipulating the results). But the essential workflow is covered here: find the table ID and use statcan_data().
tl;dr
# Loading the statcanR package
library(statcanR)
# Use statcan_data() to retrieve a specific table by its ID (English version)
mydata <- statcan_data("27-10-0014-01", "eng")
# View the first few rows of the retrieved data
head(mydata)
(In practice, replace "27-10-0014-01" with the table ID of the dataset you need from Statistics Canada. Use "eng" for English or "fra" for French as the language argument. Remember to find the table ID via the StatCan website or documentation before using this function.)
7.5 EpiBibR
Database Description
The EpiBibR package is an R wrapper designed to provide easy access to a large bibliographic dataset, particularly focused on COVID-19 and other related medical research references. During the COVID-19 global crisis, the volume of scientific literature on the topic skyrocketed. Having a comprehensive bibliographic database of COVID-19 research (articles, preprints, letters, news articles, etc.) is valuable for researchers conducting literature reviews, trend analysis, or bibliometric studies. EpiBibR was created to make over 100,000 such references available directly through R.
In essence, EpiBibR contains (or connects to) a database of bibliographic entries (like what you’d find in PubMed or other scholarly databases) that are related to epidemiology and specifically COVID-19. Each entry in the database includes various fields, such as authors, title, abstract, publication year, journal, etc. This is analogous to having a huge bibliography or library catalog that you can query with code.
To give an idea of what information each reference record contains, here are some of the fields available (with their typical tags):
- AU – Authors (the list of authors of the paper)
- TI – Title of the document
- AB – Abstract of the paper
- PY – Publication Year
- DT – Document Type (e.g., Article, Letter, News, etc.)
- MESH – Medical Subject Headings (keywords/topics assigned)
- TC – Times Cited (citation count, if available)
- SO – Source (publication name, e.g., journal or news source)
- J9 – Source abbreviation
- JI – ISO source abbreviation
- ISSN – International Standard Serial Number (journal identifier)
- VOL, ISSUE – Volume and Issue number (for journal articles)
- ID – PubMed ID (if applicable)
- DE – Authors’ Keywords (keywords given by authors)
- UT – Unique Article Identifier (possibly Web of Science ID or similar)
- AU_CO – Author’s Country of Origin (which might be derived from author affiliations)
- DB – Database from which the record is sourced (e.g., which bibliographic database)
The above is a lot of information – essentially, EpiBibR is giving you a bibliographic dataset akin to a large reference manager file that you can query. The typical use would be: you query the data for certain criteria (like author name, year, keywords, etc.) and get back a subset of references matching those criteria.
Functions
The EpiBibR package provides a main function for data retrieval and allows filtering by various fields:
- epibibr_data(): the primary function to retrieve bibliographic references, with arguments that filter by author, country, year, title keywords, abstract keywords, and source (journal name), among others.
The usage of epibibr_data() is very flexible: you can provide no filters, one, or several. Without any arguments it returns the entire dataset (which is huge); with arguments it filters accordingly.
Let’s go through some examples:
epibibr_data()
- Retrieving the entire dataset: If you call epibibr_data() with no arguments, it retrieves the entire bibliographic data frame, containing all references (over 100,000 entries). For example:
library(EpiBibR)
complete_data <- epibibr_data()
This populates complete_data with the entire database. The result can be quite large in memory, so you usually only do this if you truly need everything; otherwise, retrieve a subset based on some criteria.
- Filtering by author: To get all references authored by a certain person, use the author argument. For example, to get all articles written by someone with the last name Colson (as in Philippe Colson, a microbiologist who authored many COVID-19 papers):
colson_articles <- epibibr_data(author = "Colson")
This searches the author field for “Colson” and returns all entries where at least one author matches that name. Each row of colson_articles includes all the fields (title, year, etc.) for a paper with Colson as an author. You might then check nrow(colson_articles) to see how many such papers are in the database.
- Filtering by author and year: Filters can be combined to narrow the search. For example, to get references authored by someone named Yang in the year 2020:
yang2020 <- epibibr_data(author = "Yang", year = "2020")
This returns all records where an author’s name contains “Yang” and the publication year is 2020; the result contains only references meeting both criteria (a logical AND between filters).
- Filtering by author’s country: To find references based on the country of origin of the authors (derived from author affiliations), use the country argument. For example, to get all references with at least one author from Canada:
canada_articles <- epibibr_data(country = "Canada")
This searches the author address/affiliation field for “Canada” and returns those references.
- Filtering by title keyword: To find articles with a certain keyword in the title, use the title argument. For example, to get all references whose titles contain “covid”:
covid_articles <- epibibr_data(title = "covid")
This will return a great many references, since “COVID” appears in so many titles. The search is likely case-insensitive and matches the substring “covid” anywhere in the title.
- Combining multiple criteria: Several arguments can be given at once, and the result satisfies all of them. For instance, references authored by someone named “Yang”, with “covid” in the title, published in 2020:
yangcovid2020_articles <- epibibr_data(author = "Yang", title = "covid", year = "2020")
This finds references where all three conditions hold. We could add even more criteria, for example a source.
- Adding source as another filter: The source argument filters by publication source (such as a journal or preprint server name). For example, to refine the search above to sources whose name contains “bio” (perhaps “bioRxiv” or a journal with “Biology” in its title):
yangcovid2020bio_articles <- epibibr_data(author = "Yang", title = "covid", year = "2020", source = "bio")
Now references must meet all four filters, which narrows the results considerably: papers by authors named Yang, published in 2020, about COVID, in biology-related outlets.
- Filtering by abstract keyword: You can also search within the abstract text using the abstract argument. For example, to find references that mention “coronavirus” in the abstract:
coronavirus_articles <- epibibr_data(abstract = "coronavirus")
This returns references whose abstracts contain the word “coronavirus”, a powerful way to find papers about coronaviruses even when the title doesn’t say so explicitly.
All these filters can be used standalone or in combination. The result of any epibibr_data() call is a data frame (or tibble) of the matching references: each row is one reference, with columns corresponding to fields such as AU, TI, AB, PY, and SO. Since the package was designed to integrate with the bibliometrix package, it likely preserves the standard field tags so that bibliometrix can read the data directly.
One thing to keep in mind is that text searches (like title = "covid") match anywhere in the field, so “COVID-19” or “covid19” would match because “covid” is a substring. Similarly, author = "Yang" matches any author name containing those letters, not only the exact surname (a name like “Yangus” could match too). The specifics depend on the implementation; most likely it performs a case-insensitive substring search.
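That kind of matching can be illustrated with base R’s grepl(); note this is an assumption about how the package filters text, not its confirmed internals:

```r
# Case-insensitive substring matching, as epibibr_data() presumably
# applies to text fields (an assumption, not verified in the source)
titles <- c("COVID-19 in Wuhan", "A covid19 case report",
            "Influenza surveillance")

grepl("covid", titles, ignore.case = TRUE)
# [1]  TRUE  TRUE FALSE
```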
The ability to combine filters means you can tailor very specific queries, which is great for slicing the data (for example, find how many papers a particular author wrote in a given year on a certain topic).
One caution: if you combine filters whose results don’t overlap, you get zero rows. For instance, if no author named Yang wrote a COVID-titled paper in 2020 in a source with “bio” in its name, then yangcovid2020bio_articles would be empty (0 rows).
EpiBibR essentially puts a research literature database at your fingertips. A researcher could use it for trend analysis (how many COVID papers per year), collaboration network analysis (using authors and their countries), or topic modeling on abstracts.
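As a tiny example of such a trend count, here is a tally of references per publication year via the PY field described earlier (the rows are invented for the demo; real results will differ):

```r
# Count references per publication year; PY is the publication-year
# field from the field list above, and these rows are made up
refs <- data.frame(PY = c("2019", "2020", "2020", "2020", "2021"))

table(refs$PY)
#
# 2019 2020 2021
#    1    3    1
```

With real data, the same one-liner applied to a retrieved subset (e.g., covid_articles$PY) gives the publication trend for that query.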
tl;dr
# Loading the EpiBibR package
library(EpiBibR)
# Retrieve the entire dataset (all references) - large output
complete_data <- epibibr_data()
# Examples of filtered searches:
colson_articles <- epibibr_data(author = "Colson") # all references with an author named Colson
yang2020 <- epibibr_data(author = "Yang", year = "2020") # references authored by "Yang" in the year 2020
canada_articles <- epibibr_data(country = "Canada") # references where an author's country is Canada
covid_articles <- epibibr_data(title = "covid") # references with "covid" in the title
yangcovid2020_articles <- epibibr_data(author = "Yang", title = "covid", year = "2020")
# references that satisfy: author name has "Yang", title has "covid", and year is 2020
yangcovid2020bio_articles <- epibibr_data(author = "Yang", title = "covid", year = "2020", source = "bio")
# further narrows above: in addition, source contains "bio" (perhaps BioRxiv or similar)
coronavirus_articles <- epibibr_data(abstract = "coronavirus")
# references with "coronavirus" in the abstract
(The above code block demonstrates how to use epibibr_data() in various ways: with no filters (all data) and with different combinations of filters for author, year, country, title, source, and abstract. These are examples; adjust the strings to search for different authors, keywords, or years, depending on your research needs.)