7 Automating Data Collection with APIs
In data science projects, having access to interesting datasets is essential fuel for analysis. While one approach is to manually download CSV files from various websites, a more dynamic and powerful method is to use APIs (Application Programming Interfaces). An API is essentially a set of rules or protocols that enables different software applications to communicate and exchange data and functionality. In practical terms, many data providers offer web APIs that allow direct retrieval of up-to-date data via code, rather than manually downloading files. Using APIs can ensure you are working with the latest data and can automate the data acquisition process.
In this chapter, we will focus on how to import data using APIs in R through specialized packages. Each package acts as a wrapper around an external data source’s API, handling the communication details so that you can simply call R functions to fetch data. Before diving into specific examples, it’s important to clarify what we mean by an argument in this context. In programming, an argument is a value or input that you pass to a function when you call it, which influences how the function operates. For instance, if a function is defined to take a country code as an argument, you would provide a specific country code value when calling that function. Understanding arguments is crucial because all the API-related functions we’ll use require certain arguments (like indicator codes, country codes, years, etc.) to specify what data you want.
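To make the idea of an argument concrete, here is a toy R function (not part of any API package, purely for illustration) that takes two arguments:

```r
# A toy function with two arguments: a country code and a year
describeRequest <- function(country, year) {
  paste("Requesting data for", country, "in", year)
}

# "CA" and 2016 are the arguments supplied in this call
describeRequest(country = "CA", year = 2016)
```

The values "CA" and 2016 are the arguments; changing them changes what the function does, which is exactly how the API functions in this chapter work.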
At the end of the chapter, you should be able to:
- Know what an argument is. You will understand how functions use arguments (inputs) to modify their behavior, and how to supply the correct arguments to get the data you need.
- Import data using an API in R. You will learn to use various R packages that interface with web APIs to retrieve data from online sources, avoiding manual downloads and making your data acquisition process reproducible and up-to-date.
We will explore several real-world examples of R packages that provide easy access to data via APIs. These include:
- WDI: Access World Bank’s World Development Indicators and other datasets.
- OECD: Retrieve indicators and datasets from the OECD (Organisation for Economic Co-operation and Development).
- spiR: Access the Social Progress Index data.
- statcanR: Download data from Statistics Canada’s open data portal.
- EpiBibR: Obtain bibliographic references (especially COVID-19 related literature data).
- coronavirus: Get daily COVID-19 statistics (cases, deaths, etc.) from the Johns Hopkins University dataset.
For each of these, we will discuss the data source, the key functions provided by the R package, and walk through examples of how to use them. By the end, you will see how APIs combined with R packages allow you to seamlessly pull in data from a variety of domains with just a few lines of code.
7.1 WDI
Database Description
The World Development Indicators (WDI) database is a flagship dataset of the World Bank containing a wide range of global development statistics. It is a compilation of relevant, high-quality, and internationally comparable statistics about global development and the fight against poverty. The database includes over 1,600 time-series indicators for 217 economies and more than 40 country groups, with data for many indicators going back over 50 years. These indicators cover topics such as economic growth, education, health, poverty, environmental factors, and much more.
What makes the WDI particularly powerful is that it’s part of a larger collection of data sources provided by the World Bank. In fact, the R package WDI allows users not only to access the main World Development Indicators but also dozens of other datasets hosted by the World Bank (e.g., International Debt Statistics, Doing Business indicators, Human Capital Index, etc.). This means that through one package, you can tap into a rich variety of development data.
Using the World Bank’s API through the WDI package has several advantages: the data are always up-to-date (as of the last World Bank update), you can easily retrieve long time series for multiple countries, and you can programmatically search for indicators by keywords. The data returned by the API is typically in a tidy country-year format – each row corresponds to a country and year, with columns for the indicator values (and possibly country codes and other metadata). This format is convenient for analysis and plotting in R once you have the data.
Functions
The WDI R package provides two key functions to interact with the World Bank data:
- WDIsearch() – Search for indicators by keyword.
- WDI() – Download data for specified indicators, countries, and time ranges.
Additionally, the package includes some utility functions (such as WDIcache and WDIbulk) for advanced usage, but our focus will be on the two main functions above, which cover most typical needs. We will go through each of these functions with examples to illustrate how they work and how to use the correct arguments to get the data you want.
WDIsearch()
The function WDIsearch() allows you to find indicators in the World Bank datasets by searching for keywords in the indicator name or description. This is extremely useful when you know the topic you’re interested in (e.g., GDP, life expectancy, CO2 emissions) but need to find the exact indicator code that the World Bank uses for that data. The WDIsearch() function takes a character string as an argument – this string is the keyword or pattern you want to search for. The function returns a data frame of indicators whose names or descriptions match the search term.
For example, suppose we want to find all indicators related to GDP. We can use WDIsearch("GDP") to retrieve a list of such indicators:
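```r
# Loading the WDI library
library(WDI)

# Search all indicators whose name or description contains "GDP"
listOfIndicators <- WDIsearch("GDP")

# List the first 5 indicators found
listOfIndicators[1:5, ]
```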
In the code above, WDIsearch("GDP") searches the World Bank’s catalog of series for the substring “GDP”. The result is assigned to listOfIndicators, which will be a data frame. The data frame typically has columns like “indicator” (the official indicator code used by the API) and “name” (a human-readable description of the indicator). By printing the first 5 rows (listOfIndicators[1:5, ]), we might see something like:
indicator name
[1,] "NY.GDP.MKTP.CD" "GDP (current US$)"
[2,] "NY.GDP.MKTP.KD.ZG" "GDP growth (annual %)"
[3,] "NY.GDP.PCAP.CD" "GDP per capita (current US$)"
[4,] "NY.GDP.PCAP.KD.ZG" "GDP per capita growth (annual %)"
[5,] "NE.GDI.TOTL.ZS" "Gross capital formation (% of GDP)"
Example output: The above is an illustrative example of what the search results might contain (actual results may differ or appear in a different order). Each row shows an indicator code and its description. For instance, "NY.GDP.MKTP.CD" is the code for total GDP in current US dollars, and "NY.GDP.PCAP.CD" is GDP per capita in current US dollars. Using this list, you can identify the specific indicator code you need for your analysis.
The ability to search by keyword (case-insensitive, and even supporting regular expressions in WDIsearch()) makes it much easier to find the right data without having to manually browse the World Bank website. Once you have the indicator code(s) you need, you can use the WDI() function to download the data.
WDI()
The WDI() function is used to actually retrieve data for one or more indicators and one or more countries over a specified time range. The main arguments for WDI() are:
- indicator: The indicator code or codes you want to download (as a string or a vector of strings).
- country: The country code or codes for which you want the data. Typically these are ISO-2 or ISO-3 country codes (the World Bank uses ISO-2 country codes by default). You can also use the special code "all" to retrieve all countries, or codes for aggregates such as the OECD countries as a group.
- start: The starting year for the data (a numeric year, or in some cases a string if using quarterly/monthly data).
- end: The ending year for the data.
There are additional optional arguments as well. For example, extra = TRUE can be used to fetch additional columns (like region, income level, etc., for each country), and cache can be used to supply or update the locally cached list of indicators. But in many cases you can ignore these extras and just specify the four main arguments above.
Let’s consider a concrete example. Suppose we are interested in the indicator “Stocks traded, total value (% of GDP)”, which has the code CM.MKT.TRAD.GD.ZS in the WDI database. We want to gather this data for four countries – France, Canada, the United States, and China – over the period 2000 to 2016. Using WDI(), we can do this as follows:
library(WDI)
# Access and store data for "Stocks traded, total value (% of GDP)"
# for France (FR), Canada (CA), USA (US), and China (CN) from 2000 to 2016
stockTraded <- WDI(indicator = "CM.MKT.TRAD.GD.ZS",
country = c("FR", "CA", "US", "CN"),
start = 2000,
end = 2016)
# Peek at the first few rows of the retrieved data
head(stockTraded)
When this code runs, the WDI() function contacts the World Bank API behind the scenes and downloads the requested data. The result, stored in stockTraded, is a data frame. If we inspect it (using head(stockTraded) to see the first several rows), we might see something like:
iso2c country year CM.MKT.TRAD.GD.ZS
1 CA Canada 2016 147.92
2 CA Canada 2015 126.85
3 CA Canada 2014 123.45
4 CA Canada 2013 115.67
5 CA Canada 2012 105.23
6 CA Canada 2011 98.10
...
Example explanation: Each row of the data frame represents a country-year observation. In this example, the columns include: an ISO 2-letter country code (iso2c), the country name, the year, and a column named after the indicator code (CM.MKT.TRAD.GD.ZS) which contains the value of stocks traded (% of GDP) for that country in that year. The data frame is sorted by country and year (typically countries in alphabetical order by code and years in descending order, but the ordering may vary). For instance, above we see data for Canada (CA) for 2016, 2015, etc. If we scroll further down the data frame, we would find the entries for China (CN), France (FR), and the USA (US), each with values from 2000 up to 2016.
With this data frame in hand, you could proceed to analyze it or visualize it (e.g., compare the trends of that indicator across the four countries). The key point is that a single function call gave us a structured dataset ready for use, which is much more convenient than manually finding each country’s data from a website.
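For instance, a quick comparison plot can be sketched with ggplot2 (a sketch assuming the ggplot2 package is installed and the column names are as shown above):

```r
library(ggplot2)

# Compare stocks traded (% of GDP) across the four countries over time
ggplot(stockTraded,
       aes(x = year, y = CM.MKT.TRAD.GD.ZS, color = country)) +
  geom_line() +
  labs(x = "Year",
       y = "Stocks traded, total value (% of GDP)",
       color = "Country")
```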
Note: If you request a range of years where some countries don’t have data for the entire range, the resulting data frame will contain NA (missing) values for those country-years where data is absent. This is normal, since not all indicators have values for every country and year.
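If you prefer to work only with complete rows, one simple option (a base R sketch, not something the package requires) is na.omit():

```r
# Drop rows where any value, including the indicator, is missing
stockTradedComplete <- na.omit(stockTraded)
```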
As a final tip, you can request multiple indicators in one go by providing a vector of indicator codes to the indicator argument. In that case, WDI() will return one column per indicator (plus the country, code, and year columns). There is even a convenient feature: if you name the elements of the indicator vector in R, those names will be used as column names in the result. For example, WDI(indicator = c(gdp = "NY.GDP.MKTP.CD", pop = "SP.POP.TOTL"), country = "US", start = 2010, end = 2020) would fetch GDP and population for the U.S. and name the columns gdp and pop respectively in the output data frame.
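Written out as a code block, that multi-indicator call looks like this (the object name usData is just an illustrative choice):

```r
library(WDI)

# Fetch two indicators at once; the vector names become column names
usData <- WDI(indicator = c(gdp = "NY.GDP.MKTP.CD",
                            pop = "SP.POP.TOTL"),
              country = "US",
              start = 2010,
              end = 2020)

head(usData)
```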
If you want to explore more about the WDI package and see another detailed example of its usage, you can refer to the World Bank data application example provided by the package authors, which demonstrates a case study of retrieving and analyzing data from the WDI.
tl;dr
# Loading the WDI library
library(WDI)
# Search all indicators with the term "GDP"
listOfIndicators <- WDIsearch("GDP")
# List the first 5 indicators found
listOfIndicators[1:5, ]
# Retrieve data for a specific indicator and set of countries over a time range
stockTraded <- WDI(indicator = "CM.MKT.TRAD.GD.ZS",
country = c("FR", "CA", "US", "CN"),
start = 2000,
end = 2016)
head(stockTraded)
(The above code summary shows how to search for indicators containing “GDP” and how to download one example indicator for multiple countries. In practice, replace the indicator code and country codes with those relevant to your needs.)
7.2 OECD
Database Description
The Organisation for Economic Co-operation and Development (OECD) maintains a rich database of economic and social indicators for its member countries (and in some cases, non-members). The OECD data spans numerous categories such as agriculture, finance, health, education, labor, and more. The public-facing OECD data portal (often accessed via data.oecd.org) features around 300 key indicators organized into about a dozen categories. These include high-level indicators like unemployment rates, GDP for OECD countries, population statistics, etc., which are curated for easy browsing.
However, the actual breadth of data available through the OECD’s API is much larger. The OECD provides a flexible API that allows access to a wide array of datasets, each identified by a unique dataset code. Each dataset can contain many series and dimensions (for example, a dataset might contain data broken down by country, by sex, by age group, by year, etc.). The R package OECD is designed to help users discover and download data from this API without needing to manually construct complex query URLs.
The OECD data service gives you access to up-to-date statistics across various domains for many countries (mostly OECD members). This can include indicators like employment rates, economic outlook figures, education enrollment, health outcomes, and so on. The data is often structured by multiple dimensions (country, year, and often other classifications like gender, age, or industry sector, depending on the dataset). Using the R OECD package, we can search for datasets and then retrieve the specific slices of data we need by specifying filters.
Functions
The OECD R package provides several functions to interact with the OECD API. The main functions we will highlight are:
- get_datasets() – Retrieve a list of all available datasets (by ID and description).
- search_dataset() – Search within the dataset list for keywords, to find relevant dataset IDs by topic.
- get_data_structure() – Given a specific dataset ID, get the structure of that dataset (i.e., what dimensions it has and the valid codes for each dimension).
- get_dataset() – Download data from a specific dataset, optionally filtering by dimension codes (such as specific countries, years, etc.).
Each of these functions is useful at a different stage: first discovering what data exists, then understanding the structure of a specific dataset, and finally retrieving the data of interest. We will go through examples of their usage.
(Note: the function names in the package are easy to confuse because of the singular/plural distinction. get_datasets() (plural) returns the list of dataset identifiers and descriptions, while get_dataset() (singular) downloads the data for one dataset. Be careful to use the correct function.)
get_datasets() – Listing available datasets
Before searching or fetching data, you may want to know what datasets are available via the OECD API. The function get_datasets() returns a data frame of all dataset identifiers along with their descriptions. This is typically a large list (since the OECD has many datasets). You can store this list and then use it for searching. For example:
# Loading OECD package
library(OECD)
# List all available datasets and store in a data frame
dataset_list <- get_datasets()
# Check the first few entries in the dataset list
head(dataset_list)
After running get_datasets(), dataset_list might contain entries like:
id description
1 AFLPMEAN Average length of parental leave, measured in weeks
2 AGRI_INV Agricultural Innovation Indicators...
3 AEO African Economic Outlook...
4 BLI Better Life Index...
...
This is just illustrative. Each row has a dataset id (a short code like “AEO”, “BLI”, etc.) and a longer description explaining what that dataset is. Given that the list can be long, it’s often more practical to search for a keyword rather than scrolling through it manually. That’s where search_dataset() comes in.
search_dataset()
The function search_dataset() helps you find which dataset(s) might contain the data you need by searching within the dataset names and descriptions. You provide a keyword (or regex pattern) and a data frame of datasets to search (typically the result of get_datasets()).
For example, if we are interested in data about unemployment, we can search the dataset list for the term “unemployment”:
# Assuming dataset_list has been obtained via get_datasets()
search_results <- search_dataset("unemployment", data = dataset_list)
This call will return a data frame of datasets whose descriptions contain “unemployment”. For instance, one likely result is a dataset that deals with unemployment by duration. The output might look like this (for illustration):
id description
1 DUR_D Unemployment duration by age group and gender
2 ... ... (other related datasets if any)
From this search result, suppose we identify that DUR_D is the dataset we want (as an example, DUR_D might stand for “Duration of unemployment by demographic breakdown”). Now that we have a specific dataset ID, the next step is to see what the structure of that dataset is (what dimensions and codes it uses) so we can query it properly.
get_data_structure()
Every OECD dataset can have multiple dimensions. For example, a dataset might be broken down by country, by year, by gender, by age group, etc. To know how to formulate our query and what filters to use, we need to know what dimensions a dataset has and what the valid codes for those dimensions are. The function get_data_structure(dataset_id) returns an object (often a list of data frames) describing the dataset’s structure.
Continuing our example with dataset "DUR_D" (unemployment duration), let’s retrieve its structure:
# Get the structure of the "DUR_D" dataset
dstruc <- get_data_structure("DUR_D")
# Examine the structure object
str(dstruc, max.level = 1)
When you call get_data_structure("DUR_D"), behind the scenes R fetches metadata about that dataset from the OECD API. The result dstruc is typically a list where each element corresponds to a dimension of the dataset. For instance, dstruc might contain elements like $COUNTRY, $AGE, $SEX, and $TIME (these are hypothetical), each of which is a data frame listing the codes and meanings for that dimension.
Using str(dstruc, max.level = 1) will print an overview of the list structure. You might see something like:
List of 4
$ COUNTRY: 'data.frame': ... (country codes and names)
$ AGE : 'data.frame': ... (age group codes and descriptions)
$ SEX : 'data.frame': ... (sex codes and descriptions)
$ TIME : 'data.frame': ... (time period info, possibly years available)
This tells us that the dataset DUR_D is broken down by Country, Age, Sex, and Time (year). To find the actual codes, we can inspect each of those elements. For example, dstruc$COUNTRY might show entries like USA = "United States", FRA = "France", etc. dstruc$SEX might show codes like M = "Male", W = "Female", MW = "Total (Male+Female)". dstruc$AGE might list codes like TOTAL = "All ages", Y15-24 = "15-24 years", etc. (The actual codes can vary; these are just plausible examples.)
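In practice, you would simply print the relevant list elements to see the valid codes (a sketch assuming the hypothetical dimension names used above):

```r
# Inspect the valid codes for each dimension of "DUR_D"
dstruc$COUNTRY   # country codes and country names
dstruc$SEX       # sex codes (e.g., male, female, total)
dstruc$AGE       # age group codes
```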
Armed with this information, we can now decide what subset of the data we want. Let’s say we want to get data on unemployment duration for a few specific countries, for both males and females combined, focusing on young adults age 20–24, for the most recent year(s) available. We will need to assemble a filter that specifies those choices for each dimension.
get_dataset()
The function get_dataset() is used to download the actual data from a specified OECD dataset. You must provide the dataset ID, and you can provide a filter argument to narrow down which slices of the data you want. The filter argument expects a list of vectors, where each element of the list corresponds to one dimension of the dataset, in the order that the dimensions are defined.
If no filter is provided at all, get_dataset("XYZ") would attempt to download the entire dataset “XYZ” (which could be huge, so usually you do want to filter it). If you provide a partial filter, you typically need to specify something for each dimension you want to restrict; the function documentation suggests that leaving filters empty retrieves everything, and that you can use NULL or an empty string for dimensions you don’t want to filter (depending on how the function is implemented).
In our example with DUR_D, suppose its dimensions are, in order: Country, Sex, Age, Time. We want: Country = {Germany, France, Canada, USA}, Sex = {Total (both sexes)}, Age = {20-24}, and Time left open (we could filter a range of years or leave it to get all years). Let’s assume we want all available years for those filters.
From the structure, we can identify the codes we need:
- Germany = “DEU”, France = “FRA”, Canada = “CAN”, USA = “USA” (standard ISO 3-letter country codes, which the OECD likely uses here).
- Sex total = “MW”, standing for male and female combined.
- Age 20-24 = “2024”, the code for the 20-24 years age group.
- Time is left out of the filter, which likely means no filtering on that dimension (all available years are returned).
This gives the filter list filter_list <- list(c("DEU", "FRA", "CAN", "USA"), "MW", "2024"): the first element is the vector of country codes, the second selects both sexes combined, and the third selects the 20-24 age group.
Now we use get_dataset() with these filters:
# Define filters: Countries = DEU, FRA, CAN, USA; Sex = MW (both male & female); Age group = 20-24
filter_list <- list(c("DEU", "FRA", "CAN", "USA"),
"MW",
"2024")
# Retrieve the filtered data from dataset "DUR_D"
unemploymentOECD <- get_dataset(dataset = "DUR_D", filter = filter_list)
# Inspect the first 6 rows of the result
unemploymentOECD[1:6, ]
After running the above, unemploymentOECD will contain a data frame of the requested data. Each row will correspond to one combination of the dimensions we specified (country, sex, age, and time). Since we restricted sex to “MW” and age to “2024”, each country will have data for those categories over a series of years. The columns of the data frame typically include the dimensions and the measured value; for example, it might have columns like COUNTRY, SEX, AGE, Time, and Value (or similar naming). The first 6 rows might look like:
COUNTRY SEX AGE Time Value
1 CAN MW 2024 2005 12.3
2 CAN MW 2024 2010 14.1
3 CAN MW 2024 2015 10.7
4 DEU MW 2024 2005 8.5
5 DEU MW 2024 2010 7.9
6 DEU MW 2024 2015 6.4
...
Note: The above is illustrative; actual values and formatting may differ. The idea is that we get unemployment duration (perhaps measured in months or some index) for each country at age 20-24 for each year. We see Canada (CAN) and Germany (DEU) for the years 2005, 2010, 2015 in this snippet, with some values. The data likely goes through all years available up to the latest.
Using these functions, you can mix and match as needed: first find a dataset of interest (search_dataset()), then get its structure (get_data_structure()), then fetch data (get_dataset()). If you already know the dataset ID and the codes required, you can go straight to get_dataset() and supply the filters.
One thing to be aware of is that the OECD API might treat time as a separate dimension (like “TIME” or “Year”), or the year may simply appear as a column in the returned data frame. In the R package output, year/time is typically given explicitly as a column, as in the examples above (Time or year). You can usually filter time by adding an element to the filter list or by using the start_time and end_time arguments of get_dataset() (e.g., start_time = 2000, end_time = 2020 to restrict to years 2000–2020).
In our example, because we left time unspecified in the filter list, get_dataset() likely returned all years available for those parameters. We could instead have restricted the years with the start_time/end_time arguments.
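For instance, the same query restricted to 2000–2020 could be sketched as follows (using the start_time/end_time arguments and the filter list defined earlier):

```r
# Same query as before, but restricted to the years 2000-2020
unemploymentRecent <- get_dataset(dataset = "DUR_D",
                                  filter = filter_list,
                                  start_time = 2000,
                                  end_time = 2020)
```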
tl;dr
# Loading OECD package
library(OECD)
# List all available datasets
dataset_list <- get_datasets()
# Search all datasets with the term "unemployment" in their description
search_dataset("unemployment", data = dataset_list)
# Examine the structure of a specific dataset (e.g., "DUR_D")
dstruc <- get_data_structure("DUR_D")
str(dstruc, max.level = 1)
# Define a filter to narrow the data (e.g., countries=DEU,FRA,CAN,USA; Sex=MW; Age=2024)
filter_list <- list(c("DEU", "FRA", "CAN", "USA"), "MW", "2024")
# Retrieve the dataset "DUR_D" with the specified filter
unemploymentOECD <- get_dataset(dataset = "DUR_D", filter = filter_list)
unemploymentOECD[1:6, ]
(The above code shows the steps to search for a dataset related to “unemployment”, inspect its structure, and download a subset of that data. In practice, replace the search term and dataset ID with those relevant to your needs. Always adjust the filter list according to the actual dimensions of the dataset you are querying.)
7.3 spiR
Database Description
The Social Progress Index (SPI) is a comprehensive measure of a country’s social and environmental performance, independent of economic metrics. It was developed between 2009 and 2013 by the nonprofit organization Social Progress Imperative, with input from scholars and experts (including Michael Porter and others) to better capture human well-being and societal progress. The index is composed of 52 indicators that collectively measure how well countries provide for the essential needs of their citizens, establish the building blocks that allow citizens to improve their lives, and create conditions for all individuals to reach their full potential.
The 52 indicators of the SPI are grouped into three broad dimensions:
- Basic Human Needs: This includes indicators related to nutrition and basic medical care, water and sanitation, shelter, personal safety, etc. (Things like access to food, clean water, safe housing, and security are fundamental needs).
- Foundations of Well-being: This covers education, access to information, health and wellness, and environmental quality (e.g., literacy rates, school enrollments, access to technology and information, life expectancy, pollution levels, etc.).
- Opportunity: This dimension looks at personal rights, personal freedom and choice, inclusiveness, and access to advanced education (for example, indicators on political rights, freedom from discrimination, access to higher education, corruption, etc.).
Each of these three dimensions is further broken down into components, and each component is measured by several specific indicators. All 52 indicators together roll up into an overall Social Progress Index score for each country. The SPI is usually reported on an annual basis (in early years, not all countries were covered, but it has expanded over time). The goal of SPI is to provide a more holistic measure beyond GDP to evaluate how well societies are doing in converting economic gains into improved social outcomes.
The spiR package provides an interface to the Social Progress Imperative’s data, allowing R users to retrieve SPI data and related information. It essentially wraps the SPI API (or data source) to make it easy to get data on various countries and indicators. This includes retrieving the overall SPI scores as well as specific component indicators if needed.
Functions
The spiR package offers a few key functions to work with the Social Progress Index data:
- spir_country() – Search for countries and retrieve their ISO codes as used by the SPI database.
- spir_indicator() – Search for indicators in the SPI database and retrieve their codes (such as the code for a specific indicator or for the overall SPI).
- spir_data() – Download the actual data for specified countries, years, and indicators from the SPI dataset.
These functions are designed to help you find the correct arguments (country codes, indicator codes) and then pull the data. Let’s go through them one by one with examples.
spir_country()
To request data from the SPI API, you will need to use country codes (standardized, likely ISO 3-letter country codes). The function spir_country() helps you find the correct country code for a given country name. It takes a country name (or partial name) as an argument and returns the matching country code(s) used in the SPI system.
For example, suppose we want to get data for Canada. We should confirm what code the SPI uses for Canada. We can do:
# Loading the spiR package
library(spiR)
# Get the ISO country code for "Canada"
myCountry <- spir_country("Canada")
myCountry
If we run this, myCountry will contain the result of the search. Since “Canada” is a unique match, it will return a data frame or vector with Canada’s code. In many datasets, Canada’s code is “CAN” (the ISO 3-letter code). The output might look like:
country_name iso3
1 Canada CAN
So, spir_country("Canada") yields “CAN” as the code. If you search a more ambiguous string, like spir_country("United"), you might get multiple matches (e.g., United States, United Kingdom, United Arab Emirates, etc., each with its code). If you call spir_country() with no argument, it may list all available countries and their codes.
Knowing the country codes is helpful because the main data function spir_data() expects country codes as input.
spir_indicator()
Similarly, we need to know the indicator codes for the data we want. The SPI has one overall index (often code “SPI”) and many sub-indicators (each likely with its own code or name). The function spir_indicator() allows you to search the indicators by a keyword.
For instance, if we wanted to find indicators related to mortality (perhaps there’s an indicator about child mortality or something in the Basic Needs dimension), we could search:
# Search for an indicator containing "mortality"
myIndicator <- spir_indicator("mortality")
myIndicator
This will return any indicators whose name or description contains “mortality”. Suppose one of the SPI indicators is “Maternal Mortality Rate” or “Under-5 Mortality Rate”; the search might find it. The output could be something like:
indicator_name indicator_code
1 "Under 5 Mortality Rate (per 1,000 live births)" "mortality_u5"
2 "Maternal Mortality Rate (per 100,000 live births)" "mortality_maternal"
...
(Exact codes and names are hypothetical here, for illustration.) The idea is that spir_indicator() will provide you the code string that you need to use to request that indicator’s data.
If you call spir_indicator() without any argument, it may list all 52 indicators and their codes. This can be useful for seeing the whole catalog of what’s available, including the overall “SPI” score and all components.
For the purposes of this chapter, let’s assume we are interested in the overall Social Progress Index itself, which likely has the indicator code "SPI" for the aggregate index. That is what we’ll use in the example for data retrieval.
spir_data()
The function spir_data() is the main function to get actual data from the Social Progress Index database. It requires a few arguments:
- country: one or more country codes (using the codes we found via spir_country()). This is usually a character vector, e.g., c("USA", "FRA", "BRA").
- years: one or more years of interest (as character strings). You can specify a range or specific years, e.g., c("2014", "2015", "2016"). The SPI data in early years might not cover every year, but from 2014 onward there are annual releases (SPI 2014, SPI 2015, etc.).
- indicators: one or more indicator codes that you want to retrieve. For example, "SPI" for the overall index, or a specific code for a sub-index or component if desired.
Given these arguments, spir_data() fetches the data and returns it as a data frame in which each row corresponds to a country-year combination, with columns for the country, year, and indicator values.
Let’s retrieve the overall Social Progress Index from 2014 through 2019 for the USA, France, Brazil, China, South Africa, and Canada. The indicator code for the main index is "SPI" (we could confirm this via spir_indicator("Social Progress Index") if needed), and the country codes are USA, FRA, BRA, CHN, ZAF, and CAN.
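The call itself (also shown in the tl;dr at the end of this section) is:

```r
# Overall SPI for six countries, 2014-2019; years passed as strings
myData <- spir_data(country = c("USA", "FRA", "BRA", "CHN", "ZAF", "CAN"),
                    years = c("2014", "2015", "2016", "2017", "2018", "2019"),
                    indicators = "SPI")
head(myData)
```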
Once this runs, myData will contain the SPI values for those countries and years. Note that we passed the years as quoted character strings: years could often be numeric, but this API expects them as text, and the quotes ensure they are treated that way. The output of head(myData) might look like:
country iso3 indicator year value
1 Brazil BRA SPI 2014 67.91
2 Brazil BRA SPI 2015 69.58
3 Brazil BRA SPI 2016 70.40
4 Brazil BRA SPI 2017 71.00
5 Brazil BRA SPI 2018 72.29
6 Brazil BRA SPI 2019 72.89
This excerpt shows one country (Brazil) across 2014–2019, with the SPI score (on a 0–100 scale) for each year. In the actual myData, all requested countries would be present, so there would also be rows for the USA, France, and so on. The columns include the country name, the ISO3 country code, the indicator (SPI), the year, and the value (the score).
You could then use this data frame to compare how different countries’ social progress scores have changed over time. For instance, you might plot each country’s SPI over the years.
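As a sketch of such a plot, assuming the country/year/value column layout shown above (Brazil’s numbers come from the printout; Canada’s are invented for illustration):

```r
# Toy data shaped like the head(myData) output above; Brazil's values
# are from the printout, Canada's are made up for this demo.
spi <- data.frame(
  country = rep(c("Brazil", "Canada"), each = 3),
  year    = rep(2014:2016, times = 2),
  value   = c(67.91, 69.58, 70.40, 86.1, 86.5, 86.9)
)

# Base R: draw one line per country on shared axes
plot(value ~ year, data = spi, type = "n",
     xlab = "Year", ylab = "Social Progress Index")
for (cty in unique(spi$country)) {
  lines(value ~ year, data = subset(spi, country == cty))
}
```

With the real myData you would drop the toy data frame and plot the retrieved values directly.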
The spir_data() function can also retrieve multiple indicators at once if you provide a vector of indicator codes. API wrappers like this typically return a long format: each row holds a single indicator for a single country-year, with an indicator-code column and a value column, as shown above. Since we requested only "SPI", we see one row per country-year; had we requested additional indicators, we would likely see multiple rows per country-year (one per indicator), unless the function pivots them into columns. Check the documentation for the exact behavior; in any case, you can always reshape the data after retrieval.
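For instance, a long table can be pivoted to one column per indicator with base R’s reshape(); the second indicator code here ("basic_needs") is invented for illustration:

```r
# Toy long-format table: one row per country-year-indicator;
# "basic_needs" is a hypothetical indicator code for the demo
long <- data.frame(
  iso3      = c("BRA", "BRA", "CAN", "CAN"),
  year      = c(2019, 2019, 2019, 2019),
  indicator = c("SPI", "basic_needs", "SPI", "basic_needs"),
  value     = c(72.89, 80.1, 89.0, 95.2)
)

# Pivot to wide: the values of 'indicator' become column suffixes
wide <- reshape(long, idvar = c("iso3", "year"),
                timevar = "indicator", direction = "wide")
wide  # columns: iso3, year, value.SPI, value.basic_needs
```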
Finally, if you want to explore more of what you can do with SPI data, the spiR package documentation and the in-depth application referenced on warin.ca provide further examples, such as building dashboards or visualizations similar to those on the Social Progress Imperative’s own site.
tl;dr
# Loading the spiR package
library(spiR)
# Find the ISO code for a specific country (e.g., Canada)
myCountry <- spir_country("Canada")
myCountry # should return "CAN" for Canada
# Search for an indicator by keyword (e.g., "mortality")
myIndicator <- spir_indicator("mortality")
myIndicator # returns any indicator codes that include "mortality"
# Retrieve data for the Social Progress Index (SPI) for selected countries and years
myData <- spir_data(country = c("USA", "FRA", "BRA", "CHN", "ZAF", "CAN"),
years = c("2014","2015","2016","2017","2018","2019"),
indicators = "SPI")
head(myData)
(The above code demonstrates how to look up country and indicator codes and how to fetch the overall SPI data for a set of countries over a range of years. In practice, use spir_indicator() to find other indicators and replace "SPI" with the code of any sub-indicator you wish to retrieve.)
7.4 statcanR
Database Description
Statistics Canada (often abbreviated as StatCan) is the national statistical agency of Canada. It produces a vast amount of data on Canada’s economy, society, and environment. The Statistics Canada open data portal includes data on about 30 broad subjects, including agriculture, energy, environment, education, health, economics, demographics, and more. These data are available at various geographic levels, such as national (Canada), provincial/territorial, metropolitan areas, etc., depending on the dataset.
Historically, much of StatCan’s data was accessible via something called CANSIM tables (Canadian Socio-economic Information Management system). In recent years, they modernized their platform and now refer to data tables by a Product ID (PID) code like “27-10-0014-01”. The StatCan Open Data API (also known as the Web Data Service) allows programmatic access to these tables. Each table’s data can be retrieved by referencing its table number or product ID.
The statcanR package provides a user-friendly way for R users to access Statistics Canada data. It wraps the web service API so that, given a table ID, it fetches the data and returns it as an R data frame or tibble. This saves you from manually downloading CSV files or writing your own HTTP requests.
However, one challenge is that you need to know the table’s ID before you can fetch it (the API requires a specific table number). The statcanR workflow typically involves two steps:
- Find the table ID for the data you want. This is usually done with the Statistics Canada website’s search, since statcanR itself does not provide a search function (as of the information we have). Use the StatCan data portal or an online search to identify the table number for your topic of interest.
- Use statcan_data() to retrieve that table’s data. Once you have the table ID, call the function and it returns the data.
Let’s go through these steps with an example.
Functions
The main function in statcanR we will use is statcan_data(). Before calling it, we usually need to find the table ID (unless we already know it). So we can think in terms of:
- Searching for data (the table ID) outside the package, via the StatCan website or an index of table IDs.
- statcan_data(), the function that fetches data given a table ID and language.
(There is no dedicated search function like statcan_search() in this package, as far as the provided material suggests; instead, the guidance is to use the website to find the ID.)
Search for data (Finding the table ID)
To find a Statistics Canada table ID, use the official StatCan data portal search at https://www150.statcan.gc.ca/n1/en/type/data?MM=1. For example, say we are interested in “federal expenditures on science and technology by socio-economic objectives.” Typing a few keywords such as “federal expenditures science technology socio-economic objective” into that search page should bring up relevant results.
Doing that search turns up a table with exactly that description: “Federal expenditures on science and technology by socio-economic objectives,” table number 27-10-0014-01. This is the unique identifier for that dataset.
StatCan table IDs follow the format two digits, two digits, four digits, two digits, separated by hyphens. For 27-10-0014-01:
- The first two digits (27) likely represent a subject category.
- The next two (10) are probably a sub-category, or simply part of the coding.
- 0014 is the specific table number, and 01 likely indicates the version (tables whose structure is updated over time get a new final pair of digits).
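That pattern is easy to check mechanically. Here is a small, self-contained illustration (a helper we are writing ourselves, not part of statcanR):

```r
# Validate the StatCan table ID pattern:
# 2 digits - 2 digits - 4 digits - 2 digits
is_statcan_id <- function(id) {
  grepl("^[0-9]{2}-[0-9]{2}-[0-9]{4}-[0-9]{2}$", id)
}

is_statcan_id("27-10-0014-01")  # TRUE
is_statcan_id("27-10-0014")     # FALSE (missing the trailing version)
```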
Once we have this ID, we are ready to fetch the data via the API.
statcan_data()
The function statcan_data(table, lang) fetches the data for a given table number. It has two main arguments:
- The first argument is the table number as a string, e.g., "27-10-0014-01".
- The second argument is the language of the data: "eng" for English or "fra" for French. StatCan publishes data in both official languages, so the table content (column names and category labels) can be fetched in either.
In our example, we’ll use the table ID we found, "27-10-0014-01", and request the data in English.
# Loading the statcanR package
library(statcanR)
# Fetch data from Statistics Canada table 27-10-0014-01 in English
mydata <- statcan_data("27-10-0014-01", lang = "eng")
# Examine the first few rows of the data
head(mydata)
After running this, mydata contains the data frame for that table, and head(mydata) shows the first several rows. The structure depends on the table, but StatCan tables are typically in long format: each row is one combination of the classification dimensions together with a value. Columns usually include REF_DATE (the time period, e.g., the year), dimension columns such as geography and category, and a VALUE column holding the numeric value.
For instance, since this table covers federal expenditures on S&T by socio-economic objective, the dimensions are likely the year and the objective category. We might see columns like:
REF_DATE GEO Objective VALUE
2018 Canada Defence 500.0
2018 Canada Economic development 300.0
2018 Canada ... ...
2019 Canada Defence 520.0
...
(This is hypothetical data to illustrate the format.) Essentially, each row is one category of expenditure in a given year, with the value being the amount spent (likely in millions of dollars). The actual table has its own specific breakdown.
One nice thing is that statcan_data() likely returns a tibble (a modern type of data frame in R) with proper column names and labels in English, since we requested lang = "eng". Had we requested French (lang = "fra"), the labels for the objectives, and possibly the column names, would appear in French.
At this point, we have the data needed and can proceed to analyze or visualize it. For example, we could sum up various objectives or see trends over time.
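As a sketch of such an aggregation, using the hypothetical column names from the printout above (REF_DATE, Objective, VALUE) and made-up numbers:

```r
# Toy stand-in for the retrieved table; the real table's column
# names and categories may differ from these assumed ones
mydata_demo <- data.frame(
  REF_DATE  = c(2018, 2018, 2019, 2019),
  Objective = c("Defence", "Economic development",
                "Defence", "Economic development"),
  VALUE     = c(500, 300, 520, 310)
)

# Total S&T expenditure per year, summed across objectives
totals <- aggregate(VALUE ~ REF_DATE, data = mydata_demo, FUN = sum)
totals
#   REF_DATE VALUE
# 1     2018   800
# 2     2019   830
```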
The statcanR package makes it straightforward to get the latest data for a table without manually downloading a CSV from the website. Moreover, if the table is updated with new data (for a new year, say), running statcan_data() again later will fetch the updated data, assuming the table ID remains the same.
One should note: to use the StatCan API without this package, you would normally have to know the API endpoint and handle the CSV or JSON yourself. statcanR abstracts that away; you just provide the table ID.
If you are curious for more, the statcanR documentation and the referenced blog post have further examples (such as dealing with very large tables or manipulating the results). But the essential workflow is covered here: find the table ID and use statcan_data().
tl;dr
# Loading the statcanR package
library(statcanR)
# Use statcan_data() to retrieve a specific table by its ID (English version)
mydata <- statcan_data("27-10-0014-01", "eng")
# View the first few rows of the retrieved data
head(mydata)
(In practice, replace "27-10-0014-01" with the table ID of the dataset you need from Statistics Canada. Use "eng" for English or "fra" for French as the language argument. Remember to find the table ID via the StatCan website or documentation before using this function.)
7.5 EpiBibR
Database Description
The EpiBibR package is an R wrapper designed to provide easy access to a large bibliographic dataset, particularly focused on COVID-19 and other related medical research references. During the COVID-19 global crisis, the volume of scientific literature on the topic skyrocketed. Having a comprehensive bibliographic database of COVID-19 research (articles, preprints, letters, news articles, etc.) is valuable for researchers conducting literature reviews, trend analysis, or bibliometric studies. EpiBibR was created to make over 100,000 such references available directly through R.
In essence, EpiBibR contains (or connects to) a database of bibliographic entries (like what you’d find in PubMed or other scholarly databases) that are related to epidemiology and specifically COVID-19. Each entry in the database includes various fields, such as authors, title, abstract, publication year, journal, etc. This is analogous to having a huge bibliography or library catalog that you can query with code.
To give an idea of what information each reference record contains, here are some of the fields available (with their typical tags):
- AU – Authors (the list of authors of the paper)
- TI – Title of the document
- AB – Abstract of the paper
- PY – Publication Year
- DT – Document Type (e.g., Article, Letter, News, etc.)
- MESH – Medical Subject Headings (keywords/topics assigned)
- TC – Times Cited (citation count, if available)
- SO – Source (publication name, e.g., journal or news source)
- J9 – Source abbreviation
- JI – ISO source abbreviation
- ISSN – International Standard Serial Number (journal identifier)
- VOL, ISSUE – Volume and Issue number (for journal articles)
- ID – PubMed ID (if applicable)
- DE – Authors’ Keywords (keywords given by authors)
- UT – Unique Article Identifier (possibly Web of Science ID or similar)
- AU_CO – Author’s Country of Origin (which might be derived from author affiliations)
- DB – Database from which the record is sourced (e.g., which bibliographic database)
The above is a lot of information – essentially, EpiBibR is giving you a bibliographic dataset akin to a large reference manager file that you can query. The typical use would be: you query the data for certain criteria (like author name, year, keywords, etc.) and get back a subset of references matching those criteria.
Functions
The EpiBibR package provides a main function for data retrieval and allows filtering by various fields:
- epibibr_data(): the primary function to retrieve bibliographic references, with arguments that filter by author, country, year, title keywords, abstract keywords, and source (journal name), among others.
The usage of epibibr_data() is very flexible: you can provide no filters, one, or several. Without any arguments it returns the entire dataset (which is huge); with arguments it filters accordingly.
Let’s go through some examples:
epibibr_data()
- Retrieving the entire dataset: If you call epibibr_data() with no arguments, it retrieves the entire bibliographic data frame, containing all references (over 100,000 entries). For example:
library(EpiBibR)
complete_data <- epibibr_data()
This populates complete_data with the entire database. The result can be quite large in memory, so you usually only do this if you truly need everything; otherwise, retrieve a subset based on some criteria.
- Filtering by author: To get all references authored by a certain person, use the author argument. For example, to get all articles written by someone with the last name Colson (as in Philippe Colson, a microbiologist who authored many COVID-19 papers):
colson_articles <- epibibr_data(author = "Colson")
This searches the author field for “Colson” and returns all entries where at least one author matches that name. Each row of colson_articles includes all the fields (title, year, etc.) for a paper with Colson as an author. You might then check nrow(colson_articles) to see how many such papers are in the database.
- Filtering by author and year: Filters can be combined to narrow the search. For example, to get references authored by someone named Yang in the year 2020:
yang2020 <- epibibr_data(author = "Yang", year = "2020")
This returns all records where an author’s name contains “Yang” and the publication year is 2020; the result contains only references meeting both criteria (a logical AND between filters).
- Filtering by author’s country: To find references based on the country of origin of the authors (derived from author affiliations), use the country argument. For example, to get all references with at least one author from Canada:
canada_articles <- epibibr_data(country = "Canada")
This searches the author address/affiliation field for “Canada” and returns those references.
- Filtering by title keyword: To find articles with a certain keyword in the title, use the title argument. For example, to get all references whose titles contain “covid”:
covid_articles <- epibibr_data(title = "covid")
This will return a great many references, since “COVID” appears in so many titles. The search is likely case-insensitive and matches the substring “covid” anywhere in the title.
- Combining multiple criteria: Several arguments can be given at once, and the result satisfies all of them. For instance, references authored by someone named “Yang”, with “covid” in the title, published in 2020:
yangcovid2020_articles <- epibibr_data(author = "Yang", title = "covid", year = "2020")
This finds references where all three conditions hold. We could add even more criteria, for example a source.
- Adding source as another filter: The source argument filters by publication source (such as a journal or preprint server name). For example, to refine the search above to sources whose name contains “bio” (perhaps “bioRxiv” or a journal with “Biology” in its title):
yangcovid2020bio_articles <- epibibr_data(author = "Yang", title = "covid", year = "2020", source = "bio")
Now references must meet all four filters, which narrows the results considerably: papers by authors named Yang, published in 2020, about COVID, in biology-related outlets.
- Filtering by abstract keyword: You can also search within the abstract text using the abstract argument. For example, to find references that mention “coronavirus” in the abstract:
coronavirus_articles <- epibibr_data(abstract = "coronavirus")
This returns references whose abstracts contain the word “coronavirus”, a powerful way to find papers about coronaviruses even when the title doesn’t say so explicitly.
All these filters can be used standalone or in combination. The result of any epibibr_data() call is a data frame (or tibble) of the matching references: each row is one reference, with columns corresponding to fields such as AU, TI, AB, PY, and SO. Since the package was designed to integrate with the bibliometrix package, it likely preserves the standard field tags so that bibliometrix can read the data directly.
One thing to keep in mind is that text searches (like title = "covid") match anywhere in the field, so “COVID-19” or “covid19” would match because “covid” is a substring. Similarly, author = "Yang" matches any author name containing those letters, not only the exact surname (a name like “Yangus” could match too). The specifics depend on the implementation; most likely it performs a case-insensitive substring search.
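That kind of matching can be illustrated with base R’s grepl(); note this is an assumption about how the package filters text, not its confirmed internals:

```r
# Case-insensitive substring matching, as epibibr_data() presumably
# applies to text fields (an assumption, not verified in the source)
titles <- c("COVID-19 in Wuhan", "A covid19 case report",
            "Influenza surveillance")

grepl("covid", titles, ignore.case = TRUE)
# [1]  TRUE  TRUE FALSE
```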
The ability to combine filters means you can tailor very specific queries, which is great for slicing the data (for example, find how many papers a particular author wrote in a given year on a certain topic).
One caution: if you combine filters whose results don’t overlap, you get zero rows. For instance, if no author named Yang wrote a COVID-titled paper in 2020 in a source with “bio” in its name, then yangcovid2020bio_articles would be empty (0 rows).
EpiBibR essentially puts a research literature database at your fingertips. A researcher could use it for trend analysis (how many COVID papers per year), collaboration network analysis (using authors and their countries), or topic modeling on abstracts.
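As a tiny example of such a trend count, here is a tally of references per publication year via the PY field described earlier (the rows are invented for the demo; real results will differ):

```r
# Count references per publication year; PY is the publication-year
# field from the field list above, and these rows are made up
refs <- data.frame(PY = c("2019", "2020", "2020", "2020", "2021"))

table(refs$PY)
#
# 2019 2020 2021
#    1    3    1
```

With real data, the same one-liner applied to a retrieved subset (e.g., covid_articles$PY) gives the publication trend for that query.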
tl;dr
# Loading the EpiBibR package
library(EpiBibR)
# Retrieve the entire dataset (all references) - large output
complete_data <- epibibr_data()
# Examples of filtered searches:
colson_articles <- epibibr_data(author = "Colson") # all references with an author named Colson
yang2020 <- epibibr_data(author = "Yang", year = "2020") # references authored by "Yang" in the year 2020
canada_articles <- epibibr_data(country = "Canada") # references where an author's country is Canada
covid_articles <- epibibr_data(title = "covid") # references with "covid" in the title
yangcovid2020_articles <- epibibr_data(author = "Yang", title = "covid", year = "2020")
# references that satisfy: author name has "Yang", title has "covid", and year is 2020
yangcovid2020bio_articles <- epibibr_data(author = "Yang", title = "covid", year = "2020", source = "bio")
# further narrows above: in addition, source contains "bio" (perhaps BioRxiv or similar)
coronavirus_articles <- epibibr_data(abstract = "coronavirus")
# references with "coronavirus" in the abstract
(The above code block demonstrates how to use epibibr_data() in various ways: with no filters (all data) and with different combinations of filters for author, year, country, title, source, and abstract. These are examples; adjust the strings to search for different authors, keywords, or years, depending on your research needs.)