Chapter 12 API and Packages

12.1 Introduction

Interesting data sets are the fuel of a good data science project. Downloading CSV files from different websites is a workable way to get free data for a project, but APIs (Application Programming Interfaces) are another very common way to access interesting and free data. In this chapter, we therefore focus on using APIs to get data.

At the end of the chapter, you should be able to:

know what an argument is
import data using an API
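
Since the first objective is to know what an argument is, here is a minimal sketch in base R. The function growth_rate() and its argument names are hypothetical and exist only for this illustration. An argument is a value passed to a function, matched either by position or by name, and it may have a default value.

```r
# Hypothetical function with two required arguments and one optional argument
growth_rate <- function(start_value, end_value, digits = 2) {
  # Percentage change between the two values, rounded to `digits` decimals
  round((end_value - start_value) / start_value * 100, digits)
}

# Arguments matched by position
by_position <- growth_rate(200, 250)

# Arguments matched by name (the order no longer matters)
by_name <- growth_rate(end_value = 250, start_value = 200)

# Overriding the default value of the optional argument
rounded <- growth_rate(3, 4, digits = 0)

print(c(by_position, by_name, rounded))
```

Every API function in this chapter is called the same way: the inputs described in the text are passed as named arguments.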

12.2 WDI

12.2.1 Database description

The World Development Indicators is a compilation of relevant, high-quality, and internationally comparable statistics about global development and the fight against poverty. The database contains 1,600 time series indicators for 217 economies and more than 40 country groups, with data for many indicators going back more than 50 years.

12.2.2 Functions

The WDI package gives access to all indicators provided by the World Bank. The two functions listed below let you search for and download specific data from the WDI database.

WDIsearch()
WDI()

Each of these functions is detailed in this section, with examples.

12.2.2.1 WDIsearch()

The function WDIsearch() takes a character string as input and returns the list of indicators that contain this string.

For example, suppose we want all indicators in the WDI database that use the term “GDP”.

# Loading the WDI package
library(WDI)

# Search all indicators with the term "GDP"
listOfIndicators <- WDIsearch("GDP")

# List the first 5 indicators
listOfIndicators[1:5,]
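
A search often returns many rows, so it can help to narrow the result down further. The sketch below assumes the search result behaves like a data frame with two columns named indicator and name. To stay runnable without internet access, it uses a small hand-typed table, mock_search, as a stand-in for the object returned by WDIsearch().

```r
# Hand-typed stand-in for a search result (assumed columns: indicator, name)
mock_search <- data.frame(
  indicator = c("NY.GDP.MKTP.CD", "NY.GDP.PCAP.CD",
                "NY.GDP.MKTP.KD.ZG", "CM.MKT.TRAD.GD.ZS"),
  name = c("GDP (current US$)", "GDP per capita (current US$)",
           "GDP growth (annual %)", "Stocks traded, total value (% of GDP)"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose name mentions "per capita" (case-insensitive)
is_per_capita <- grepl("per capita", mock_search$name, ignore.case = TRUE)
per_capita <- mock_search[is_per_capita, ]

# The indicator code is the value to pass on when downloading the data
per_capita_code <- per_capita$indicator
print(per_capita_code)
```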

12.2.2.2 WDI()

The function WDI() takes as input the indicator’s code and the countries wanted, and returns the value of the indicator for the selected countries. To restrict the period, add the starting year and the ending year of the data as inputs.

For example, it would be interesting to evaluate the total value of stocks traded as a percentage of GDP (CM.MKT.TRAD.GD.ZS) for 4 countries (France - FR; Canada - CA; USA - US; China - CN) from 2000 to 2016. This can be obtained by calling the function WDI() with the following inputs:

indicator = "CM.MKT.TRAD.GD.ZS"
country = c("FR", "CA", "US", "CN")
start = 2000
end = 2016

library(WDI)

# Access and store data concerning Stocks traded in total value (% of GDP)
stockTraded <- WDI(indicator = "CM.MKT.TRAD.GD.ZS", country = c("FR", "CA", "US","CN"), start = 2000, end = 2016)

head(stockTraded)
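
Once downloaded, the data come in long format, with one row per country and year. As a rough sketch of a first summary, the snippet below averages the indicator per country. It assumes the columns iso2c, country, year and one column named after the indicator code. The values in mock_wdi are made-up placeholders, not real World Bank figures, so that the example runs offline.

```r
# Made-up placeholder values, NOT real World Bank figures
mock_wdi <- data.frame(
  iso2c = c("FR", "FR", "CA", "CA"),
  country = c("France", "France", "Canada", "Canada"),
  year = c(2000, 2001, 2000, 2001),
  CM.MKT.TRAD.GD.ZS = c(10, 20, 30, 50),
  stringsAsFactors = FALSE
)

# Average of the indicator per country over the period
avg_by_country <- aggregate(CM.MKT.TRAD.GD.ZS ~ country,
                            data = mock_wdi, FUN = mean)
print(avg_by_country)

# Rows for a single year
year_2001 <- mock_wdi[mock_wdi$year == 2001, ]
```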

Access a more in-depth application of the WDI API here.

12.2.3 tl;dr

#Loading the WDI library
library(WDI)

# Search all indicators with the term "GDP"
listOfIndicators <- WDIsearch("GDP")

# List the first 5 indicators
listOfIndicators[1:5,]

stockTraded <- WDI(indicator = "CM.MKT.TRAD.GD.ZS", country = c("FR", "CA", "US","CN"), start = 2000, end = 2016)

head(stockTraded)

12.3 OECD

12.3.1 Database description

The Organisation for Economic Co-operation and Development (OECD) database contains almost 300 indicators under 12 categories, including agriculture, finance, health and education.

12.3.2 Functions

search_dataset()
get_data_structure()
get_dataset()

Each of these functions is detailed in this section, with examples.

12.3.2.1 search_dataset()

The function search_dataset() searches for OECD indicators. It takes as input a search string describing the data you need, together with the list of available datasets returned by get_datasets(), and returns a table of all the available entries related to that string.

# Loading OECD library
library(OECD)

# List all available datasets
dataset_list <- get_datasets()

# Search all indicators with the term "unemployment"
search_dataset("unemployment", data = dataset_list)

12.3.2.2 get_data_structure()

The function get_data_structure() takes as input the id associated with a dataset and returns the structure of that dataset, which is what you need to build a query with the OECD package.

# Structure of a query
dstruc <- get_data_structure("DUR_D")
str(dstruc, max.level = 1)

12.3.2.3 get_dataset()

The function get_dataset() takes as input the dataset id and a list of filters. Storing the filters in a variable first keeps the function call simple. get_dataset() returns a data frame containing the selected data.

# Filters used to narrow the search (Germany-France-Canada-USA; male and female; 20-24 years old)
filter_list <- list(c("DEU", "FRA", "CAN", "USA"), "MW", "2024")

# Dataframe containing selected data
unemployementOECD <- get_dataset(dataset = "DUR_D", filter = filter_list)
unemployementOECD[1:6,]
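
Each element of the filter list corresponds to one dimension of the dataset, in order. The sketch below shows how such a list could be flattened into a single query key. It assumes the SDMX-style convention in which codes within a dimension are joined by "+" and dimensions are joined by ".". This is an illustration of the idea and an assumption about the format, not a description of the package’s internals.

```r
# Same filter list as above: countries, sex, age group
filter_list <- list(c("DEU", "FRA", "CAN", "USA"), "MW", "2024")

# Join the codes of each dimension with "+", then join the dimensions with "."
codes_per_dimension <- sapply(filter_list, paste, collapse = "+")
filter_key <- paste(codes_per_dimension, collapse = ".")

print(filter_key)
```

Reading the key from left to right makes it easy to check that each filter sits in the position of the dimension it is meant for.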

12.3.3 tl;dr

# Loading OECD library
library(OECD)

# List all available datasets
dataset_list <- get_datasets()

# Search all indicators with the term "unemployment"
search_dataset("unemployment", data = dataset_list)

# Structure of a query
dstruc <- get_data_structure("DUR_D")
str(dstruc, max.level = 1)

# Filters used to narrow the search (Germany-France-Canada-USA; male and female; 20-24 years old)
filter_list <- list(c("DEU", "FRA", "CAN", "USA"), "MW", "2024")

# Dataframe containing selected data
unemployementOECD <- get_dataset(dataset = "DUR_D", filter = filter_list)
unemployementOECD[1:6,]

12.4 spiR

12.4.1 Database description

The Social Progress Index measures a country’s human development. It was developed between 2009 and 2013 and is published by the Social Progress Imperative.

The index is based on three dimensions covering 52 indicators:

Basic Human Needs, based on food, health, sanitation, housing, access to electricity, security, etc.
Foundations of Well-being, based on literacy, education, access to media, life expectancy, suicide rate, obesity, pollution, environment, etc.
Opportunity, based on political rights, property rights, corruption, social tolerance, access to higher education, etc.

12.4.2 Functions

This package lets you recreate impactful dashboards and visualizations like the ones published by the Social Progress Imperative. It provides one main function, spir_data(), which extracts the data in a convenient format, and two helper functions, spir_country() and spir_indicator(), which help you find the appropriate arguments for spir_data().

spir_country()
spir_indicator()
spir_data()

Some examples are provided below.

12.4.2.1 spir_country()

This function lets you search for the country code associated with the Social Progress Index data. If no argument is supplied, all countries are displayed.

#Loading the spiR package
library(spiR)

#Get the ISO code for a specific country
myCountry <- spir_country("Canada")
myCountry

12.4.2.2 spir_indicator()

This function lets you search for the code of the Social Progress Index indicator you want to use. If no argument is supplied, all indicators are displayed.

#Search for an indicator
myIndicator <- spir_indicator("mortality")
myIndicator

12.4.2.3 spir_data()

First, the function spir_data() takes as input the countries we are interested in. We specify this argument with the countries’ ISO codes, as such: c("USA", "FRA", "BRA", "CHN", "ZAF", "CAN"). The second argument specifies the years for which we want data. Finally, we specify the Social Progress indicator we would like to extract.

For example, let’s take a look at the SPI indicator (Social Progress Index) for the countries listed above.

#Extracting the data
myData <- spir_data(country = c("USA", "FRA", "BRA", "CHN", "ZAF", "CAN"),
                    years = c("2014","2015","2016", "2017", "2018", "2019"),
                    indicators = "SPI")
head(myData)
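
Typing every year by hand becomes tedious for longer periods. A small sketch of how the same arguments can be built programmatically in base R is shown below. The variable names countries and years are only illustrative, and the final call is left commented out because it needs the spiR package and an internet connection.

```r
# ISO codes of the countries of interest
countries <- c("USA", "FRA", "BRA", "CHN", "ZAF", "CAN")

# Build the character vector "2014", ..., "2019" from a numeric range
years <- as.character(2014:2019)
print(years)

# The vectors can then be passed as arguments (requires spiR and internet access):
# myData <- spir_data(country = countries, years = years, indicators = "SPI")
```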

Access a more in-depth application of the spiR API here.

12.4.3 tl;dr

#Loading the spiR package
library(spiR)

#Get the ISO code for a specific country
mycountry <- spir_country("Canada")
mycountry

#Search for an indicator
myIndicator <- spir_indicator("mortality")
myIndicator

#Extracting the data
myData <- spir_data(country = c("USA", "FRA", "BRA", "CHN", "ZAF", "CAN"),
                    years = c("2014","2015","2016", "2017", "2018", "2019"),
                    indicators = "SPI")
head(myData)

12.5 statcanR

12.5.1 Database description

The Statistics Canada database covers about 30 subjects, including agriculture, energy, environment and education, at 5 geographical levels (Canada, provinces, CMAs, etc.).

12.5.2 Functions

statcanR provides the R user with a consistent process to collect data from Statistics Canada’s data portal. It gives access to all of Statistics Canada’s open economic data (formerly known as CANSIM tables), now identified by product IDs (PID) in Statistics Canada’s new Web Data Service.

This tutorial shows how to use the statcanR R package and its function statcan_data(). Using the package involves two steps: you first search for the desired table, and you then fetch the data with the statcan_data() function.

Search for data
statcan_data()

Some examples are provided below.

12.5.2.1 Search for data

To search for the desired information, Statistics Canada provides a search engine that gives us the number of the table we are looking for. If we were interested in federal expenditures on science and technology by socio-economic objectives, we would visit https://www150.statcan.gc.ca/n1/en/type/data?MM=1 and type a description of the data in the search box.

For this example the table number is ‘27-10-0014-01’. With the table number associated with our search, we can move on to extracting data with the API.

12.5.2.2 statcan_data()

The statcan_data() function takes as input the table number obtained earlier and the language in which the data are displayed (French or English). The lang argument is either "fra" or "eng".

For example, we can now extract the data associated with the federal expenditures on science and technology by socio-economic objectives.

#Loading the statcanR package
library(statcanR)

# Get data with statcan_data function
mydata <- statcan_data("27-10-0014-01", "eng")

head(mydata)
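
Because the table number is typed by hand, a typo is easy to make. The sketch below checks that a string has the NN-NN-NNNN-NN shape before it is sent to the API. The helper is_table_number() is hypothetical and not part of statcanR. The last step assumes that the product ID (PID) corresponds to the first 8 digits of the table number, which matches the example here but should be treated as an assumption.

```r
# Hypothetical helper: TRUE when the string looks like "NN-NN-NNNN-NN"
is_table_number <- function(x) {
  grepl("^[0-9]{2}-[0-9]{2}-[0-9]{4}-[0-9]{2}$", x)
}

table_number <- "27-10-0014-01"
valid <- is_table_number(table_number)

# Remove the hyphens and keep the first 8 digits (assumed to be the product ID)
product_id <- substr(gsub("-", "", table_number, fixed = TRUE), 1, 8)

print(valid)
print(product_id)
```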

Access a more in-depth application of the statcanR API here.

12.5.3 tl;dr

#Loading the statcanR package
library(statcanR)

# Get data with statcan_data function
mydata <- statcan_data("27-10-0014-01", "eng")

head(mydata)

12.6 EpiBibR

12.6.1 Database description

EpiBibR is an R wrapper that gives easy access to bibliographic data on Covid-19 and other medical references. In this global crisis, knowledge and open data can have an impact. In this regard, our team thought it could be valuable to make more than 100 000 references (journal articles, letters, news) available through R.

12.6.1.1 Features

Table 1. Features accessible through the package.

Field Tags	Descriptions	Field Tags	Descriptions
AU	Authors	ISSN	Source Code
TI	Document Title	VOL	Volume
AB	Abstract	ISSUE	Issue Number
PY	Year	LT	Language
DT	Document Type	C1	Author Address
MESH	Medical Subject Headings Vocabulary	RP	Reprint Address
TC	Times Cited	ID	PubMed ID
SO	Publication Name (or Source)	DE	Authors’ Keywords
J9	Source Abbreviation	UT	Unique Article Identifier
JI	ISO Source Abbreviation	AU_CO	Author’s Country of Origin
DI	Digital Object Identifier (DOI)	DB	Bibliographic Database

12.6.2 Functions

EpiBibR allows you to search bibliographic references using several arguments: author, author’s country of origin, year, keywords in the title, keywords in the abstract and source name. The function listed below lets you retrieve this information, and some examples are provided.

epibibr_data()

12.6.2.1 epibibr_data()

To get the entire bibliographic data frame, containing more than 80 000 references, use the epibibr_data() function.

complete_data <- epibibr_data()

It is often more helpful to search references by author name. For example, we will search for all the articles written by Philippe Colson.

colson_articles <- epibibr_data(author = "Colson")

You can also search by author’s name and year of publication.

yang2020 <- epibibr_data(author = "Yang", year = "2020")

Another interesting search would be by author’s country of origin.

canada_articles <- epibibr_data(country = "Canada")

It is also useful to search by keywords in the title.

covid_articles <- epibibr_data(title = "covid")

As you may have noticed, you can combine several arguments to refine your search. Let’s use 3 arguments this time, searching by author, title and year.

yangcovid2020_articles <- epibibr_data(author = "Yang", title = "covid", year = "2020")

We can easily use a fourth argument by adding a source.

yangcovid2020bio_articles <- epibibr_data(author = "Yang", title = "covid", year = "2020", source = "bio")

Finally, you can search for keywords in the abstract.

coronavirus_articles <- epibibr_data(abstract = "coronavirus")
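
If the complete data frame is already in memory, the same kind of refinement can be done locally instead of issuing a new call. The sketch below assumes the columns are named after the field tags of Table 1 (AU, TI, PY) and that matching is case-insensitive. The rows of mock_refs are hand-made placeholders, not real references, so that the example runs offline.

```r
# Hand-made placeholder rows, NOT real references
mock_refs <- data.frame(
  AU = c("YANG A", "YANG B", "SMITH C"),
  TI = c("PLACEHOLDER TITLE ON COVID",
         "PLACEHOLDER TITLE ON INFLUENZA",
         "PLACEHOLDER TITLE ON COVID"),
  PY = c("2020", "2020", "2019"),
  stringsAsFactors = FALSE
)

# Combine three conditions with &: author, keyword in the title, and year
keep <- grepl("yang", mock_refs$AU, ignore.case = TRUE) &
  grepl("covid", mock_refs$TI, ignore.case = TRUE) &
  mock_refs$PY == "2020"

yang_covid_2020 <- mock_refs[keep, ]
print(yang_covid_2020)
```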

12.6.3 tl;dr

library(EpiBibR)
epidata <- epibibr_data()

complete_data <- epibibr_data()

colson_articles <- epibibr_data(author = "Colson")

yang2020 <- epibibr_data(author = "Yang", year = "2020")

canada_articles <- epibibr_data(country = "Canada")

covid_articles <- epibibr_data(title = "covid")

yangcovid2020_articles <- epibibr_data(author = "Yang", title = "covid", year = "2020")

yangcovid2020bio_articles <- epibibr_data(author = "Yang", title = "covid", year = "2020", source = "bio")

coronavirus_articles <- epibibr_data(abstract = "coronavirus")

12.7 coronavirus

12.7.1 Database description

The coronavirus package provides a tidy format dataset of the 2019 Novel Coronavirus COVID-19 (2019-nCoV) epidemic. The raw data are pulled from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) Coronavirus repository.

12.7.2 Functions

This package gives access to a tidy format dataset of the 2019 Novel Coronavirus COVID-19 (2019-nCoV) epidemic. The functions below allow you to load and update the data.

data("coronavirus")
update_dataset()
refresh_coronavirus_jhu()

Each of these functions is detailed in this section, with examples.

12.7.2.1 data("coronavirus")

This is a basic example which shows you how to get the data:

library(coronavirus)

data("coronavirus")

This coronavirus dataset has the following fields:

head(coronavirus)
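
A tidy dataset of this kind is convenient to summarize. The sketch below totals the confirmed cases per country. It assumes the columns date, country, type and cases, and a type value of "confirmed". The rows of mock_cases are made-up placeholders, not real case counts, so that the example runs without the package.

```r
# Made-up placeholder rows, NOT real case counts
mock_cases <- data.frame(
  date = as.Date(c("2020-03-01", "2020-03-02", "2020-03-01", "2020-03-02")),
  country = c("Canada", "Canada", "France", "France"),
  type = c("confirmed", "confirmed", "confirmed", "death"),
  cases = c(1, 2, 5, 4),
  stringsAsFactors = FALSE
)

# Keep the confirmed cases only, then sum them per country
confirmed <- mock_cases[mock_cases$type == "confirmed", ]
total_confirmed <- aggregate(cases ~ country, data = confirmed, FUN = sum)
print(total_confirmed)
```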

12.7.2.2 update_dataset()

While the CRAN version of the coronavirus package is updated every month or two, the GitHub (dev) version is updated on a daily basis. The update_dataset() function bridges this gap by bringing the installed version up to date with the most recent data available in the GitHub version:

update_dataset()

Note: you must restart the R session for the updates to become available.

12.7.2.3 refresh_coronavirus_jhu()

Alternatively, you can pull the data using the Covid19R project data standard format with the refresh_coronavirus_jhu function:

covid19_df <- refresh_coronavirus_jhu()

head(covid19_df)

12.7.3 tl;dr

library(coronavirus)

data("coronavirus")
head(coronavirus) 

update_dataset()

covid19_df <- refresh_coronavirus_jhu()
head(covid19_df)

TL;DR

# Loading the WDI package
library(WDI)

# Search all indicators with the term "GDP"
listOfIndicators <- WDIsearch("GDP")

# List the first 5 indicators
listOfIndicators[1:5,]

stockTraded <- WDI(indicator = "CM.MKT.TRAD.GD.ZS", country = c("FR", "CA", "US","CN"), start = 2000, end = 2016)

head(stockTraded)


# Loading OECD package
library(OECD)

# List all available datasets
dataset_list <- get_datasets()

# Search all indicators with the term "unemployment"
search_dataset("unemployment", data = dataset_list)

# Structure of a query
dstruc <- get_data_structure("DUR_D")
str(dstruc, max.level = 1)

# Filters used to narrow the search (Germany-France-Canada-USA; male and female; 20-24 years old)
filter_list <- list(c("DEU", "FRA", "CAN", "USA"), "MW", "2024")

# Dataframe containing selected data
unemployementOECD <- get_dataset(dataset = "DUR_D", filter = filter_list)
unemployementOECD[1:6,]


#Loading the spiR package
library(spiR)

#Get the ISO code for a specific country
mycountry <- spir_country("Canada")
mycountry

#Search for an indicator
myIndicator <- spir_indicator("mortality")
myIndicator

#Extracting the data
myData <- spir_data(country = c("USA", "FRA", "BRA", "CHN", "ZAF", "CAN"),
                    years = c("2014","2015","2016", "2017", "2018", "2019"),
                    indicators = "SPI")
head(myData)


# Loading the statcanR package
library(statcanR)

# Get data with statcan_data function
mydata <- statcan_data("27-10-0014-01", "eng")

head(mydata)


# Loading the EpiBibR package
library(EpiBibR)
epidata <- epibibr_data()

complete_data <- epibibr_data()

colson_articles <- epibibr_data(author = "Colson")

yang2020 <- epibibr_data(author = "Yang", year = "2020")

canada_articles <- epibibr_data(country = "Canada")

covid_articles <- epibibr_data(title = "covid")

yangcovid2020_articles <- epibibr_data(author = "Yang", title = "covid", year = "2020")

yangcovid2020bio_articles <- epibibr_data(author = "Yang", title = "covid", year = "2020", source = "bio")

coronavirus_articles <- epibibr_data(abstract = "coronavirus")


# Loading the coronavirus package
library(coronavirus)

data("coronavirus")
head(coronavirus) 

update_dataset()

covid19_df <- refresh_coronavirus_jhu()
head(covid19_df)

Code learned in this chapter

Command	Detail
WDIsearch()	Search for World Bank indicators
WDI()	Find data related to the indicators for each country
search_dataset()	Search for OECD indicators
get_data_structure()	Read the structure of any query made with the OECD package
get_dataset()	Find data related to the indicators
spir_data()	Find data related to the indicators for each country
spir_country()	Search for a country’s ISO code
spir_indicator()	Search for a Social Progress indicator
statcan_data()	Extract data from Statistics Canada
epibibr_data()	Retrieve bibliographic data
data("coronavirus")	Get data on all coronavirus cases
update_dataset()	Get the most recent data available on the GitHub version
refresh_coronavirus_jhu()	Pull data using the Covid19R project data standard format