3 APIs for international trade and economics
APIs (Application Programming Interfaces) are an essential part of modern data science workflows. They allow programs to request data or services from other software systems over the web. In the context of data analysis, an API lets us fetch data directly from an online source (like a database or data provider) using code, rather than manually downloading files. This approach offers two big advantages over static files: freshness and reproducibility. By pulling data live from the source, we ensure we’re getting the latest available information. And by writing code to retrieve the data, we make our analysis reproducible – anyone running the same code can get the same updated dataset. In contrast, if you manually download a CSV file and load it, someone reproducing your work might use an outdated file or not know exactly how you obtained it. Embedding data access in code ensures that every result in your analysis is backed by a clear, repeatable process. This approach aligns with the principles of reproducible research advocated in modern science (Peng, 2011).
In this chapter, we explore how to import data into R in three ways: (1) reading local or static files, (2) reading directly from a URL, and (3) using APIs via dedicated R packages. We’ll especially focus on the third approach – using R packages to interact with web data APIs for international business, trade, and economic data. By the end of this chapter, you will know how to use several data access packages (WDI, OECD, spiR, statcanR, EpiBibR, coronavirus, imfr, comtradr, etc.) to obtain up-to-date international datasets in R. Throughout, we provide step-by-step examples and small exercises to reinforce the concepts and demonstrate how these tools support transparent, up-to-date analysis.
3.1 Three Ways to Import Data
Data can be imported into R in several ways:
1. From a local file: For example, you can use `read_csv()` (from the readr package) to load a CSV file from your computer.

```r
library(readr)
df <- read_csv("data.csv")
```

This will read the file `data.csv` from your working directory into an R data frame `df`. Make sure the working directory is set to where your file is, or provide the full file path.

2. Directly from a URL: You can provide an `http://` or `https://` URL to `read_csv()` (or a similar function) to read data straight from the web. This saves you the step of manually downloading the file. For example:

```r
library(readr)
data_panel <- read_csv("https://example.com/datasets/panel_data.csv")
```

This one-liner fetches the CSV hosted at the given URL and imports it into R as the data frame `data_panel`. No local file needed – R downloads the data on the fly.

3. Via an API using R packages: Many organizations provide data through web APIs, and R packages can retrieve this data directly. Using an API involves sending a query (often through an R function that wraps a web request) and receiving data, usually in JSON or CSV format, which the package then parses into a data frame. We will explore several such packages – WDI, OECD, spiR, statcanR, EpiBibR, coronavirus, as well as packages for IMF data and trade data – and how to use them to get data from their respective sources. This approach requires learning a package’s functions, but it provides powerful capabilities to search and download data programmatically.
Each method has its uses. If you have a one-time static dataset (say an Excel or CSV file from a colleague), method 1 is straightforward. Method 2 is great for quickly grabbing a publicly hosted file without saving it manually. Method 3 shines when you need data that updates regularly or want to automate data gathering from official sources. It also keeps your analysis pipeline reproducible: anyone can run your script to fetch the latest data, without you having to package the data file with your code. In the remainder of this chapter, we’ll dive into method 3, using specific R packages to access a variety of international datasets via their APIs.
3.2 WDI (World Development Indicators)
The WDI package provides access to the World Bank’s World Development Indicators database, which contains over 1,600 time-series development indicators for more than 200 countries. This includes economic indicators (GDP, trade, inflation, etc.), population statistics, education metrics, and many other socio-economic measures collected by the World Bank. Instead of downloading data from the World Bank website, we can use the WDI R package to search for indicators and retrieve country-level data directly through the World Bank API. The World Bank’s data are a go-to source in development economics research (Acemoglu, Johnson & Robinson, 2001), making it convenient to access them in R.
Before using WDI, make sure the package is installed (`install.packages("WDI")`) and then load it with `library(WDI)`. We’ll demonstrate two key functions from the WDI package: one to search for indicator codes, and one to download the data.
WDIsearch()
Often, we might know the name or topic of an indicator (e.g., GDP per capita or life expectancy), but the API requires a specific indicator code. The function `WDIsearch()` helps you find indicator codes by keyword. It takes a text string as input and returns a data frame of indicator names and their corresponding codes that contain that keyword.
For example, suppose we want to find indicators related to GDP:
```r
library(WDI)

# Search all indicators with the term "GDP"
indicators <- WDIsearch("GDP")

# Show the first 5 results
indicators[1:5, ]
```
This will query the World Bank’s indicator list for any indicator whose name or description contains “GDP”. The result (`indicators`) is a data frame where each row is an indicator. It typically includes columns for the indicator name and its code (among others). In our example, the first few results might include entries like GDP (current US$), GDP per capita (current US$), etc., along with their codes. For instance, you might see an indicator named “GDP per capita (current US$)” with a code like NY.GDP.PCAP.CD. Once you identify the exact indicator you need, you’ll use its code in the data retrieval function.
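You can also narrow a broad search in R itself. Here is a minimal sketch (older WDI versions return a character matrix rather than a data frame, so we coerce first; the `name` column is part of the search results described above):

```r
# Coerce to a data frame (older WDI versions return a character matrix)
indicators <- as.data.frame(indicators)

# Keep only the per-capita GDP series
gdp_pc <- indicators[grep("per capita", indicators$name, ignore.case = TRUE), ]
head(gdp_pc)
```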
Try it yourself: Use `WDIsearch()` with a different term to explore other indicators. For instance, try searching for `"population"` or `"life expectancy"`. What indicator codes do you find for those topics?
WDI()
After finding the indicator code of interest, you can use `WDI()` to download the actual data. The `WDI()` function requires at minimum an indicator code and one or more country codes. You can also specify a time range with `start` and `end` years. The function will return a data frame of the requested indicator values for the given countries and years.
Let’s walk through an example. Suppose we want the World Bank data on “Stocks traded, total value (% of GDP)” for a few countries. The indicator code for that series is CM.MKT.TRAD.GD.ZS (we could have found this via `WDIsearch("stocks traded")`, for example). We’ll retrieve this indicator for four countries – France, Canada, the USA, and China – over the period 2000 to 2016:
```r
library(WDI)

# Download % of GDP traded as stocks for FR, CA, US, CN from 2000 to 2016
stock_traded <- WDI(indicator = "CM.MKT.TRAD.GD.ZS",
                    country = c("FR", "CA", "US", "CN"),
                    start = 2000,
                    end = 2016)
head(stock_traded)
```
When you run this, the WDI package contacts the World Bank API and pulls the data for the specified indicator, countries, and years. The resulting data frame `stock_traded` will include columns such as country (typically an ISO 2-letter or 3-letter code and/or country name), year, the indicator value, and possibly additional metadata like country region or indicator name. Calling `head(stock_traded)` shows the first few rows: each row represents one country-year observation (for example, one row might be France in 2000 with the value of stocks traded as % of GDP).
It’s worth noting that if an indicator isn’t available for some years or countries, the returned value will be `NA` for those entries. Also, you can request multiple indicators at once by passing a vector of indicator codes to `WDI()` – the result will have one column per indicator in that case, as in the sketch below.
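A quick illustration of the multi-indicator call, using two indicator codes one could find via `WDIsearch()` (GDP per capita and life expectancy at birth):

```r
# Two indicators in one call; the result has one column per indicator code
multi <- WDI(indicator = c("NY.GDP.PCAP.CD", "SP.DYN.LE00.IN"),
             country = c("FR", "CA"),
             start = 2010,
             end = 2020)
head(multi)
```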
Finally, remember that the data retrieved is as up-to-date as the World Bank’s database. If you run the same WDI query later, you might get more recent years as the World Bank updates its indicators annually or quarterly depending on the series. This makes `WDI()` very powerful for keeping analyses current.
Try it yourself: Find the indicator code for GDP per capita (current US$) using `WDIsearch()`, and then use `WDI()` to download GDP per capita for a few countries of your choice over the last 10 years. Check the resulting data frame to see which columns are included and how the data is structured.
3.3 OECD Data
The OECD R package allows you to search and retrieve data from the OECD (Organisation for Economic Co-operation and Development) databases via their public API. The OECD database includes a wide range of economic and social indicators (nearly 300 datasets across around a dozen categories) for OECD member countries. Examples include labor statistics, economic outlook indicators, productivity and trade figures, education metrics, and more. Unlike the World Bank WDI (which is organized by indicator), the OECD API is organized by datasets. Each dataset contains a collection of related indicators, often multi-dimensional (e.g., broken down by country, year, gender, etc.). OECD data are commonly used to compare policy outcomes across advanced economies in academic research (for example, studies of unemployment or productivity often leverage OECD statistics).
Before using the OECD package, ensure it’s installed (`install.packages("OECD")`) and load it with `library(OECD)`. A typical workflow is: (a) find the dataset you need, (b) inspect its structure (to understand what dimensions and codes it uses), and (c) retrieve data (optionally filtering on dimensions like country or years). We’ll go through these steps with the OECD package’s key functions: `get_datasets()`, `search_dataset()`, `get_data_structure()`, and `get_dataset()`.
First, it’s often useful to see what datasets are available. The function `get_datasets()` fetches a list of all available OECD datasets. For example:
```r
library(OECD)
dataset_list <- get_datasets()
```
Here, `dataset_list` will be a data frame where each row is a dataset available via the OECD API. It typically has columns for the dataset ID and a brief description. This list can be quite long, so you might not want to print it in full. Instead, you can search within it for keywords (see the sketch below), or use the dedicated search function described next.
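A minimal sketch of searching the list yourself with base R (assuming the `title` column that `get_datasets()` returns alongside the dataset ID):

```r
# Keep only datasets whose title mentions "unemployment"
dataset_list[grepl("unemployment", dataset_list$title, ignore.case = TRUE), ]
```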
search_dataset()
The function `search_dataset()` helps you find OECD datasets related to a keyword. You provide a search string (e.g., “unemployment”) and it returns matching dataset IDs and titles. By default, `search_dataset()` will search across all OECD datasets. You can optionally supply the `data` argument to limit the search to a specific list (for instance, the `dataset_list` you retrieved earlier).
For example, to find datasets related to unemployment:
```r
# Search all datasets with "unemployment" in their title or description
search_results <- search_dataset("unemployment", data = dataset_list)
```
This will look through the dataset list for any entry whose name or description contains “unemployment”. The result `search_results` will list those datasets, showing their dataset ID and title. From there, you might identify a dataset of interest – for instance, you might see an entry like “DUR_D – Duration of unemployment” (just as an example). Suppose `DUR_D` is a dataset that contains unemployment duration statistics by country, gender, and age group. We would then proceed to learn about its structure and get data from it.
get_data_structure()
Once you have a dataset ID (for example, `"DUR_D"`), the next step is to retrieve its data structure (metadata) using `get_data_structure(dataset_id)`. This function returns information about the dataset’s dimensions and valid codes. Essentially, it tells us what breakdowns the dataset has (e.g., Country, Time, Gender, Age Group, etc.) and what the allowed values are for each (e.g., country codes, year ranges, category codes, etc.).
For example:
```r
# Get the structure (metadata) of a specific OECD dataset (e.g., "DUR_D")
dstruc <- get_data_structure("DUR_D")
str(dstruc, max.level = 1)
```
Here, `dstruc` will be a list (or another structured object) containing metadata for the dataset DUR_D. We use `str()` to peek at its structure (limiting to one level for brevity). Typically, this metadata includes multiple components – often one component per dimension of the dataset. For instance, `dstruc` might have elements like `$COUNTRY`, `$SEX`, `$AGE`, `$TIME_PERIOD`, each containing the codes and labels available for that dimension. By examining this, you learn how to formulate your data query. For example, you might find that countries are identified by 3-letter ISO codes (USA, CAN, FRA, etc.), genders by codes like M/F or MW (Male, Female, or Male+Female combined), age groups by codes like “2024” (which might stand for the 20–24 age cohort), and so on.
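For example, to see the codes behind one dimension (a minimal sketch; the element name depends on the dataset, here we assume the `$COUNTRY` component described above):

```r
# Inspect the codes and labels for a single dimension (assumed to be $COUNTRY)
head(dstruc$COUNTRY)
```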
Pro tip: Checking the data structure before downloading data is very useful. It prevents guesswork and frustration. You need to know the exact codes to use for filtering. For instance, a dataset might use `"USA"` for United States or it might use `"US"`; it might use `"ALL"` or `"TOT"` for an aggregate category (like both genders combined), or a code like `"MW"` for Male+Female; years might be labeled as `2020` or as a full date `2020-01-01` depending on the dataset. The `get_data_structure()` output will reveal these details.
get_dataset()
After identifying the dataset and understanding its structure, `get_dataset()` is used to download the actual data. You call `get_dataset()` with the dataset ID and optionally provide a `filter` list to narrow down which slices of data you want. If no filter is provided, the function will attempt to retrieve all data in the dataset – which can be extremely large, so it’s generally better to filter to only what you need.

A filter in the OECD API context is essentially a selection for each dimension of the dataset. The filter is provided as a list of values, one element per dimension (in the dataset’s default order of dimensions). For example, suppose DUR_D (our example unemployment duration dataset) has three dimensions: Country, Sex, AgeClass (Time is typically a dimension too, but often you don’t specify time in the filter because you can retrieve all years or specify a time range separately). If we want to retrieve data for specific countries, a specific combined gender category, and a specific age group, our filter list might look like this:
```r
# Specify filters for the Country, Sex, and AgeClass dimensions
filter_list <- list(
  c("DEU", "FRA", "CAN", "USA"),  # Countries: Germany, France, Canada, USA
  "MW",                           # Sex: "MW" might denote Male+Female combined
  "2024"                          # Age group: "2024" for ages 20-24
)

# Retrieve the filtered data for dataset "DUR_D"
unemployment_data <- get_dataset(dataset = "DUR_D", filter = filter_list)
head(unemployment_data)
```
In this example, we filtered the Duration of Unemployment dataset to only include four countries (DEU, FRA, CAN, USA), the combined-gender category (assuming MW stands for Male+Female combined – one would confirm that from the data structure), and the age class 20–24 years (code “2024”). The call to `get_dataset()` then returns a data frame `unemployment_data` with the data matching those criteria. Each row in `unemployment_data` would typically have columns for Country, Sex, AgeClass, Time (Year), and the value of the unemployment duration indicator (plus possibly indicator names or units). The `head()` call displays the first few rows to give us a sense of the output.
When using `get_dataset()`, the order of the filters in the list is critical – it must follow the order of dimensions that the dataset expects. The metadata from `get_data_structure()` tells you this order. In our example, if the order was (Country, Sex, AgeClass, Time), we provided three filters for Country, Sex, and AgeClass, and implicitly we got all available Time (years) because we didn’t include time in the filter. If we had flipped the order of, say, country and sex in the list, the query would be mis-specified and likely return an error or no data. So always align your filter list with the dataset’s dimensions in the given sequence.
Another note: if you call `get_dataset("DUR_D")` with no filters at all, it will try to retrieve the entire dataset. That could be a huge amount of data (potentially thousands of series or millions of rows), which is usually not practical to pull into R. It’s better to constrain the query as shown. If you really do need all data, you might have to retrieve it in chunks or ensure you have enough memory and time.
After retrieving the data, you can treat it like any other R data frame: clean it, merge it with other data, visualize it, etc. The benefit is that you pulled it directly from the authoritative source, and you can update it easily by re-running the code whenever needed.
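For instance, a minimal post-processing sketch. The column names here (`COUNTRY`, `obsValue`) are assumptions about the returned frame, so confirm them with `names(unemployment_data)` first:

```r
# A sketch of typical post-processing; column names are assumptions,
# so check names(unemployment_data) before running
library(dplyr)

unemployment_data %>%
  mutate(obsValue = as.numeric(obsValue)) %>%   # values often arrive as character
  group_by(COUNTRY) %>%                         # one group per country code
  summarise(avg_duration = mean(obsValue, na.rm = TRUE))
```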
Try it yourself: Using the OECD package, search for a dataset related to, say, inflation or GDP. Use `search_dataset("inflation", data = dataset_list)` to find a relevant dataset ID. Then use `get_data_structure()` on that ID to see what dimensions it has. Finally, use `get_dataset()` to retrieve a subset of that data – for example, inflation rates for a few countries over a certain time period. Examine the returned data frame to see how the data is organized.
3.5 statcanR (Statistics Canada Data)
The statcanR package provides a convenient interface to Statistics Canada’s open data API. Statistics Canada (often abbreviated StatCan) offers a vast array of data tables (formerly known as CANSIM tables) covering about 30 different subjects – such as agriculture, energy, education, health, economics, and more – at various geographic levels (country, provinces, cities, etc.). If your analysis involves Canadian data, StatCan’s online portal has a wealth of information, and statcanR helps you pull that data directly into R.
Using statcanR typically involves two steps:
- Find the table ID for the data you want.
- Retrieve the data using that table ID with `statcan_data()`.
There are thousands of data tables, each identified by a unique table number (also called a product ID, or PID). An example of a table ID is something like 27-10-0014-01 – which happens to correspond to a table about federal expenditures on science and technology by socio-economic objective (we’ll use that as a running example). These table IDs often have a format with two digits, then two digits, then four digits, then two digits (as in 27-10-0014-01). The format isn’t important except as an identifier; what matters is finding the correct ID for the data you need.
Finding a StatCan Table ID
To find the table ID for the data you need, you have a couple of options. One is to use the Statistics Canada website’s search or browse features. The StatCan website has a Data search portal where you can enter keywords related to the dataset you want, and it will list matching tables. Once you find the table in the web interface, the site will show its table number (ID). For example, if we wanted “federal expenditures on science and technology by socio-economic objective”, we could search those terms on the site and likely find a matching table with ID 27-10-0014-01.
Alternatively, statcanR provides a function `statcan_search()` that lets you search the database from within R, similar to how we searched the WDI and OECD databases. For example, you can do:
```r
library(statcanR)
statcan_search(c("federal", "expenditures"), "eng")
```
This would search (in English, since we specified `"eng"` for the language) for StatCan tables whose description contains both “federal” and “expenditures”. The result would include our target table among possibly others, showing the table title and the table ID (e.g., “27-10-0014-01”). You can adjust the keywords or switch to French (`lang = "fra"`) if needed to find other tables (sometimes the French descriptions might catch different keywords).
Whichever method you use (website or `statcan_search()`), the key outcome is identifying the exact table ID you need.
statcan_data()
Once you have the table number (product ID) of interest, you can use the `statcan_data()` function to fetch the data via the API. The `statcan_data()` function takes the table ID and an optional language parameter. By default, it returns English labels for data fields, but you can request French by using `lang = "fra"` if desired (for example, if you want column headings and category labels in French).
For example, using the table ID from above (27-10-0014-01):
```r
library(statcanR)

# Download StatCan table 27-10-0014-01 (Federal S&T expenditures) in English
my_data <- statcan_data("27-10-0014-01", lang = "eng")
head(my_data)
```
This will retrieve the dataset identified by `"27-10-0014-01"` – in our example, Federal expenditures on science and technology, by socio-economic objectives. The data comes as a data frame (`my_data`). Each StatCan table has its own structure, but generally the columns will include the dimensions of that table (e.g., Year, Geography, some Category or Classification, and the Value or Measure). In this case, we might expect columns like Year, maybe a breakdown by type of expenditure or objective, and the expenditure amount. The `head(my_data)` output will show the first few rows and the column names. You’ll likely see human-readable labels for categories because the API provides labeled data (which is why specifying `lang` matters – it returns English category names here, such as “Total expenditure” or specific objective names, rather than codes).
One great thing about statcanR is that it handles behind-the-scenes steps for you: downloading the data file (StatCan often provides data as a CSV or XML inside a zip), unzipping it, and reading it into R. The single `statcan_data()` call does all that, so you don’t have to manually navigate the StatCan website or deal with file downloads and imports.
By default, `statcan_data()` gives you the entire table. If it’s a large table, that could be a lot of data (potentially many thousands of rows for very detailed tables). Keep an eye on the size of `my_data` (for instance, use `nrow(my_data)` to see how many rows were returned). If it’s too large for your needs, you might filter it in R after downloading, or (if the API supports it) query a subset by specifying certain dimensions. In many cases, however, StatCan tables are moderate in size, or you only need a subset which you can filter in R after retrieval.
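For example, a minimal filtering sketch. `REF_DATE` is a common date column in StatCan tables, but it is an assumption here – confirm the actual names with `names(my_data)`:

```r
# Keep only observations from 2015 onward; REF_DATE is assumed to be the
# usual StatCan date column -- check names(my_data) to confirm
recent <- subset(my_data, REF_DATE >= "2015")
nrow(recent)
```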
(Advanced note: Some other R packages like {cansim} or {tidycansim} also exist to interface with Statistics Canada data and provide additional helpers (e.g., automatically converting to tibbles, or normalizing date formats). The statcanR package used here works well for direct access, especially for the tables curated in this context, but for very large tables or more complex manipulations, those other packages can be handy.)
Try it yourself: Suppose you want Canadian data on unemployment rates by province. You could go to StatCan’s website and search for “unemployment rate province table” to find a table ID. For example, you might find something like 14-10-0294-02, which corresponds to unemployment rates by province. Once you have the ID, use `statcan_data("14-10-0294-02", lang = "eng")` to download it. Check the first few rows to see what columns it has. Can you identify the province names (or codes), and the time period of each observation? This exercise gives you practice in finding and retrieving a StatCan data table.
3.6 EpiBibR (COVID-19 Literature Bibliography)
EpiBibR is an R package (backed by an API) that provides access to a large bibliographic database of publications related to COVID-19 and other epidemiological research. In response to the global COVID-19 pandemic, the EpiBibR project compiled tens of thousands of references from sources like PubMed and other databases. The dataset started at about 20,000 references early in the pandemic and has since grown enormously – by April 2022 it contained roughly 180,000 references, and it continues to update regularly (with new publications added daily). The name EpiBibR stands for “Epidemiology Bibliography for R.” Essentially, it’s a specialized database of academic papers and reports about COVID-19 (and related health topics) that you can query from R, which is incredibly useful for literature reviews or tracking the evolution of scientific research. The explosion of COVID-19 related publications has been noted as a challenge for researchers to keep up with (Palayew et al., 2020), and tools like EpiBibR can help manage that information overload.
So what’s in this bibliographic database? Each entry (row) is a reference to a publication, with various fields such as title, authors, journal, year, and so on. The table below outlines some of the main fields (tags) in the bibliographic data and their meanings:
| Field Tag | Description |
|---|---|
| AU | Authors (names of the authors) |
| TI | Title (of the article or document) |
| AB | Abstract (summary of the paper) |
| PY | Publication Year |
| DT | Document Type (e.g., Journal Article, Preprint, etc.) |
| MESH | MeSH Terms (Medical Subject Headings, standardized keywords) |
| TC | Times Cited (citation count) |
| SO | Source (publication name, e.g., journal title) |
| J9 | Source Abbreviation (often the 9-character source title) |
| JI | ISO Source Abbreviation |
| ISSN | International Standard Serial Number (journal identifier) |
| VOL | Volume (journal volume, if applicable) |
| ISSUE | Issue Number (if applicable) |
| LT | Language (of the publication, e.g., EN for English) |
| C1 | Author Address (affiliations) |
| RP | Reprint Address (contact address for corresponding author) |
| ID | PubMed ID (if available) |
| DE | Author Keywords (keywords provided by the authors) |
| UT | Unique Article Identifier (an internal or database ID) |
| AU_CO | Author’s Country (country of author affiliation) |
| DB | Database Source (which repository or source the entry came from) |
Many of these fields follow standard bibliographic conventions (similar to formats like BibTeX or Medline tags). For example, AU_CO indicates the country affiliation of an author, MESH terms are the standardized medical subject headings assigned to the article, TC is a citation count, and UT might be a Web of Science unique ID or similar. Not every entry will have all fields (for instance, very new preprints might not have a Times Cited yet, or some records might lack an abstract if it wasn’t available), but this gives an idea of the scope of information available for each reference.
The main function to access this bibliographic database in R is `epibibr_data()`. This function can retrieve references and allows filtering by various criteria via its arguments (such as author name, author’s country, year of publication, keywords in the title, keywords in the abstract, source name, etc.). If called with no arguments at all, `epibibr_data()` will attempt to return the entire bibliographic dataset:
```r
library(EpiBibR)

# Retrieve the entire COVID-19 bibliographic dataset (all references)
all_refs <- epibibr_data()
```
Be cautious with the above command! The complete dataset is very large (hundreds of thousands of entries). Retrieving it in full will take a long time and consume a lot of memory. In practice, you will usually want to filter your query to a specific subset of interest rather than pulling everything.
More typically, you’ll query for a specific subset of references. Here are several ways to use `epibibr_data()` with filters:
- By author: You can find all publications by a given author. For example, `epibibr_data(author = "Colson")` will return all entries where an author’s name contains “Colson” (such as publications by Philippe Colson). The search is case-insensitive and will match partial names, so it could return any paper authored by someone with “Colson” in their name.
- By author and year: You can combine filters. For example, `epibibr_data(author = "Yang", year = "2020")` will return references where an author’s name contains “Yang” and the publication year is 2020. This would retrieve, say, all papers authored by people named Yang that were published in 2020.
- By author’s country: To get references where at least one author’s country of origin (affiliation) is Canada, use `epibibr_data(country = "Canada")`. This filter uses the AU_CO field and will return entries for which an author is associated with Canada. This can be useful if you want to see the contribution of researchers from a certain country or analyze country-specific research output.
- By keywords in title: If you want papers with a certain keyword in the title, use the `title` argument. For instance, `epibibr_data(title = "vaccine")` returns references whose titles contain “vaccine”. (This would catch titles like “COVID-19 vaccine development…”, “…vaccine efficacy…”, etc. The search is not case-sensitive.)
- By keywords in abstract: Similarly, use the `abstract` argument to search within abstracts. For example, `epibibr_data(abstract = "coronavirus")` finds papers that mention “coronavirus” in the abstract text. This is useful for capturing papers that might not have COVID in the title but are still about coronaviruses or COVID-19.
- Combining multiple criteria: You can specify multiple arguments at once to narrow down the search. The filters will be combined (essentially an AND of all conditions). For example, `epibibr_data(author = "Yang", title = "COVID", year = "2020")` would retrieve articles authored by someone named Yang that have “COVID” in the title and were published in 2020. That’s a very specific query, but it illustrates how multiple filters work together. You can even add more filters, such as the source (journal) name. For instance, extending the above example: `epibibr_data(author = "Yang", title = "COVID", year = "2020", source = "Lancet")` would further restrict to those published in sources whose name contains “Lancet” (so it would find papers from The Lancet or Lancet Infectious Diseases, etc.). In that case, you’d get papers by authors named Yang, from 2020, with “COVID” in the title, and published in a journal with “Lancet” in its name.
All these queries use the same `epibibr_data()` function, just with different arguments. This unified interface makes it quite flexible to get exactly the subset of bibliographic data you need. The returned result is a data frame where each row is a reference (a publication), with columns for all those fields (author, title, year, journal, etc.). You can then further analyze or export this information as needed. For example, you might count how many papers match your query or see which journals are most represented.

After retrieving a subset, you can use regular R tools to examine it. For instance, if you want to know how many results you got, you could do `nrow(my_results)`. Or to see the unique journals in that subset, you might do `unique(my_results$SO)` (if `SO` is the Source column). This can help answer questions like “How much literature was published on topic X in year Y?” or “Who are the most prolific authors in this subset of papers?”.
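A small sketch along those lines, reusing the author filter shown earlier and the SO (Source) field from the table above:

```r
# How many papers have "Colson" among the authors, and where were they published?
colson_refs <- epibibr_data(author = "Colson")
nrow(colson_refs)                                     # number of matching references
head(sort(table(colson_refs$SO), decreasing = TRUE))  # most frequent sources (SO field)
```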
Try it yourself: Suppose you want to find how much literature was published in 2021 about vaccines for COVID-19. You could try a query like `epibibr_data(title = "vaccine", year = "2021")`. This would fetch references from 2021 whose titles contain “vaccine”. How many results does it return? Which journals or authors appear frequently in those results? This kind of query gives you a quick overview of the literature on a specific topic.
3.8 IMF Data via R APIs
International economic research often relies on macroeconomic and financial data provided by the International Monetary Fund (IMF). The IMF maintains several extensive databases, such as the World Economic Outlook (WEO), International Financial Statistics (IFS), Balance of Payments Statistics, Government Finance Statistics, and more. These contain historical time series on GDP, inflation, government budgets, trade balances, exchange rates, and many other economic indicators across most countries. The IMF makes these data available through an API, and there are R packages to access them, notably imfr (by Christopher Gandrud) and the newer imf.data package. By using these packages, one can directly query IMF databases for specific indicators and countries, ensuring that analyses use the most current figures released by the IMF. The IMF’s datasets are central in international macroeconomic research and policy analysis – for example, studies of fiscal policy often use IMF data on growth and government spending forecasts (Blanchard & Leigh, 2013).
We will illustrate using the imfr package, as it provides a user-friendly interface to the IMF API. Make sure to install it (`install.packages("imfr")`) and load it with `library(imfr)`.
Listing Available IMF Databases
The IMF API has many separate databases. Each is identified by a short code. To see a list of available databases, you can use:
```r
library(imfr)

# List the IMF databases available through the API
databases <- imf_ids()
head(databases)
```
The result (`databases`) will show database IDs and names. For example, you might see entries like: `IFS` (International Financial Statistics), `WEO` (World Economic Outlook), `BOP` (Balance of Payments), `GFS` (Government Finance Statistics), `DOT` (Direction of Trade Statistics), etc. Each of these has different indicators and coverage. Suppose we’re interested in the World Economic Outlook (WEO) database, which contains broad macroeconomic indicators along with IMF forecasts, or the IFS database for more high-frequency financial data.
Querying a Specific Series with imf_data()
Once we know the database, we need to know the indicator codes and country codes. Each database has its own set of series codes (similar in concept to the indicator codes in WDI or dataset codes in OECD). For example, in the WEO database, an indicator code for nominal GDP might be `"NGDPD"` (nominal GDP in USD), for real GDP growth `"NGDP_RPCH"`, and for inflation `"PCPIPCH"` (percent change in consumer prices). Country codes are often IMF-specific or ISO country codes. The WEO and some other databases use ISO 3-letter country codes (USA, CHN, FRA, etc.), but some IMF databases might use numeric codes or their own abbreviations.

The `imf_data()` function in imfr is used to query data. It requires a database ID, an indicator (or series) code, one or more countries, and a date range (start year, and optionally end year). Let’s say we want to retrieve nominal GDP in USD for a couple of countries from the WEO database:
```r
library(imfr)

# Get nominal GDP (code "NGDPD") for USA, China, and Germany
# from 2010 to 2020, using the WEO database
gdp_data <- imf_data(database_id = "WEO",
                     indicator = "NGDPD",
                     country = c("USA", "CHN", "DEU"),
                     start = 2010,
                     end = 2020)
head(gdp_data)
```
In this example, `database_id = "WEO"` specifies we want the World Economic Outlook database. The `indicator = "NGDPD"` specifies nominal GDP in current USD (we determined that code from WEO documentation or searching within the WEO dataset). We requested `country = c("USA", "CHN", "DEU")`, which are the ISO codes for the United States, China, and Germany. The start and end years define the period 2010–2020. The result `gdp_data` should be a data frame containing the GDP values for those countries and years.
If we inspect `head(gdp_data)`, we might see columns like `WEO.Country.Code`, `ISO`, `Country`, `WEO.Subject.Code`, `Subject Descriptor`, `Units`, `Scale`, `Year`, and `Value` – or a subset of those. The structure depends on the database; WEO often returns rich metadata along with the values. For easier use, one can clean or select relevant columns (for instance, selecting ISO, Year, and Value).
Another example: the IFS database (International Financial Statistics) has more granular series. If we wanted to get, say, the Consumer Price Index (CPI) for a country (which is often monthly or quarterly in IFS), we would identify the CPI series code and the country code in IFS and query similarly with `imf_data(database_id = "IFS", ...)`. The imfr package includes helper functions like `imf_codelist()` to explore available series within a database if needed, as sketched below.
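A quick sketch (assuming the same imfr version whose `imf_ids()` and `imf_data()` functions are used above):

```r
# List the code lists (dimensions) that the IFS database defines
ifs_codelist <- imf_codelist(database_id = "IFS")
head(ifs_codelist)
```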
The IMF API might require an API key for large queries, but for many basic queries, it works without one or with a shared guest key. If you plan to use it extensively, you can register for a free API key on the IMF data portal and provide it to the R package (imfr can be told the key via an argument or environment variable).
The power of using the IMF API through R is that you can programmatically gather macroeconomic data for many countries and series in one go. This ensures consistency (you’re getting all data from the same source and update cycle) and reproducibility (anyone can run your R code to fetch the latest data). In research on international economics, it’s common to use IMF data for analyzing global economic trends, exchange rates, financial crises, etc., so being able to pull this data on demand is highly valuable. The IMF’s data is often updated periodically (WEO is updated biannually, IFS monthly or quarterly depending on series, etc.), so an analysis can be updated by simply re-running the code to get new values when they are released.
Try it yourself: Use `imf_data()` to fetch an economic indicator of your choice. For example, retrieve the inflation rate or unemployment rate for a set of countries. First, identify which IMF database might have that series (WEO contains many macro indicators; IFS contains financial and price data, etc.). Then find the series code (the documentation or an online search can help; e.g., WEO’s consumer price inflation percent change is often `"PCPIPCH"`). Then call `imf_data()` with that indicator and a few country codes. Check the returned data frame to see if it matches your expectations (units, years, etc.). This will give you practice in working with the IMF’s data API.
3.9 International Trade Data via APIs (UN Comtrade & Open Trade Statistics)
For analyses focused on international trade, we often need detailed data on trade flows between countries (exports and imports, by product categories, over time). One of the primary sources for such data is the United Nations Comtrade database, which is a repository of official international trade statistics reported by countries. It contains bilateral trade flows (e.g., country A’s exports of product X to country B) typically classified by product codes (such as Harmonized System codes) and available annually (and for some data, monthly). Rather than manually downloading trade data from the Comtrade website, we can use R to query the Comtrade API. There is an R package comtradr (by rOpenSci) that serves as a wrapper for the UN Comtrade API, making it easier to pull trade data into R. Trade flow data are crucial in international economics research, underpinning studies from gravity models of trade (Anderson & van Wincoop, 2003) to analyses of global value chains. Having direct access to these data via API allows researchers to quickly get the specific trade statistics they need for analysis.
Before using comtradr, install it (`install.packages("comtradr")`) and load it with `library(comtradr)`. The Comtrade API has some complexity in terms of parameters, but comtradr’s main function `ct_get_data()` simplifies many of these by providing sensible defaults and checks.
Basics of UN Comtrade Data and API Parameters
When querying Comtrade, you typically specify: type of trade (goods or services), frequency (annual or monthly), a commodity classification (e.g., HS – Harmonized System, or SITC, etc.), the commodity code (or “all” commodities), flow direction (export, import, re-export, etc.), the reporter country, the partner country, and the time period (years or months). Additional parameters can include mode of transport or customs regime, but those are more advanced.
The `ct_get_data()` function in comtradr allows you to specify these. It has default values that, if not changed, will query a broad set of data (which might be large). For example, by default, `commodity_code = "TOTAL"` (which means the total trade for all commodities combined) and `partner = "World"` (which means trade with all partners aggregated). These defaults effectively give total exports or imports of a country if you specify a reporter and flow.
Important: The Comtrade API has usage limits (number of calls per second and per hour) and may require an API key for extensive use. You can sign up on their website for a free API key, which increases your allowance. The comtradr package has a function `set_primary_comtrade_key("YOURKEY")` to store your key. For small queries, you may not need a key at all (the API allows some calls without a key, albeit with stricter throttling).
Example: Querying Trade Data with ct_get_data()
Let’s walk through an example. Suppose we want to get the annual exports of goods for a few major exporters (say China, USA, Germany) to the world, for recent years. Essentially, this means we want, for each of those countries, the value of exports of all goods to all partners (which is the total exports) for each year in a range.
In Comtrade terms:
- `type = "goods"`
- `frequency = "A"` (annual data)
- `commodity_classification = "HS"` (the Harmonized System, a common product classification; this is the default)
- `commodity_code = "TOTAL"` (the code for all commodities combined)
- `flow_direction = "Export"`
- `reporter`: our list of countries, by ISO3 code (CHN, USA, DEU)
- `partner = "World"` (meaning aggregate trade with the world)
- `start_date` and `end_date` for the years of interest
Here’s how we do it:
```r
library(comtradr)

# Get total exports for China, USA, Germany (reporters) to World, from 2015 to 2020
trade_data <- ct_get_data(
  type = "goods",
  frequency = "A",
  commodity_classification = "HS",
  commodity_code = "TOTAL",
  flow_direction = "Export",
  reporter = c("CHN", "USA", "DEU"),
  partner = "World",
  start_date = "2015",
  end_date = "2020"
)
head(trade_data)
```
When this query runs, comtradr will retrieve data from the Comtrade API. The result `trade_data` should contain rows corresponding to each country-year’s export value. If we inspect `head(trade_data)`, we might see columns such as Reporter (country name), ReporterISO (ISO code), Partner (likely “World” for all rows, since we chose the aggregate partner), Year, TradeValue (the trade value, typically in USD), and perhaps additional metadata like the classification or unit. Each row would be, for example, China – World – 2015 – (export value), USA – World – 2015 – (export value), and so on, through 2020.
This gives us the total export values. If we wanted imports instead, we’d set `flow_direction = "Import"`. If we wanted data for a specific trading partner (e.g., China’s exports to the USA), we could specify `partner = "USA"` instead of World, and that would give bilateral trade rather than global totals.
We can also query more granular commodity data. For example, if we wanted to know the trade in a specific product category, we could set `commodity_code` to an HS code (or multiple codes). For instance, HS code 8703 is “Motor cars and other motor vehicles for passenger transport”. We could query exports of cars from, say, Japan to the world by specifying `commodity_code = "8703"`, `reporter = "JPN"`, `partner = "World"`, and so on. The result would then be the yearly export value of cars from Japan.
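As a sketch, the car-exports query just described would look like this, mirroring the parameters of the earlier call:

```r
# Japan's exports of cars (HS code 8703) to the world, 2015-2020
japan_cars <- ct_get_data(
  type = "goods",
  frequency = "A",
  commodity_classification = "HS",
  commodity_code = "8703",
  flow_direction = "Export",
  reporter = "JPN",
  partner = "World",
  start_date = "2015",
  end_date = "2020"
)
head(japan_cars)
```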
The comtradr package also provides some functions to look up reference data, like `ct_country_lookup("keyword")` to find country codes or `ct_commodity_lookup("car")` to find commodity codes by keyword. However, these functions might not cover all classification versions, so sometimes it’s necessary to know or look up the exact code externally.
Open Trade Statistics (OTS): An alternative way to access trade data is via the Open Trade Statistics API and its R package tradestatistics (Vargas, 2020). OTS is an initiative to simplify access to cleaned and aggregated trade data (it builds on Comtrade data but provides a simpler interface for common queries). The `tradestatistics` package allows queries for trade flows without dealing with some of the complexities of the raw Comtrade API. For example, it can retrieve a country’s export or import totals or product-level trade for given years in a tidy format. If you find the Comtrade API too granular or complicated for quick use, `tradestatistics` can be a friendlier option for many purposes.
Here’s a quick example using `tradestatistics` (assuming you have it installed; its main query function is `ots_create_tidy_data()`):

```r
library(tradestatistics)

# Get yearly trade totals for China, USA, Germany from 2015 to 2020 via OTS
# (table "yr" returns reporter-year totals, with partners aggregated)
ots_data <- ots_create_tidy_data(
  years = 2015:2020,
  reporters = c("chn", "usa", "deu"),
  table = "yr"
)
head(ots_data)
```
The result `ots_data` might have columns like year, reporter_iso, export_value, and import_value for each reporter-year requested. OTS data come already cleaned and aggregated by default.
Whether you use the direct Comtrade API via comtradr or the OTS via tradestatistics, you are leveraging APIs to get exactly the trade data needed. This avoids manually downloading massive CSV files of trade data and then filtering them. Instead, the filtering happens on the server side (at the API), and you get a manageable chunk of data in R to work with.
In international trade research, working with bilateral trade data is common. For example, the gravity model of trade – a fundamental model in trade economics – analyzes bilateral trade flows between countries and relates them to factors like GDP, distance, and trade barriers (Head & Mayer, 2014). These analyses require assembling trade flow data (often from Comtrade or similar) along with other country variables. Being able to programmatically pull, say, all pairs of trade flows for a given year or all years greatly eases the assembly of such datasets.
Try it yourself: Let’s say you want to examine how a country’s exports of a particular commodity have grown over time. Pick a country and a product category you’re interested in (for example, Brazil and soybeans, or Germany and automobiles). Find the HS code for that product (you might search for an HS code list online). Then use `ct_get_data()` in comtradr to query annual export values for that country (`reporter`) to the world (`partner = "World"`) for that HS code over a range of years. Inspect the trend of the export values – do you see growth over time? This exercise gives you practice in pulling a specific slice of trade data.
Each of the packages and APIs above demonstrates a way to access external data sources through R. By using APIs and specialized R packages, you can automate data retrieval for up-to-date information without manual downloading, which is crucial for keeping analysis pipelines current. In this chapter, we introduced how to search for datasets or indicators and fetch data from the World Bank (WDI), OECD, Social Progress Imperative (spiR), Statistics Canada (statcanR), a COVID-19 literature database (EpiBibR), COVID-19 case data (coronavirus package), as well as data from the IMF and international trade data from UN Comtrade (via comtradr) and Open Trade Statistics. These tools greatly expand your ability to gather data for analysis directly within R, making your data science workflow more efficient and reproducible.
Whenever you start a new analysis, think about whether an API could provide the data you need. If so, investing a bit of time in learning the R package for that API will pay off in the long run: your code will directly pull the latest data and anyone else running it will be able to get the same data without hunting down files. This ensures transparency (everyone knows exactly what data was used and how it was obtained) and reproducibility (the analysis can be re-run at any time, even as data updates). In an international business or economics context – where data can change rapidly, and where one often needs to combine information from multiple countries or sources – these skills will enable you to keep your analyses current, pedagogically clear, and robust for both you and any collaborators or readers of your work.
References
- Acemoglu, D., Johnson, S., & Robinson, J. (2001). The colonial origins of comparative development: An empirical investigation. American Economic Review, 91(5), 1369–1401.
- Anderson, J. E., & van Wincoop, E. (2003). Gravity with gravitas: A solution to the border puzzle. American Economic Review, 93(1), 170–192.
- Blanchard, O., & Leigh, D. (2013). Growth forecast errors and fiscal multipliers. American Economic Review (Papers & Proceedings), 103(3), 117–120.
- Dong, E., Du, H., & Gardner, L. (2020). An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases, 20(5), 533–534.
- Head, K., & Mayer, T. (2014). Gravity equations: Workhorse, toolkit, and cookbook. In G. Gopinath, E. Helpman, & K. Rogoff (Eds.), Handbook of International Economics (Vol. 4, pp. 131–195). Elsevier.
- Jones, C. I., & Klenow, P. J. (2016). Beyond GDP? Welfare across countries and time. American Economic Review, 106(9), 2426–2457.
- Palayew, A., Norgaard, O., Safreed-Harmon, K., Andersen, T. H., & Rasmussen, L. N. (2020). Pandemic publishing poses a new COVID-19 challenge. Nature Human Behaviour, 4(7), 666–669.
- Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226–1227.
- Stiglitz, J. E., Sen, A., & Fitoussi, J. P. (2009). Report by the Commission on the Measurement of Economic Performance and Social Progress. (Commission report).