3 APIs for international trade and economics
APIs (Application Programming Interfaces) are an essential part of modern data science workflows. They allow programs to request data or services from other software systems over the web. In the context of data analysis, an API lets us fetch data directly from an online source (like a database or data provider) using code, rather than manually downloading files. This approach offers two big advantages over static files: freshness and reproducibility. By pulling data live from the source, we ensure we’re getting the latest available information. And by writing code to retrieve the data, we make our analysis reproducible – anyone running the same code can get the same updated dataset. In contrast, if you manually download a CSV file and load it, someone reproducing your work might use an outdated file or not know exactly how you obtained it. Embedding data access in code ensures that every result in your analysis is backed by a clear, repeatable process. This approach aligns with the principles of reproducible research advocated in modern science (Peng, 2011).
In this chapter, we explore how to import data into R in three ways: (1) reading local or static files, (2) reading directly from a URL, and (3) using APIs via dedicated R packages. We’ll especially focus on the third approach – using R packages to interact with web data APIs for international business, trade, and economic data. By the end of this chapter, you will know how to use several data access packages (WDI, OECD, spiR, statcanR, EpiBibR, coronavirus, imfr, comtradr, etc.) to obtain up-to-date international datasets in R. Throughout, we provide step-by-step examples and small exercises to reinforce the concepts and demonstrate how these tools support transparent, up-to-date analysis.
3.1 Three Ways to Import Data
Data can be imported into R in several ways:
1. From a local file: For example, you can use `read_csv()` (from the readr package) to load a CSV file from your computer.

```r
library(readr)
df <- read_csv("data.csv")
```

This will read the file `data.csv` from your working directory into an R data frame `df`. Make sure the working directory is set to where your file is, or provide the full file path.

2. Directly from a URL: You can provide an `http://` or `https://` URL to `read_csv()` (or a similar function) to read data straight from the web. This saves you the step of manually downloading the file. For example:

```r
library(readr)
data_panel <- read_csv("https://example.com/datasets/panel_data.csv")
```

This one-liner fetches the CSV hosted at the given URL and imports it into R as the data frame `data_panel`. No local file needed – R downloads the data on the fly.

3. Via an API using R packages: Many organizations provide data through web APIs, and R packages can retrieve this data directly. Using an API involves sending a query (often through an R function that wraps a web request) and receiving data, usually in JSON or CSV format, which the package then parses into a data frame. We will explore several such packages – WDI, OECD, spiR, statcanR, EpiBibR, coronavirus, as well as packages for IMF data and trade data – and how to use them to get data from their respective sources. This approach requires learning a package’s functions, but it provides powerful capabilities to search and download data programmatically.
Each method has its uses. If you have a one-time static dataset (say an Excel or CSV file from a colleague), method 1 is straightforward. Method 2 is great for quickly grabbing a publicly hosted file without saving it manually. Method 3 shines when you need data that updates regularly or want to automate data gathering from official sources. It also keeps your analysis pipeline reproducible: anyone can run your script to fetch the latest data, without you having to package the data file with your code. In the remainder of this chapter, we’ll dive into method 3, using specific R packages to access a variety of international datasets via their APIs.
3.2 WDI (World Development Indicators)
The WDI package provides access to the World Bank’s World Development Indicators database, which contains over 1,600 time-series development indicators for more than 200 countries. This includes economic indicators (GDP, trade, inflation, etc.), population statistics, education metrics, and many other socio-economic measures collected by the World Bank. Instead of downloading data from the World Bank website, we can use the WDI R package to search for indicators and retrieve country-level data directly through the World Bank API. The World Bank’s data are a go-to source in development economics research (Acemoglu, Johnson & Robinson, 2001), making it convenient to access them in R.
Before using WDI, make sure the package is installed (`install.packages("WDI")`) and then load it with `library(WDI)`. We’ll demonstrate two key functions from the WDI package: one to search for indicator codes, and one to download the data.
WDIsearch()
Often, we might know the name or topic of an indicator (e.g., GDP per capita or life expectancy), but the API requires a specific indicator code. The function `WDIsearch()` helps you find indicator codes by keyword. It takes a text string as input and returns a data frame of indicator names and their corresponding codes that contain that keyword.
For example, suppose we want to find indicators related to GDP:
```r
library(WDI)

# Search all indicators with the term "GDP"
indicators <- WDIsearch("GDP")

# Show the first 5 results
indicators[1:5, ]
```
This will query the World Bank’s indicator list for any indicator whose name or description contains “GDP”. The result (`indicators`) is a data frame where each row is an indicator. It typically includes columns for the indicator name and its code (among others). In our example, the first few results might include entries like GDP (current US$), GDP per capita (current US$), etc., along with their codes. For instance, you might see an indicator named “GDP per capita (current US$)” with a code like NY.GDP.PCAP.CD. Once you identify the exact indicator you need, you’ll use its code in the data retrieval function.
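You can also narrow a broad search in R itself. Here is a minimal sketch (older WDI versions return a character matrix rather than a data frame, so we coerce first; the `name` column is part of the search results described above):

```r
# Coerce to a data frame (older WDI versions return a character matrix)
indicators <- as.data.frame(indicators)

# Keep only the per-capita GDP series
gdp_pc <- indicators[grep("per capita", indicators$name, ignore.case = TRUE), ]
head(gdp_pc)
```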
Try it yourself: Use `WDIsearch()` with a different term to explore other indicators. For instance, try searching for `"population"` or `"life expectancy"`. What indicator codes do you find for those topics?
WDI()
After finding the indicator code of interest, you can use `WDI()` to download the actual data. The `WDI()` function requires at minimum an indicator code and one or more country codes. You can also specify a time range with `start` and `end` years. The function will return a data frame of the requested indicator values for the given countries and years.
Let’s walk through an example. Suppose we want the World Bank data on “Stocks traded, total value (% of GDP)” for a few countries. The indicator code for that series is CM.MKT.TRAD.GD.ZS (we could have found this via `WDIsearch("stocks traded")`, for example). We’ll retrieve this indicator for four countries – France, Canada, the USA, and China – over the period 2000 to 2016:
```r
library(WDI)

# Download % of GDP traded as stocks for FR, CA, US, CN from 2000 to 2016
stock_traded <- WDI(indicator = "CM.MKT.TRAD.GD.ZS",
                    country = c("FR", "CA", "US", "CN"),
                    start = 2000,
                    end = 2016)
head(stock_traded)
```
When you run this, the WDI package contacts the World Bank API and pulls the data for the specified indicator, countries, and years. The resulting data frame `stock_traded` will include columns such as country (typically an ISO 2-letter or 3-letter code and/or country name), year, the indicator value, and possibly additional metadata like country region or indicator name. Calling `head(stock_traded)` shows the first few rows: each row represents one country-year observation (for example, one row might be France in 2000 with the value of stocks traded as % of GDP).
It’s worth noting that if an indicator isn’t available for some years or countries, the returned value will be `NA` for those entries. Also, you can request multiple indicators at once by passing a vector of indicator codes to `WDI()` – the result will have one column per indicator in that case, as in the sketch below.
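A quick illustration of the multi-indicator call, using two indicator codes one could find via `WDIsearch()` (GDP per capita and life expectancy at birth):

```r
# Two indicators in one call; the result has one column per indicator code
multi <- WDI(indicator = c("NY.GDP.PCAP.CD", "SP.DYN.LE00.IN"),
             country = c("FR", "CA"),
             start = 2010,
             end = 2020)
head(multi)
```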
Finally, remember that the data retrieved is as up-to-date as the World Bank’s database. If you run the same WDI query later, you might get more recent years as the World Bank updates its indicators annually or quarterly depending on the series. This makes `WDI()` very powerful for keeping analyses current.
Try it yourself: Find the indicator code for GDP per capita (current US$) using `WDIsearch()`, and then use `WDI()` to download GDP per capita for a few countries of your choice over the last 10 years. Check the resulting data frame to see which columns are included and how the data is structured.
3.3 OECD Data
The OECD R package allows you to search and retrieve data from the OECD (Organisation for Economic Co-operation and Development) databases via their public API. The OECD database includes a wide range of economic and social indicators (nearly 300 datasets across around a dozen categories) for OECD member countries. Examples include labor statistics, economic outlook indicators, productivity and trade figures, education metrics, and more. Unlike the World Bank WDI (which is organized by indicator), the OECD API is organized by datasets. Each dataset contains a collection of related indicators, often multi-dimensional (e.g., broken down by country, year, gender, etc.). OECD data are commonly used to compare policy outcomes across advanced economies in academic research (for example, studies of unemployment or productivity often leverage OECD statistics).
Before using the OECD package, ensure it’s installed (`install.packages("OECD")`) and load it with `library(OECD)`. A typical workflow is: (a) find the dataset you need, (b) inspect its structure (to understand what dimensions and codes it uses), and (c) retrieve data (optionally filtering on dimensions like country or years). We’ll go through these steps with the OECD package’s key functions: `get_datasets()`, `search_dataset()`, `get_data_structure()`, and `get_dataset()`.
First, it’s often useful to see what datasets are available. The function `get_datasets()` fetches a list of all available OECD datasets. For example:
```r
library(OECD)
dataset_list <- get_datasets()
```
Here, `dataset_list` will be a data frame where each row is a dataset available via the OECD API. It typically has columns for the dataset ID and a brief description. This list can be quite long, so you might not want to print it in full. Instead, you can search within it for keywords (see the sketch below), or use the dedicated search function described next.
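A minimal sketch of searching the list yourself with base R (assuming the `title` column that `get_datasets()` returns alongside the dataset ID):

```r
# Keep only datasets whose title mentions "unemployment"
dataset_list[grepl("unemployment", dataset_list$title, ignore.case = TRUE), ]
```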
search_dataset()
The function `search_dataset()` helps you find OECD datasets related to a keyword. You provide a search string (e.g., “unemployment”) and it returns matching dataset IDs and titles. By default, `search_dataset()` will search across all OECD datasets. You can optionally supply the `data` argument to limit the search to a specific list (for instance, the `dataset_list` you retrieved earlier).
For example, to find datasets related to unemployment:
```r
# Search all datasets with "unemployment" in their title or description
search_results <- search_dataset("unemployment", data = dataset_list)
```
This will look through the dataset list for any entry whose name or description contains “unemployment”. The result `search_results` will list those datasets, showing their dataset ID and title. From there, you might identify a dataset of interest – for instance, you might see an entry like “DUR_D – Duration of unemployment” (just as an example). Suppose `DUR_D` is a dataset that contains unemployment duration statistics by country, gender, and age group. We would then proceed to learn about its structure and get data from it.
get_data_structure()
Once you have a dataset ID (for example, `"DUR_D"`), the next step is to retrieve its data structure (metadata) using `get_data_structure(dataset_id)`. This function returns information about the dataset’s dimensions and valid codes. Essentially, it tells us what breakdowns the dataset has (e.g., Country, Time, Gender, Age Group, etc.) and what the allowed values are for each (e.g., country codes, year ranges, category codes, etc.).
For example:
```r
# Get the structure (metadata) of a specific OECD dataset (e.g., "DUR_D")
dstruc <- get_data_structure("DUR_D")
str(dstruc, max.level = 1)
```
Here, `dstruc` will be a list (or another structured object) containing metadata for the dataset DUR_D. We use `str()` to peek at its structure (limiting to one level for brevity). Typically, this metadata includes multiple components – often one component per dimension of the dataset. For instance, `dstruc` might have elements like `$COUNTRY`, `$SEX`, `$AGE`, `$TIME_PERIOD`, each containing the codes and labels available for that dimension. By examining this, you learn how to formulate your data query. For example, you might find that countries are identified by 3-letter ISO codes (USA, CAN, FRA, etc.), genders by codes like M/F or MW (Male, Female, or Male+Female combined), age groups by codes like “2024” (which might stand for the 20–24 age cohort), and so on.
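For example, to see the codes behind one dimension (a minimal sketch; the element name depends on the dataset, here we assume the `$COUNTRY` component described above):

```r
# Inspect the codes and labels for a single dimension (assumed to be $COUNTRY)
head(dstruc$COUNTRY)
```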
Pro tip: Checking the data structure before downloading data is very useful. It prevents guesswork and frustration. You need to know the exact codes to use for filtering. For instance, a dataset might use `"USA"` for United States or it might use `"US"`; it might use `"ALL"` or `"TOT"` for an aggregate category (like both genders combined), or a code like `"MW"` for Male+Female; years might be labeled as `2020` or as a full date `2020-01-01` depending on the dataset. The `get_data_structure()` output will reveal these details.
get_dataset()
After identifying the dataset and understanding its structure, `get_dataset()` is used to download the actual data. You call `get_dataset()` with the dataset ID and optionally provide a `filter` list to narrow down which slices of data you want. If no filter is provided, the function will attempt to retrieve all data in the dataset – which can be extremely large, so it’s generally better to filter to only what you need.

A filter in the OECD API context is essentially a selection for each dimension of the dataset. The filter is provided as a list of values, one element per dimension (in the dataset’s default order of dimensions). For example, suppose DUR_D (our example unemployment duration dataset) has three dimensions: Country, Sex, AgeClass (Time is typically a dimension too, but often you don’t specify time in the filter because you can retrieve all years or specify a time range separately). If we want to retrieve data for specific countries, a specific combined gender category, and a specific age group, our filter list might look like this:
```r
# Specify filters for the Country, Sex, and AgeClass dimensions
filter_list <- list(
  c("DEU", "FRA", "CAN", "USA"),  # Countries: Germany, France, Canada, USA
  "MW",                           # Sex: "MW" might denote Male+Female combined
  "2024"                          # Age group: "2024" for ages 20-24
)

# Retrieve the filtered data for dataset "DUR_D"
unemployment_data <- get_dataset(dataset = "DUR_D", filter = filter_list)
head(unemployment_data)
```
In this example, we filtered the Duration of Unemployment dataset to only include four countries (DEU, FRA, CAN, USA), the combined-gender category (assuming MW stands for Male+Female combined – one would confirm that from the data structure), and the age class 20–24 years (code “2024”). The call to `get_dataset()` then returns a data frame `unemployment_data` with the data matching those criteria. Each row in `unemployment_data` would typically have columns for Country, Sex, AgeClass, Time (Year), and the value of the unemployment duration indicator (plus possibly indicator names or units). The `head()` call displays the first few rows to give us a sense of the output.
When using `get_dataset()`, the order of the filters in the list is critical – it must follow the order of dimensions that the dataset expects. The metadata from `get_data_structure()` tells you this order. In our example, if the order was (Country, Sex, AgeClass, Time), we provided three filters for Country, Sex, and AgeClass, and implicitly we got all available Time (years) because we didn’t include time in the filter. If we had flipped the order of, say, country and sex in the list, the query would be mis-specified and likely return an error or no data. So always align your filter list with the dataset’s dimensions in the given sequence.
Another note: if you call `get_dataset("DUR_D")` with no filters at all, it will try to retrieve the entire dataset. That could be a huge amount of data (potentially thousands of series or millions of rows), which is usually not practical to pull into R. It’s better to constrain the query as shown. If you really do need all data, you might have to retrieve it in chunks or ensure you have enough memory and time.
After retrieving the data, you can treat it like any other R data frame: clean it, merge it with other data, visualize it, etc. The benefit is that you pulled it directly from the authoritative source, and you can update it easily by re-running the code whenever needed.
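For instance, a minimal post-processing sketch. The column names here (`COUNTRY`, `obsValue`) are assumptions about the returned frame, so confirm them with `names(unemployment_data)` first:

```r
# A sketch of typical post-processing; column names are assumptions,
# so check names(unemployment_data) before running
library(dplyr)

unemployment_data %>%
  mutate(obsValue = as.numeric(obsValue)) %>%   # values often arrive as character
  group_by(COUNTRY) %>%                         # one group per country code
  summarise(avg_duration = mean(obsValue, na.rm = TRUE))
```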
Try it yourself: Using the OECD package, search for a dataset related to, say, inflation or GDP. Use `search_dataset("inflation", data = dataset_list)` to find a relevant dataset ID. Then use `get_data_structure()` on that ID to see what dimensions it has. Finally, use `get_dataset()` to retrieve a subset of that data – for example, inflation rates for a few countries over a certain time period. Examine the returned data frame to see how the data is organized.
3.5 statcanR (Statistics Canada Data)
The statcanR package provides a convenient interface to Statistics Canada’s open data API. Statistics Canada (often abbreviated StatCan) offers a vast array of data tables (formerly known as CANSIM tables) covering about 30 different subjects – such as agriculture, energy, education, health, economics, and more – at various geographic levels (country, provinces, cities, etc.). If your analysis involves Canadian data, StatCan’s online portal has a wealth of information, and statcanR helps you pull that data directly into R.
Using statcanR typically involves two steps:
- Find the table ID for the data you want.
- Retrieve the data using that table ID with `statcan_data()`.
There are thousands of data tables, each identified by a unique table number (also called a product ID, or PID). An example of a table ID is something like 27-10-0014-01 – which happens to correspond to a table about federal expenditures on science and technology by socio-economic objective (we’ll use that as a running example). These table IDs often have a format with two digits, then two digits, then four digits, then two digits (as in 27-10-0014-01). The format isn’t important except as an identifier; what matters is finding the correct ID for the data you need.
Finding a StatCan Table ID
To find the table ID for the data you need, you have a couple of options. One is to use the Statistics Canada website’s search or browse features. The StatCan website has a Data search portal where you can enter keywords related to the dataset you want, and it will list matching tables. Once you find the table in the web interface, the site will show its table number (ID). For example, if we wanted “federal expenditures on science and technology by socio-economic objective”, we could search those terms on the site and likely find a matching table with ID 27-10-0014-01.
Alternatively, statcanR provides a function `statcan_search()` that lets you search the database from within R, similar to how we searched the WDI and OECD databases. For example, you can do:
```r
library(statcanR)
statcan_search(c("federal", "expenditures"), "eng")
```
This would search (in English, since we specified `"eng"` for the language) for StatCan tables whose description contains both “federal” and “expenditures”. The result would include our target table among possibly others, showing the table title and the table ID (e.g., “27-10-0014-01”). You can adjust the keywords or switch to French (`lang = "fra"`) if needed to find other tables (sometimes the French descriptions might catch different keywords).
Whichever method you use (website or `statcan_search()`), the key outcome is identifying the exact table ID you need.
statcan_data()
Once you have the table number (product ID) of interest, you can use the `statcan_data()` function to fetch the data via the API. The `statcan_data()` function takes the table ID and an optional language parameter. By default, it returns English labels for data fields, but you can request French by using `lang = "fra"` if desired (for example, if you want column headings and category labels in French).
For example, using the table ID from above (27-10-0014-01):
```r
library(statcanR)

# Download StatCan table 27-10-0014-01 (Federal S&T expenditures) in English
my_data <- statcan_data("27-10-0014-01", lang = "eng")
head(my_data)
```
This will retrieve the dataset identified by `"27-10-0014-01"` – in our example, Federal expenditures on science and technology, by socio-economic objectives. The data comes as a data frame (`my_data`). Each StatCan table has its own structure, but generally the columns will include the dimensions of that table (e.g., Year, Geography, some Category or Classification, and the Value or Measure). In this case, we might expect columns like Year, maybe a breakdown by type of expenditure or objective, and the expenditure amount. The `head(my_data)` output will show the first few rows and the column names. You’ll likely see human-readable labels for categories because the API provides labeled data (which is why specifying `lang` matters – it returns English category names here, such as “Total expenditure” or specific objective names, rather than codes).
One great thing about statcanR is that it handles behind-the-scenes steps for you: downloading the data file (StatCan often provides data as a CSV or XML inside a zip), unzipping it, and reading it into R. The single `statcan_data()` call does all that, so you don’t have to manually navigate the StatCan website or deal with file downloads and imports.
By default, `statcan_data()` gives you the entire table. If it’s a large table, that could be a lot of data (potentially many thousands of rows for very detailed tables). Keep an eye on the size of `my_data` (for instance, use `nrow(my_data)` to see how many rows were returned). If it’s too large for your needs, you might filter it in R after downloading, or (if the API supports it) query a subset by specifying certain dimensions. In many cases, however, StatCan tables are moderate in size, or you only need a subset which you can filter in R after retrieval.
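For example, a minimal filtering sketch. `REF_DATE` is a common date column in StatCan tables, but it is an assumption here – confirm the actual names with `names(my_data)`:

```r
# Keep only observations from 2015 onward; REF_DATE is assumed to be the
# usual StatCan date column -- check names(my_data) to confirm
recent <- subset(my_data, REF_DATE >= "2015")
nrow(recent)
```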
(Advanced note: Some other R packages like {cansim} or {tidycansim} also exist to interface with Statistics Canada data and provide additional helpers (e.g., automatically converting to tibbles, or normalizing date formats). The statcanR package used here works well for direct access, especially for the tables curated in this context, but for very large tables or more complex manipulations, those other packages can be handy.)
Try it yourself: Suppose you want Canadian data on unemployment rates by province. You could go to StatCan’s website and search for “unemployment rate province table” to find a table ID. For example, you might find something like 14-10-0294-02, which corresponds to unemployment rates by province. Once you have the ID, use `statcan_data("14-10-0294-02", lang = "eng")` to download it. Check the first few rows to see what columns it has. Can you identify the province names (or codes), and the time period of each observation? This exercise gives you practice in finding and retrieving a StatCan data table.
3.6 EpiBibR (COVID-19 Literature Bibliography)
EpiBibR is an R package (backed by an API) that provides access to a large bibliographic database of publications related to COVID-19 and other epidemiological research. In response to the global COVID-19 pandemic, the EpiBibR project compiled tens of thousands of references from sources like PubMed and other databases. The dataset started at about 20,000 references early in the pandemic and has since grown enormously – by April 2022 it contained roughly 180,000 references, and it continues to update regularly (with new publications added daily). The name EpiBibR stands for “Epidemiology Bibliography for R.” Essentially, it’s a specialized database of academic papers and reports about COVID-19 (and related health topics) that you can query from R, which is incredibly useful for literature reviews or tracking the evolution of scientific research. The explosion of COVID-19 related publications has been noted as a challenge for researchers to keep up with (Palayew et al., 2020), and tools like EpiBibR can help manage that information overload.
So what’s in this bibliographic database? Each entry (row) is a reference to a publication, with various fields such as title, authors, journal, year, and so on. The table below outlines some of the main fields (tags) in the bibliographic data and their meanings:
| Field Tag | Description |
|---|---|
| AU | Authors (names of the authors) |
| TI | Title (of the article or document) |
| AB | Abstract (summary of the paper) |
| PY | Publication Year |
| DT | Document Type (e.g., Journal Article, Preprint, etc.) |
| MESH | MeSH Terms (Medical Subject Headings, standardized keywords) |
| TC | Times Cited (citation count) |
| SO | Source (publication name, e.g., journal title) |
| J9 | Source Abbreviation (often the 9-character source title) |
| JI | ISO Source Abbreviation |
| ISSN | International Standard Serial Number (journal identifier) |
| VOL | Volume (journal volume, if applicable) |
| ISSUE | Issue Number (if applicable) |
| LT | Language (of the publication, e.g., EN for English) |
| C1 | Author Address (affiliations) |
| RP | Reprint Address (contact address for corresponding author) |
| ID | PubMed ID (if available) |
| DE | Author Keywords (keywords provided by the authors) |
| UT | Unique Article Identifier (an internal or database ID) |
| AU_CO | Author’s Country (country of author affiliation) |
| DB | Database Source (which repository or source the entry came from) |
Many of these fields follow standard bibliographic conventions (similar to formats like BibTeX or Medline tags). For example, AU_CO indicates the country affiliation of an author, MESH terms are the standardized medical subject headings assigned to the article, TC is a citation count, and UT might be a Web of Science unique ID or similar. Not every entry will have all fields (for instance, very new preprints might not have a Times Cited yet, or some records might lack an abstract if it wasn’t available), but this gives an idea of the scope of information available for each reference.
The main function to access this bibliographic database in R is `epibibr_data()`. This function can retrieve references and allows filtering by various criteria via its arguments (such as author name, author’s country, year of publication, keywords in the title, keywords in the abstract, source name, etc.). If called with no arguments at all, `epibibr_data()` will attempt to return the entire bibliographic dataset:
```r
library(EpiBibR)

# Retrieve the entire COVID-19 bibliographic dataset (all references)
all_refs <- epibibr_data()
```
Be cautious with the above command! The complete dataset is very large (hundreds of thousands of entries). Retrieving it in full will take a long time and consume a lot of memory. In practice, you will usually want to filter your query to a specific subset of interest rather than pulling everything.
More typically, you’ll query for a specific subset of references. Here are several ways to use `epibibr_data()` with filters:
- By author: You can find all publications by a given author. For example, `epibibr_data(author = "Colson")` will return all entries where an author’s name contains “Colson” (such as publications by Philippe Colson). The search is case-insensitive and will match partial names, so it could return any paper authored by someone with “Colson” in their name.
- By author and year: You can combine filters. For example, `epibibr_data(author = "Yang", year = "2020")` will return references where an author’s name contains “Yang” and the publication year is 2020. This would retrieve, say, all papers authored by people named Yang that were published in 2020.
- By author’s country: To get references where at least one author’s country of origin (affiliation) is Canada, use `epibibr_data(country = "Canada")`. This filter uses the AU_CO field and will return entries for which an author is associated with Canada. This can be useful if you want to see the contribution of researchers from a certain country or analyze country-specific research output.
- By keywords in title: If you want papers with a certain keyword in the title, use the `title` argument. For instance, `epibibr_data(title = "vaccine")` returns references whose titles contain “vaccine”. (This would catch titles like “COVID-19 vaccine development…”, “…vaccine efficacy…”, etc. The search is not case-sensitive.)
- By keywords in abstract: Similarly, use the `abstract` argument to search within abstracts. For example, `epibibr_data(abstract = "coronavirus")` finds papers that mention “coronavirus” in the abstract text. This is useful for capturing papers that might not have COVID in the title but are still about coronaviruses or COVID-19.
- Combining multiple criteria: You can specify multiple arguments at once to narrow down the search. The filters will be combined (essentially an AND of all conditions). For example, `epibibr_data(author = "Yang", title = "COVID", year = "2020")` would retrieve articles authored by someone named Yang that have “COVID” in the title and were published in 2020. That’s a very specific query, but it illustrates how multiple filters work together. You can even add more filters, such as the source (journal) name. For instance, extending the above example: `epibibr_data(author = "Yang", title = "COVID", year = "2020", source = "Lancet")` would further restrict to those published in sources whose name contains “Lancet” (so it would find papers from The Lancet or Lancet Infectious Diseases, etc.). In that case, you’d get papers by authors named Yang, from 2020, with “COVID” in the title, and published in a journal with “Lancet” in its name.
All these queries use the same `epibibr_data()` function, just with different arguments. This unified interface makes it quite flexible to get exactly the subset of bibliographic data you need. The returned result is a data frame where each row is a reference (a publication), with columns for all those fields (author, title, year, journal, etc.). You can then further analyze or export this information as needed. For example, you might count how many papers match your query or see which journals are most represented.

After retrieving a subset, you can use regular R tools to examine it. For instance, if you want to know how many results you got, you could do `nrow(my_results)`. Or to see the unique journals in that subset, you might do `unique(my_results$SO)` (if `SO` is the Source column). This can help answer questions like “How much literature was published on topic X in year Y?” or “Who are the most prolific authors in this subset of papers?”.
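A small sketch along those lines, reusing the author filter shown earlier and the SO (Source) field from the table above:

```r
# How many papers have "Colson" among the authors, and where were they published?
colson_refs <- epibibr_data(author = "Colson")
nrow(colson_refs)                                     # number of matching references
head(sort(table(colson_refs$SO), decreasing = TRUE))  # most frequent sources (SO field)
```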
Try it yourself: Suppose you want to find how much literature was published in 2021 about vaccines for COVID-19. You could try a query like `epibibr_data(title = "vaccine", year = "2021")`. This would fetch references from 2021 whose titles contain “vaccine”. How many results does it return? Which journals or authors appear frequently in those results? This kind of query gives you a quick overview of the literature on a specific topic.
3.8 IMF Data via R APIs
International economic research often relies on macroeconomic and financial data provided by the International Monetary Fund (IMF). The IMF maintains several extensive databases, such as the World Economic Outlook (WEO), International Financial Statistics (IFS), Balance of Payments Statistics, Government Finance Statistics, and more. These contain historical time series on GDP, inflation, government budgets, trade balances, exchange rates, and many other economic indicators across most countries. The IMF makes these data available through an API, and there are R packages to access them, notably imfr (by Christopher Gandrud) and the newer imf.data package. By using these packages, one can directly query IMF databases for specific indicators and countries, ensuring that analyses use the most current figures released by the IMF. The IMF’s datasets are central in international macroeconomic research and policy analysis – for example, studies of fiscal policy often use IMF data on growth and government spending forecasts (Blanchard & Leigh, 2013).
We will illustrate using the imfr package, as it provides a user-friendly interface to the IMF API. Make sure to install it (`install.packages("imfr")`) and load it with `library(imfr)`.
Listing Available IMF Databases
The IMF API has many separate databases. Each is identified by a short code. To see a list of available databases, you can use:
```r
library(imfr)

# List the IMF databases available through the API
databases <- imf_ids()
head(databases)
```
The result (`databases`) will show database IDs and names. For example, you might see entries like: `IFS` (International Financial Statistics), `WEO` (World Economic Outlook), `BOP` (Balance of Payments), `GFS` (Government Finance Statistics), `DOT` (Direction of Trade Statistics), etc. Each of these has different indicators and coverage. Suppose we’re interested in the World Economic Outlook (WEO) database, which contains broad macroeconomic indicators along with IMF forecasts, or the IFS database for more high-frequency financial data.
Querying a Specific Series with imf_data()
Once we know the database, we need to know the indicator codes and country codes. Each database has its own set of series codes (similar in concept to the indicator codes in WDI or dataset codes in OECD). For example, in the WEO database, an indicator code for nominal GDP might be `"NGDPD"` (nominal GDP in USD), for real GDP growth `"NGDP_RPCH"`, and for inflation `"PCPIPCH"` (percent change in consumer prices). Country codes are often IMF-specific or ISO country codes. The WEO and some other databases use ISO 3-letter country codes (USA, CHN, FRA, etc.), but some IMF databases might use numeric codes or their own abbreviations.

The `imf_data()` function in imfr is used to query data. It requires a database ID, an indicator (or series) code, one or more countries, and a date range (start year, and optionally end year). Let’s say we want to retrieve nominal GDP in USD for a couple of countries from the WEO database:
```r
library(imfr)

# Get nominal GDP (code "NGDPD") for USA, China, and Germany
# from 2010 to 2020, using the WEO database
gdp_data <- imf_data(database_id = "WEO",
                     indicator = "NGDPD",
                     country = c("USA", "CHN", "DEU"),
                     start = 2010,
                     end = 2020)
head(gdp_data)
```
In this example, `database_id = "WEO"` specifies we want the World Economic Outlook database. The `indicator = "NGDPD"` specifies nominal GDP in current USD (we determined that code from WEO documentation or searching within the WEO dataset). We requested `country = c("USA", "CHN", "DEU")`, which are the ISO codes for the United States, China, and Germany. The start and end years define the period 2010–2020. The result `gdp_data` should be a data frame containing the GDP values for those countries and years.
If we inspect `head(gdp_data)`, we might see columns like `WEO.Country.Code`, `ISO`, `Country`, `WEO.Subject.Code`, `Subject Descriptor`, `Units`, `Scale`, `Year`, and `Value` – or a subset of those. The structure depends on the database; WEO often returns rich metadata along with the values. For easier use, one can clean or select relevant columns (for instance, selecting ISO, Year, and Value).
Another example: the IFS database (International Financial Statistics) has more granular series. If we wanted to get, say, the Consumer Price Index (CPI) for a country (which is often monthly or quarterly in IFS), we would identify the CPI series code and the country code in IFS and query similarly with `imf_data(database_id = "IFS", ...)`. The imfr package includes helper functions like `imf_codelist()` to explore available series within a database if needed, as sketched below.
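A quick sketch (assuming the same imfr version whose `imf_ids()` and `imf_data()` functions are used above):

```r
# List the code lists (dimensions) that the IFS database defines
ifs_codelist <- imf_codelist(database_id = "IFS")
head(ifs_codelist)
```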
The IMF API might require an API key for large queries, but for many basic queries, it works without one or with a shared guest key. If you plan to use it extensively, you can register for a free API key on the IMF data portal and provide it to the R package (imfr can be told the key via an argument or environment variable).
The power of using the IMF API through R is that you can programmatically gather macroeconomic data for many countries and series in one go. This ensures consistency (you’re getting all data from the same source and update cycle) and reproducibility (anyone can run your R code to fetch the latest data). In research on international economics, it’s common to use IMF data for analyzing global economic trends, exchange rates, financial crises, etc., so being able to pull this data on demand is highly valuable. The IMF’s data is often updated periodically (WEO is updated biannually, IFS monthly or quarterly depending on series, etc.), so an analysis can be updated by simply re-running the code to get new values when they are released.
Try it yourself: Use `imf_data()` to fetch an economic indicator of your choice. For example, retrieve the inflation rate or unemployment rate for a set of countries. First, identify which IMF database might have that series (WEO contains many macro indicators; IFS contains financial and price data, etc.). Then find the series code (the documentation or an online search can help; e.g., WEO’s consumer price inflation percent change is often `"PCPIPCH"`). Then call `imf_data()` with that indicator and a few country codes. Check the returned data frame to see if it matches your expectations (units, years, etc.). This will give you practice in working with the IMF’s data API.
3.9 International Trade Data via APIs (UN Comtrade & Open Trade Statistics)
For analyses focused on international trade, we often need detailed data on trade flows between countries (exports and imports, by product categories, over time). One of the primary sources for such data is the United Nations Comtrade database, which is a repository of official international trade statistics reported by countries. It contains bilateral trade flows (e.g., country A’s exports of product X to country B) typically classified by product codes (such as Harmonized System codes) and available annually (and for some data, monthly). Rather than manually downloading trade data from the Comtrade website, we can use R to query the Comtrade API. There is an R package comtradr (by rOpenSci) that serves as a wrapper for the UN Comtrade API, making it easier to pull trade data into R. Trade flow data are crucial in international economics research, underpinning studies from gravity models of trade (Anderson & van Wincoop, 2003) to analyses of global value chains. Having direct access to these data via API allows researchers to quickly get the specific trade statistics they need for analysis.
Before using comtradr, install it (`install.packages("comtradr")`) and load it with `library(comtradr)`. The Comtrade API has some complexity in terms of parameters, but comtradr’s main function `ct_get_data()` simplifies many of these by providing sensible defaults and checks.
Basics of UN Comtrade Data and API Parameters
When querying Comtrade, you typically specify: type of trade (goods or services), frequency (annual or monthly), a commodity classification (e.g., HS – Harmonized System, or SITC, etc.), the commodity code (or “all” commodities), flow direction (export, import, re-export, etc.), the reporter country, the partner country, and the time period (years or months). Additional parameters can include mode of transport or customs regime, but those are more advanced.
The `ct_get_data()` function in comtradr allows you to specify these. It has default values that, if not changed, will query a broad set of data (which might be large). For example, by default, `commodity_code = "TOTAL"` (which means the total trade for all commodities combined) and `partner = "World"` (which means trade with all partners aggregated). These defaults effectively give total exports or imports of a country if you specify a reporter and flow.
Important: The Comtrade API has usage limits (number of calls per second and per hour) and may require an API key for extensive use. You can sign up on their website for a free API key, which increases your allowance. The comtradr package has a function `set_primary_comtrade_key("YOURKEY")` to store your key. For small queries, you may not need a key at all (the API allows some calls without a key, albeit with stricter throttling).
Example: Querying Trade Data with ct_get_data()
Let’s walk through an example. Suppose we want to get the annual exports of goods for a few major exporters (say China, USA, Germany) to the world, for recent years. Essentially, this means we want, for each of those countries, the value of exports of all goods to all partners (which is the total exports) for each year in a range.
In Comtrade terms:
- `type = "goods"`
- `frequency = "A"` (annual data)
- `commodity_classification = "HS"` (the Harmonized System, a common product classification; this is the default)
- `commodity_code = "TOTAL"` (the code for all commodities combined)
- `flow_direction = "Export"`
- `reporter`: our list of countries, by ISO3 code (CHN, USA, DEU)
- `partner = "World"` (meaning aggregate trade with the world)
- `start_date` and `end_date` for the years of interest
Here’s how we do it:
```r
library(comtradr)

# Get total exports for China, USA, Germany (reporters) to World, from 2015 to 2020
trade_data <- ct_get_data(
  type = "goods",
  frequency = "A",
  commodity_classification = "HS",
  commodity_code = "TOTAL",
  flow_direction = "Export",
  reporter = c("CHN", "USA", "DEU"),
  partner = "World",
  start_date = "2015",
  end_date = "2020"
)
head(trade_data)
```
When this query runs, comtradr will retrieve data from the Comtrade API. The result `trade_data` should contain rows corresponding to each country-year’s export value. If we inspect `head(trade_data)`, we might see columns such as Reporter (country name), ReporterISO (ISO code), Partner (likely “World” for all rows, since we chose the aggregate partner), Year, TradeValue (the trade value, typically in USD), and perhaps additional metadata like the classification or unit. Each row would be, for example, China – World – 2015 – (export value), USA – World – 2015 – (export value), and so on, through 2020.
This gives us the total export values. If we wanted imports instead, we’d set `flow_direction = "Import"`. If we wanted data for a specific trading partner (e.g., China’s exports to the USA), we could specify `partner = "USA"` instead of World, and that would give bilateral trade rather than global totals.
We can also query more granular commodity data. For example, if we wanted to know the trade in a specific product category, we could set `commodity_code` to an HS code (or multiple codes). For instance, HS code 8703 is “Motor cars and other motor vehicles for passenger transport”. We could query exports of cars from, say, Japan to the world by specifying `commodity_code = "8703"`, `reporter = "JPN"`, `partner = "World"`, and so on. The result would then be the yearly export value of cars from Japan.
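As a sketch, the car-exports query just described would look like this, mirroring the parameters of the earlier call:

```r
# Japan's exports of cars (HS code 8703) to the world, 2015-2020
japan_cars <- ct_get_data(
  type = "goods",
  frequency = "A",
  commodity_classification = "HS",
  commodity_code = "8703",
  flow_direction = "Export",
  reporter = "JPN",
  partner = "World",
  start_date = "2015",
  end_date = "2020"
)
head(japan_cars)
```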
The comtradr package also provides some functions to look up reference data, like `ct_country_lookup("keyword")` to find country codes or `ct_commodity_lookup("car")` to find commodity codes by keyword. However, these functions might not cover all classification versions, so sometimes it’s necessary to know or look up the exact code externally.
Open Trade Statistics (OTS): An alternative way to access trade data is via the Open Trade Statistics API and its R package tradestatistics (Vargas, 2020). OTS is an initiative to simplify access to cleaned and aggregated trade data (it builds on Comtrade data but provides a simpler interface for common queries). The `tradestatistics` package allows queries for trade flows without dealing with some of the complexities of the raw Comtrade API. For example, it can retrieve a country’s export or import totals or product-level trade for given years in a tidy format. If you find the Comtrade API too granular or complicated for quick use, `tradestatistics` can be a friendlier option for many purposes.
Here’s a quick example using `tradestatistics` (assuming you have it installed; its main query function is `ots_create_tidy_data()`):

```r
library(tradestatistics)

# Get yearly trade totals for China, USA, Germany from 2015 to 2020 via OTS
# (table "yr" returns reporter-year totals, with partners aggregated)
ots_data <- ots_create_tidy_data(
  years = 2015:2020,
  reporters = c("chn", "usa", "deu"),
  table = "yr"
)
head(ots_data)
```
The result `ots_data` might have columns like year, reporter_iso, export_value, and import_value for each reporter-year requested. OTS data come already cleaned and aggregated by default.
Whether you use the direct Comtrade API via comtradr or the OTS via tradestatistics, you are leveraging APIs to get exactly the trade data needed. This avoids manually downloading massive CSV files of trade data and then filtering them. Instead, the filtering happens on the server side (at the API), and you get a manageable chunk of data in R to work with.
In international trade research, working with bilateral trade data is common. For example, the gravity model of trade – a fundamental model in trade economics – analyzes bilateral trade flows between countries and relates them to factors like GDP, distance, and trade barriers (Head & Mayer, 2014). These analyses require assembling trade flow data (often from Comtrade or similar) along with other country variables. Being able to programmatically pull, say, all pairs of trade flows for a given year or all years greatly eases the assembly of such datasets.
Try it yourself: Let’s say you want to examine how a country’s exports of a particular commodity have grown over time. Pick a country and a product category you’re interested in (for example, Brazil and soybeans, or Germany and automobiles). Find the HS code for that product (you might search for an HS code list online). Then use `ct_get_data()` in comtradr to query annual export values for that country (`reporter`) to the world (`partner = "World"`) for that HS code over a range of years. Inspect the trend of the export values – do you see growth over time? This exercise gives you practice in pulling a specific slice of trade data.
Each of the packages and APIs above demonstrates a way to access external data sources through R. By using APIs and specialized R packages, you can automate data retrieval for up-to-date information without manual downloading, which is crucial for keeping analysis pipelines current. In this chapter, we introduced how to search for datasets or indicators and fetch data from the World Bank (WDI), OECD, Social Progress Imperative (spiR), Statistics Canada (statcanR), a COVID-19 literature database (EpiBibR), COVID-19 case data (coronavirus package), as well as data from the IMF and international trade data from UN Comtrade (via comtradr) and Open Trade Statistics. These tools greatly expand your ability to gather data for analysis directly within R, making your data science workflow more efficient and reproducible.
Whenever you start a new analysis, think about whether an API could provide the data you need. If so, investing a bit of time in learning the R package for that API will pay off in the long run: your code will directly pull the latest data and anyone else running it will be able to get the same data without hunting down files. This ensures transparency (everyone knows exactly what data was used and how it was obtained) and reproducibility (the analysis can be re-run at any time, even as data updates). In an international business or economics context – where data can change rapidly, and where one often needs to combine information from multiple countries or sources – these skills will enable you to keep your analyses current, pedagogically clear, and robust for both you and any collaborators or readers of your work.
References
- Acemoglu, D., Johnson, S., & Robinson, J. (2001). The colonial origins of comparative development: An empirical investigation. American Economic Review, 91(5), 1369–1401.
- Anderson, J. E., & van Wincoop, E. (2003). Gravity with gravitas: A solution to the border puzzle. American Economic Review, 93(1), 170–192.
- Blanchard, O., & Leigh, D. (2013). Growth forecast errors and fiscal multipliers. American Economic Review (Papers & Proceedings), 103(3), 117–120.
- Dong, E., Du, H., & Gardner, L. (2020). An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases, 20(5), 533–534.
- Head, K., & Mayer, T. (2014). Gravity equations: Workhorse, toolkit, and cookbook. In G. Gopinath, E. Helpman, & K. Rogoff (Eds.), Handbook of International Economics (Vol. 4, pp. 131–195). Elsevier.
- Jones, C. I., & Klenow, P. J. (2016). Beyond GDP? Welfare across countries and time. American Economic Review, 106(9), 2426–2457.
- Palayew, A., Norgaard, O., Safreed-Harmon, K., Andersen, T. H., & Rasmussen, L. N. (2020). Pandemic publishing poses a new COVID-19 challenge. Nature Human Behaviour, 4(7), 666–669.
- Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226–1227.
- Stiglitz, J. E., Sen, A., & Fitoussi, J. P. (2009). Report by the Commission on the Measurement of Economic Performance and Social Progress. (Commission report).