3 Programmatic Data Acquisition for International Business Research using R in Positron
International business (IB) research increasingly relies on large, multi-country datasets covering economic, financial, and institutional variables. Questions on foreign direct investment (FDI), multinational enterprise (MNE) performance, international trade, and comparative institutional environments all demand combining data from authoritative sources across many countries and years. Traditionally, researchers might manually download spreadsheets from the World Bank or OECD websites and copy them into analysis files, a process that is time-consuming and error-prone. In recent years, however, there has been a strong push toward programmatic data acquisition – using code to retrieve data directly from online databases and APIs – to improve research efficiency, reproducibility, and transparency. This chapter provides a comprehensive guide to programmatic data retrieval in R, using the new Positron IDE (the next-generation successor to RStudio) and Quarto for reproducible reporting. We will illustrate techniques for accessing data from major international sources (World Bank, OECD, IMF, UNCTAD, Statistics Canada, etc.), integrate empirical examples inspired by studies in Journal of International Business Studies (JIBS) and Journal of World Business (JWB), and discuss best practices for ensuring reproducibility and data transparency (such as using Git/GitHub for version control, structuring project folders, tracking data provenance, and integrating Zotero references in Quarto documents).
The motivation for embracing programmatic, transparent data workflows in IB research is both practical and normative. On the practical side, IB scholars often deal with dynamic, multi-period data (e.g. yearly FDI inflows, quarterly trade volumes, annual governance indices) that get updated regularly. Writing code to pull the latest data on demand means analyses can be easily refreshed when new data are released. It also allows customizing data queries to retrieve only the needed indicators, countries, and time ranges, saving time and reducing errors. On the normative side, science is facing a reproducibility and replicability crisis, and IB is “not immune”. Studies have shown that irreproducible research results undermine the credibility and usefulness of our findings. Top-tier journals now emphasize data transparency: for example, JWB encourages data availability statements, and the American Economic Association requires accepted papers to provide complete data and code for replication. By programmatically acquiring data and integrating the retrieval code into our analysis (e.g. in a Quarto document), we document precisely how data were obtained and processed, allowing others (and our future selves) to replicate or audit the process. In sum, mastering programmatic data acquisition is becoming an essential skill for IB researchers aiming to produce robust, reproducible studies.
In this chapter, we will first outline the general benefits and tools for programmatic data access in R (Section 3.1). We then provide detailed guidance for retrieving data from several authoritative international databases: the World Bank’s World Development Indicators (WDI) and related datasets, the OECD’s data API, the International Monetary Fund’s databases, the UN Conference on Trade and Development (UNCTAD) data sources, and a national statistics example (Statistics Canada) (Section 3.2). Throughout, code examples in R (within Positron) will demonstrate how to connect to APIs or use specialized R packages to query and import data. In Section 3.3, we integrate multiple sources in an empirical example replicating a style of analysis common in JIBS/JWB – examining how host-country institutional quality influences FDI inflows – using real-world data pulled programmatically. Finally, Section 3.4 discusses best practices for reproducibility and transparency: we cover how to organize research projects (folders and file naming), track changes with Git and GitHub, record data provenance and metadata, and use Quarto with reference management (e.g. Zotero) to produce publication-quality output with properly cited sources. By the end of the chapter, readers should be able to set up an IB data analysis project in Positron where raw data are fetched in code from trusted sources, analyses are fully reproducible, and results can be transparently shared.
3.1 Benefits of Programmatic Data Acquisition and Tools in R
Programmatic data acquisition means using scripts or code to directly retrieve data from online sources (via web APIs, data portals, or other programmatic interfaces) rather than manually downloading files. Embracing this approach yields several key benefits for international business research:
- Customization: The researcher can tailor queries to exactly the data needed – specific countries, indicators, years, etc. – instead of downloading large general files. This is especially useful in IB where one might need a custom panel (e.g. GDP, trade, and institutional scores for a particular set of emerging economies).
- Efficiency: Automating data retrieval saves time and reduces human error. Once the code is written, it can pull updated data or new subsets in seconds, avoiding repetitive manual work.
- Reproducibility: Perhaps most importantly, programmatic access ensures that anyone can rerun the code to get the same data. The data extraction steps are documented in the script, making the research reproducible by others. This addresses reproducibility concerns raised in IB’s methodological debates.
- Automation and Integration: Data retrieval can be integrated into a larger workflow (data cleaning, analysis, visualization) that can be executed end-to-end with a single command. This enables fully automated updates and dynamic reports (e.g. a Quarto document that always uses the latest available data).
In the R language, a rich ecosystem of packages and base functions exists to facilitate programmatic data access. At a low level, R can download files from the web via functions like `download.file()` or read data directly from URLs (e.g. `read.csv("http://...")`). For APIs that return JSON or XML, R provides packages like `httr` (for HTTP requests) and `jsonlite` (for parsing JSON) to fetch and handle data. These generic tools allow access to virtually any open API on the web. For example, a researcher could use `httr::GET()` to call an API endpoint and then parse the JSON content into an R data frame with `jsonlite::fromJSON()`.
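To make this concrete, here is a minimal sketch of that generic pattern, using the World Bank's public v2 API as the endpoint (the URL structure follows the World Bank's documented API; the object names are ours):

```r
library(httr)
library(jsonlite)

# Query GDP per capita for the United States, 2015-2020
resp <- GET(
  "https://api.worldbank.org/v2/country/USA/indicator/NY.GDP.PCAP.KD",
  query = list(format = "json", date = "2015:2020")
)
stop_for_status(resp)  # fail loudly on any HTTP error

# The World Bank v2 API returns a two-element JSON array:
# element 1 is pagination metadata, element 2 holds the observations
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
gdp_usa <- parsed[[2]][, c("date", "value")]
head(gdp_usa)
```

Dedicated packages such as `WDI` wrap exactly this kind of request, which is why they are usually preferable when one exists.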
More conveniently, many major data providers have dedicated R packages that wrap their APIs and provide user-friendly functions. We will encounter several in this chapter: `WDI` (World Bank Development Indicators), `OECD` (for OECD data), `imf.data`/`imfr` (IMF databases), and `cansim` (Statistics Canada), among others. These packages abstract away the technical details of API calls. For instance, with the `WDI` package, one can simply specify the indicator codes, country codes, and date range, and the package handles constructing the URL query, downloading the data, and formatting it as a neat R data frame.
It is worth noting that programmatic data access sometimes requires dealing with API keys or registration (some providers issue free API keys for tracking usage). Most of the sources we discuss (World Bank, OECD, IMF, etc.) currently provide open access without API keys, though they may have request rate limits. Always consult the source’s API documentation for any access requirements or limitations. R package maintainers often include these details in their package vignettes or README.
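For sources that do require a key, a good habit is to keep the key out of your scripts entirely and read it from an environment variable. A brief sketch (the variable name `MY_API_KEY` is illustrative; set it once in your `~/.Renviron` file):

```r
# Read an API key from the environment rather than hard-coding it in a script
api_key <- Sys.getenv("MY_API_KEY")   # hypothetical variable name
if (identical(api_key, "")) {
  stop("Please set MY_API_KEY in your ~/.Renviron file")
}
```

This keeps credentials out of version control while leaving the workflow fully scripted.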
Finally, a few words on the computing environment: we will use Positron, Posit’s next-generation data science IDE. Positron is essentially a fusion of RStudio and Visual Studio Code – it supports R and Python out of the box in a modern interface. In Positron, you can run R scripts or Quarto notebooks seamlessly. For example, you can open a Quarto document in Positron and execute R code chunks interactively, seeing results in-line, just as you would in RStudio. Positron’s multi-language support means you could even mix R and Python if needed (useful if, say, a Python library has a particular data-access feature). We focus on R here, but it’s reassuring that Positron provides a unified environment for whatever tools are needed. Using Positron, you can also take advantage of built-in Git integration (for version control) and the Visual Studio Code extension ecosystem – for instance, there’s an extension to connect Zotero for citation insertion directly within Quarto documents. The combination of R, Positron, and Quarto thus offers a powerful platform for reproducible research workflows: data can be pulled via code, analyzed, and the entire analysis (including code, results, and references) can be rendered to an output format (HTML, PDF, etc.) in one click. We will leverage this environment throughout the chapter.
3.2 Data Sources for International Business Research: Retrieval Techniques
International business scholars draw on a variety of cross-country data sources. In this section, we survey several key providers: the World Bank, OECD, IMF, UNCTAD, and a national statistics agency (Statistics Canada). For each, we demonstrate how to retrieve data in R, highlighting relevant packages or APIs. These sources collectively cover a broad range of indicators commonly used in IB research – from macroeconomic indicators (GDP, trade, inflation) to institutional and regulatory metrics (governance indices, ease of doing business, investment restrictions) to firm-related aggregates (e.g. number of MNE subsidiaries, which might be available via national statistical agencies or UNCTAD reports). The examples given can be adapted to other countries, years, or similar datasets from these organizations.
3.2.1 World Bank Data (WDI and other datasets)
The World Bank’s databases are a staple in international business and development research. The World Development Indicators (WDI) in particular is a comprehensive collection of development and macroeconomic indicators for virtually all countries, spanning topics such as economic output, trade, investment, education, health, infrastructure, governance, and more. Many IB studies use WDI for control variables or main variables – for example, GDP per capita to proxy market size, or net FDI inflows as a percentage of GDP to measure investment exposure. The World Bank also hosts other datasets relevant to IB, like the Worldwide Governance Indicators (WGI) (which cover institutional quality dimensions such as Government Effectiveness, Rule of Law, Control of Corruption, etc.) and previously the Doing Business indicators (ease of doing business scores). All these are accessible through a unified API. The R package `WDI` (maintained by Vincent Arel-Bundock) provides a convenient interface to search and download data from over 40 World Bank databases, including WDI, International Debt Statistics, Doing Business, Subnational statistics, and more.
Using the `WDI` package is straightforward. First, ensure the package is installed (`install.packages("WDI")`) and loaded. The core function is `WDI()`, which takes arguments for country, indicator, start and end years, etc. The World Bank indicators are referenced by specific codes. You can search for indicators by keywords with `WDIsearch()`. For example, to find the code for “GDP per capita,” you could run `WDIsearch("GDP per capita")` – one of the returned results would be NY.GDP.PCAP.KD (GDP per capita, constant dollars). Similarly, searching “FDI” might return BX.KLT.DINV.WD.GD.ZS (FDI net inflows as % of GDP). For demonstration, let’s retrieve two indicators from WDI: (1) GDP per capita (constant 2015 USD) and (2) FDI net inflows (% of GDP), for a sample of countries over a span of years. We’ll get data for four countries – the United States, China, Brazil, and South Africa – from 2000 to 2020:
```r
# Load the WDI package
library(WDI)

# Specify indicator codes and countries
indicators <- c(
  "NY.GDP.PCAP.KD",       # GDP per capita, constant USD
  "BX.KLT.DINV.WD.GD.ZS"  # FDI net inflows, % of GDP
)
countries <- c("USA", "CHN", "BRA", "ZAF")  # ISO-3 country codes

# Retrieve data from 2000 to 2020
wdi_data <- WDI(country = countries, indicator = indicators, start = 2000, end = 2020)
head(wdi_data)
```
When this code runs, the `WDI()` function sends a query to the World Bank API for the specified countries, indicators, and years, and returns a data frame (`wdi_data`). Inspecting `head(wdi_data)` shows the columns: `country` (country name), `iso2c` (ISO 2-letter code), `year`, and one column for each requested indicator (with names like NY.GDP.PCAP.KD and BX.KLT.DINV.WD.GD.ZS). The data frame is in a “long” format: each row is a country-year observation. For example, a row might show `country = "China", year = 2000, NY.GDP.PCAP.KD = …, BX.KLT.DINV.WD.GD.ZS = …`. We can easily reshape or filter this as needed (see the sketch below). The `country` argument in `WDI()` can also take `"all"` to retrieve all countries, but be cautious: combined with many indicators and years, that will fetch a very large dataset.
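As a quick illustration of such post-processing, here is a sketch using `dplyr` and `tidyr` to filter the result to one country and pivot the indicator columns into a long key-value layout (the column names follow the `WDI()` output described above):

```r
library(dplyr)
library(tidyr)

wdi_data %>%
  filter(iso2c == "CN") %>%  # keep China only
  pivot_longer(
    cols      = c("NY.GDP.PCAP.KD", "BX.KLT.DINV.WD.GD.ZS"),
    names_to  = "indicator",
    values_to = "value"
  ) %>%
  head()
```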
A few additional tips when using the World Bank data:
- The WDI package by default uses the latest available data. If an indicator has missing years for some countries, those simply won’t appear in the data frame. You may need to check data availability (the World Bank API returns NA for years with no data).
- If you set `extra = TRUE` in `WDI()`, you get additional metadata columns like region, income level, and capital city for each country. This can be useful for filtering (e.g. focusing on only low-income countries); see the sketch after this list.
- The World Bank API does not require an API key and has generous usage limits. But if you are pulling very large amounts of data in loops, you might consider spacing requests to avoid hitting any transient rate limits.
- The `WDI` package caches indicator metadata the first time you use it in a session, which makes subsequent searches faster. If the list of indicators updates (the World Bank occasionally adds series or changes definitions), you can call `WDIcache()` to refresh the cache.
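The following sketch combines indicator search and the `extra = TRUE` metadata (the `income` column name matches the package's documented output; the indicator code is the FDI series used earlier):

```r
library(WDI)

WDIsearch("FDI net inflows")  # browse matching indicator codes

fdi <- WDI(
  country   = "all",
  indicator = "BX.KLT.DINV.WD.GD.ZS",
  start     = 2015, end = 2020,
  extra     = TRUE              # adds region, income level, etc.
)
low_income <- subset(fdi, income == "Low income")
head(low_income)
```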
For other World Bank datasets beyond WDI: you can often access them by indicator code as well, if you know the code. For example, the Worldwide Governance Indicators have codes like GE.EST for Government Effectiveness (estimate). You could retrieve those via the WDI package similarly (the WDI package actually includes WGI under the hood, since WGI is another database in the API). Alternatively, the World Bank provides separate downloads for WGI and others, but using the API via `WDI()` or the alternative `wbstats` package is typically easier and ensures you have the latest data.
Practical example in IB research: Suppose you are interested in the relationship between institutional quality and FDI inflows, a topic explored by many studies (e.g., Globerman & Shapiro, 2002 showed that countries with stronger governance attract more FDI). You could use World Bank data to get an indicator of institutional quality, such as Government Effectiveness, alongside FDI inflow data. Using `WDI()`, one could fetch GE.EST (Government Effectiveness index) and BX.KLT.DINV.WD.GD.ZS (FDI % of GDP) for a set of countries. Those data can then be merged and analyzed (we will do exactly this in Section 3.3). The ease with which WDI delivers these series to your R session – in just a few lines of code – illustrates why programmatic access is so powerful. It not only saves time, but it documents exactly which version of the data was used (since the API will always give the most up-to-date values, or a specified historical version if using the v2 API with date queries).
3.2.2 OECD Data via API and the OECD R Package
The Organisation for Economic Co-operation and Development (OECD) hosts a wealth of economic data, particularly for its member countries (mostly high-income economies) and some partner countries. The OECD datasets cover areas highly relevant to IB: for example, the OECD International Direct Investment statistics (FDI inflows/outflows, bilateral FDI positions), trade in value-added, R&D expenditures, multinational enterprises’ activities, productivity, employment, and various indices (like the OECD’s Product Market Regulation index or FDI Regulatory Restrictiveness Index). Many IB researchers turn to OECD data for deeper or higher-frequency measures that complement broader sources like the World Bank.
The OECD provides a unified RESTful API for accessing all its datasets, and it adheres to the SDMX standard (a standard for exchanging statistical data). In practical terms, each dataset in the OECD repository has an identifier and predefined dimensions (such as country, variable, time). One can query the API by constructing a URL that specifies the dataset and desired filters (e.g. which countries, which variables, which years). For example, an API query might look like:
https://sdmx.oecd.org/public/rest/data/<<DatasetID>>/<<Filters>>?startPeriod=2010&endPeriod=2020&format=csv
which would return a CSV of the requested data. The OECD’s online Data Explorer tool actually allows you to interactively select data and then gives you the corresponding API link – a very helpful way to build queries without guesswork.
In R, the `OECD` package (developed by DataLab and available on GitHub and CRAN) simplifies using this API. It has two primary functions: `get_data_structure()` to get metadata about a dataset (like available countries or indicator codes), and `get_dataset()` to retrieve the actual data. As of 2024, due to some updates in the OECD API, users are advised to use the development version of the `OECD` package from GitHub (which is more up-to-date). Assuming the package is installed and loaded, here’s how one could use it:
```r
library(OECD)

# Example: retrieve OECD data (illustrative pseudo-code)
# Suppose we want the OECD FDI Regulatory Restrictiveness Index (dataset "FDIINDEX")
dataset_id <- "FDIINDEX"   # example dataset code for illustration

# First inspect the data structure to see dimensions (countries, years, etc.)
ds_structure <- get_data_structure(dataset_id)
str(ds_structure$VAR_DESC)  # lists the dimensions and their codes

# Now retrieve a subset: e.g., restrictiveness index for USA and CHN, 2010-2020
data <- get_dataset(
  dataset    = dataset_id,
  filter     = list(Country = c("USA", "CHN")),  # assuming "Country" is a dimension
  start_time = 2010,
  end_time   = 2020
)
head(data)
```
(Note: The above is a simplified illustration. The actual dataset ID and filter names need to match OECD’s definitions. For instance, the OECD’s FDI Regulatory Restrictiveness Index dataset has a specific code, and the country dimension might be called something like `COUNTRY` or use specific codes for each country. One would inspect `ds_structure` to find the right codes.)
The result `data` would contain the requested panel of data. The `get_dataset()` function returns a data frame where each column is a dimension or measure in the dataset. In this FDI restrictiveness example, likely dimensions include country, year, sector (if applicable), and the value of the index.
An alternative way without the specialized R package is to directly use the API via a URL and R’s reading functions. Since the OECD API can return CSV, one can do something like:
<- "https://sdmx.oecd.org/public/rest/data/<<dataset query string>>&format=csv"
url <- read.csv(url) oecd_df
In fact, the OECD Data Explorer FAQ provides a ready example of using `read.csv()` on an API URL to fetch data into R. This method is useful if you already have the exact query (perhaps prepared via the OECD website) or if you don’t want to rely on an extra R package. However, for interactive use and exploring datasets, the `OECD` R package is extremely helpful – you can search for datasets and use `get_data_structure()` to navigate what’s available in a dataset (which can otherwise be a bit opaque).
Use cases in IB research: The OECD’s data are often used for more granular analysis of developed economies or to obtain specific indicators not in WDI. For example, the OECD FDI Regulatory Restrictiveness Index (mentioned above) quantifies how open or closed a country’s FDI regulations are, on a 0 (open) to 1 (closed) scale. A researcher examining how policy barriers affect inbound investment could retrieve this index for a set of countries and years, and then relate it to FDI inflow data. Another example is the OECD AFA database (Activity of Foreign Affiliates), which provides statistics on foreign-owned firms within economies – e.g., the number of employees in foreign affiliates by sector. Such detailed data can help study MNE performance and impact. By using R to programmatically get these series, one ensures that the exact definitions and latest updates from OECD are used (and it’s easy to update the analysis when the OECD releases new data each year).
In summary, the OECD’s API, coupled with the R tools, opens up a trove of high-quality data for IB scholars. It does require learning the specific dataset codes and structure, but once mastered, it greatly streamlines comparative research on topics like trade, investment, and economic policy among advanced and emerging economies.
3.2.3 International Monetary Fund (IMF) Data
The International Monetary Fund (IMF) is another critical source for international data, particularly in the realms of macroeconomics and financial statistics. The IMF maintains several databases of interest:
- International Financial Statistics (IFS): A broad database covering exchange rates, interest rates, monetary aggregates, national accounts, prices, etc., for almost all countries – often with monthly or quarterly frequency. IB researchers might use IFS for things like exchange rate data or money supply when examining financial factors affecting MNEs.
- Balance of Payments (BOP) and Direction of Trade Statistics (DOTS): These include detailed information on countries’ balance of payments components (including FDI flows, portfolio flows) and bilateral trade flows.
- Coordinated Direct Investment Survey (CDIS): A dataset focusing on bilateral FDI positions (stock of investment) across countries – useful for network analyses of FDI.
- Coordinated Portfolio Investment Survey (CPIS): Similar concept for portfolio investments.
- Other datasets like government finance statistics, etc.
The IMF provides a JSON REST API for its data (at https://data.imf.org), and there are R packages to facilitate access. One widely used package was `imfr` (by Christopher Gandrud); more recently, `imf.data` (available on CRAN) covers the same ground with a slightly different interface. These packages allow you to query IMF data by specifying the database, series, country, and time period, much like WDI does for World Bank data. The `imf_data()` function (from `imfr`) can be used after finding the relevant “database ID” and “indicator code” for the series of interest.
For example, the IFS database is identified by `"IFS"`. Within IFS, each series has a code (for instance, the real effective exchange rate might be coded as “EREER_IX”). To get the annual Real Effective Exchange Rate index for, say, China and the UK, one could do:
```r
# Using the imfr package
library(imfr)

reer_data <- imf_data(
  database_id = "IFS",           # IMF database: International Financial Statistics
  indicator   = "EREER_IX",      # Real Effective Exchange Rate index (CPI-based)
  country     = c("CN", "GB"),   # ISO-2 country codes: China, United Kingdom
  start       = 2000,
  end         = 2022,
  freq        = "A"              # annual frequency
)
head(reer_data)
```
This call will reach out to the IMF API and return a data frame of the REER index for China and the UK, annually from 2000 to 2022. (The `imf_data()` function automatically handles the API calls and parses the JSON into a data frame.) The structure typically includes columns for country code, year, indicator, and value. In this example, `reer_data` might show entries like: Country = CN, Indicator = EREER_IX, Year = 2000, Value = 100 (just an example).

If we wanted multiple series in one go, some IMF databases allow bundling them. For instance, one could request both an interest rate and an exchange rate series together by providing a vector of indicators. The `imf_data()` function is quite flexible (and has arguments like `return_raw` and `print_url` for debugging, etc.).
One challenge with IMF data is discovering the series codes. The packages usually provide helper functions like `imf_ids()` to list available databases and `imf_codes()` to list the series codes within a given database. It may take some effort to find the exact code for, say, FDI net inflow in the BOP data, but the investment is worthwhile. The IMF data often go further back historically or have higher frequency than other sources, which can enrich IB analyses (for example, studying quarterly trends around financial crises, or comparing monthly exchange rate volatility).
Use in IB research: IMF’s CDIS data is particularly relevant if you are analyzing bilateral FDI patterns – for example, how much FDI from country A goes to country B. If a JWB study was examining Chinese outward FDI and its determinants in various host countries, one could fetch the CDIS data on China’s outward investment positions. Similarly, DOTS could be used for analyses of bilateral trade exposure of multinational firms (as a macro proxy). The key advantage of using the IMF API via R is that you can get these specific slices of data without manually downloading multiple Excel files from the IMF website. Everything is reproducible and easily adjustable (e.g., changing the set of countries or extending the time range).
A quick example: suppose we want to replicate a finding that countries with more stable exchange rates have attracted more FDI (a plausible IB hypothesis related to currency stability and investment). We could retrieve from IMF the standard deviation of monthly exchange rate for each country (or an index of exchange rate stability) and the FDI inflows from either IMF BOP or World Bank. Writing an R script to do this for all countries over two decades would be far less painful than manually collecting those series country by country. And by incorporating it into our code, we ensure that if new data (e.g., 2023 values) become available, we just rerun the script to include them.
In summary, the IMF data sources, accessed through R, provide a powerful complement to World Bank and OECD data – especially for financial and high-frequency economic indicators. The combination of `imf_data()` and careful code documentation means even complex data gathering (like assembling a panel of dozens of countries’ macro-financial indicators) becomes tractable and reproducible.
3.2.4 UNCTAD Data (Trade and Investment)
The UN Conference on Trade and Development (UNCTAD) is known for its focus on trade, investment, and development, and it produces influential reports like the World Investment Report. UNCTAD compiles data on FDI, multinational enterprises, trade in commodities, and increasingly on topics like the digital economy and sustainable development. Some key UNCTAD databases of interest include:
- FDI Statistics: UNCTADstat (UNCTAD’s data portal) provides annual data on FDI inflows and outflows, inward and outward FDI stock, cross-border M&A values, and the number of Greenfield investment projects, among others. These often complement the IMF/WB data by providing additional breakdowns (e.g., FDI by sector or by group of economies).
- Trade Data: While basic trade values can be obtained from sources like the World Bank or IMF, UNCTAD, in collaboration with WTO, provides detailed merchandise and services trade statistics and indicators like terms of trade, concentration indices, etc.
- Tariffs and Non-Tariff Measures: UNCTAD’s TRAINS database (Trade Analysis Information System) contains detailed tariff rates and non-tariff measure data by country. This is often accessed via the World Bank’s WITS interface (World Integrated Trade Solution).
Accessing UNCTAD data programmatically is a bit less standardized than for WB/OECD/IMF, because UNCTADstat’s API is not as publicly advertised. However, many UNCTAD datasets can be retrieved either through partner APIs or via direct download links. For example, the WITS API (managed by the World Bank) provides access to the UNCTAD TRAINS data (tariff and trade measures) – there is even an R package on GitHub (`witstrainsr`) to interface with it. Similarly, some UNCTAD data appear in World Bank’s WDI or other databases (for instance, WDI’s data on investment often cite UNCTAD as a source). If an indicator from UNCTAD is included in WDI, using the WDI package might be the easiest route.
For direct access, researchers can use generic methods. UNCTADstat allows users to download data tables in CSV or TSV format from its website. If those download URLs can be determined, one can use `read.csv()` on them. As an illustration, suppose UNCTADstat has a table for “Inward FDI stock, annual, by country”. One could manually download that CSV and then use it in R, but a better way is to find the direct link (often, data portals have a way to share a link to the data behind a table selection) and use R to download it. For example:
<- "https://unctadstat.unctad.org/EN/DownloadCSV.ashx?reportType=FDI_InwardStock&...<parameters>..."
url <- read.csv(url) fdi_stock
(This URL is hypothetical – UNCTADstat queries would include specific report and filter parameters.)
Another approach is web scraping, though it should be a last resort if an API or direct link isn’t available. Using R’s `rvest` package, one could scrape a page like UNCTAD’s country profiles for specific indicators, but this is less robust and not officially supported by the data provider.
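For completeness, here is what such a scrape might look like; the URL and the table position on the page are placeholders, so treat this purely as a pattern rather than a working query:

```r
library(rvest)

# Hypothetical UNCTAD page containing an HTML data table (placeholder URL)
page <- read_html("https://unctad.org/some-country-profile")

# Extract the first <table> element on the page as a data frame
tbl <- page |>
  html_element("table") |>
  html_table()
head(tbl)
```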
Despite the lack of a dedicated R package for UNCTAD general statistics, it’s still possible to integrate UNCTAD data into a programmatic workflow. For trade data, many researchers actually rely on UN Comtrade (the UN’s trade database), which does have an API and even an R package (`comtradr`). Comtrade provides detailed bilateral trade flows by product. If one’s IB research question is about trade (say, the export structure of countries where MNEs invest), Comtrade could be queried for the relevant trade stats.
Use case example: The World Investment Report might state a fact like “FDI flows to developing economies declined by X% in 2020”. If we want to use data to analyze trends in FDI flows across regions ourselves, we could retrieve the FDI inflow data for groups of countries. UNCTAD provides regional aggregates (like “Developing Asia” total FDI). Programmatically, one might fetch data for all countries, then aggregate by region in R (using a country-to-region mapping). This ensures we are using the same underlying data that UNCTAD analysts use, and we can then reproduce or examine further (e.g., checking which sub-region had the biggest drop).
Another example: IB scholars often examine the effect of institutions on FDI. UNCTAD’s FDI data can be paired with institutional indicators (like those from WGI or Doing Business). One could use WDI for institutional data and UNCTADstat for FDI data, merging them in R by country-year. We will demonstrate a version of this in the empirical example section.
In summary, UNCTAD data can be accessed in R, but it may require a bit more creativity (leveraging partner APIs or manual URL construction). The effort is justified by the unique data UNCTAD offers, especially on investment. By ensuring that even these data are pulled via code (and not manual downloads), we keep our research workflow transparent. If, say, UNCTAD revises its FDI figures (which happens as data get updated), a quick re-run of the script will incorporate those revisions, and our entire analysis updates accordingly.
3.2.5 National Statistical Office Data (Example: Statistics Canada via cansim)
In addition to international organizations, IB researchers may need data from national sources, especially for firm-level or detailed industry-level information that international databases might not capture. Many national statistical offices now provide open data portals and APIs. For instance, the US Bureau of Economic Analysis (BEA) has data on US MNE activities abroad, the UK’s Office for National Statistics provides detailed trade and FDI stats, and so on. As a prototypical example, let’s look at Statistics Canada, which offers an API for its economic and social data.
Statistics Canada’s main socioeconomic database (formerly called CANSIM) contains thousands of data tables, each identified by a table number (e.g., 36-10-0104-01 might be a GDP table). The R package `cansim` is designed to interface with Statistics Canada’s data, allowing users to search tables and download them as tidy data frames. Under the hood, it accesses StatCan’s JSON API.
To use `cansim`, you install and load it, and then you can call `get_cansim("table_number")`. For example, suppose we want the annual GDP of Canada (table 36-10-0222-01 in StatCan’s database). We can fetch it like so:
```r
library(cansim)

gdp_data <- get_cansim("36-10-0222-01")
head(gdp_data)
```
When run, `get_cansim("36-10-0222-01")` will reach out to StatCan and return the table as a data frame. The `cansim` package automatically converts the raw data into a tidy format: typically, you get columns like `REF_DATE` (the date or year), geographical classification (if any), and the value along with units. It also handles things like converting strings to numeric and adding useful attributes. In our GDP example, since it’s national GDP, the table might have just Year and GDP (and possibly components, if included). If the table has multiple series (say, GDP by expenditure components), the data frame will include a column for the category.
The package also provides helper functions. For instance, you can search for tables by keywords: `search_cansim("foreign direct investment")` might return a list of table IDs related to FDI involving Canada. Once you have the ID, you plug it into `get_cansim()`. The package documentation and vignettes list many such capabilities (including bilingual data retrieval in English/French, since Canada is bilingual, but that’s ancillary for most analytical needs).
Statistics Canada’s API (and similarly others like BEA’s API in the US) allows retrieving data in a programmatic way that ensures you always get the latest revision. If StatsCan revises last quarter’s GDP, your code will automatically get the revised number next time it’s run. This is important because IB research, especially when focusing on a particular country, might use the official national statistics as the definitive source.
Use case in IB research: Imagine studying the performance of Canadian multinationals – you might need Canadian FDI outflows by country or the number of Canadian-controlled affiliates abroad. StatCan provides such data in tables (for example, there are tables for Canada’s direct investment position abroad by country). Using `cansim`, you could retrieve Table 36-10-0009-01: Canadian direct investment abroad, position by country. The code would be similar: `cda_outward_fdi <- get_cansim("36-10-0009-01")`. Then you could filter that data frame for the years and partner countries of interest. By doing it in code, you make it easy to update when the next year’s data is released, or to switch to a different dataset by just changing the table number (a similar workflow applies to other countries whose statistical agencies offer APIs).
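A sketch of that retrieval and filtering step (the table number comes from the text above; `REF_DATE`, `GEO`, and `VALUE` are cansim's standard tidy-output columns, though this particular table's dimension columns should be inspected before use):

```r
library(cansim)
library(dplyr)

cda_outward_fdi <- get_cansim("36-10-0009-01")

cda_outward_fdi %>%
  filter(as.numeric(REF_DATE) >= 2015) %>%  # keep recent years
  select(REF_DATE, GEO, VALUE) %>%          # standard cansim columns
  head()
```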
Another scenario: national data might give more granular insights, like provincial breakdowns or industry breakdowns within one country. For example, StatsCan might have a table on FDI inflows to Canada by industry sector. If an IB researcher is analyzing which sectors attract the most FDI in Canada and how that relates to policy, they could programmatically retrieve that sectoral FDI data, and perhaps merge it with sector-level employment or productivity data – also available from StatCan – all through code.
While our focus here is StatsCan, it’s worth mentioning that many other national agencies have similar APIs (and sometimes R packages). The US Census Bureau and BEA have R packages (`tidycensus`, `bea.R`), the UK ONS has an API (which can be queried with JSON requests even without a dedicated package), Eurostat (for EU countries) has the well-used R package `eurostat`, etc. The principles are the same: find the dataset ID or query parameters, then use either a specific package or a generic GET request to fetch the data. Incorporating these into an IB research project can significantly broaden the data available for analysis and allow triangulating international datasets with national details.
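To show how little the pattern changes across agencies, here is a sketch using the `eurostat` package (the table ID "tec00001", an aggregate GDP table, is given for illustration; confirm the ID on Eurostat's site before relying on it):

```r
library(eurostat)

# Download a Eurostat table by its ID and return it as a tidy data frame
eu_gdp <- get_eurostat("tec00001")
head(eu_gdp)
```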
In conclusion, national data sources are an important piece of the puzzle for international business research, especially for deep dives into specific countries or validating data. Using packages like `cansim` ensures that even when using country-specific data, we maintain a reproducible pipeline. This also underscores a best practice: keep all data retrieval in your code – whether it’s global from the World Bank or local from a national source – so that anyone running your analysis script can obtain every dataset needed from scratch.
3.3 Empirical Example: Integrating Multi-Source Data in IB Research (FDI and Institutional Quality)
Having discussed how to retrieve various data programmatically, we now turn to an empirical example that mimics a typical international business research scenario. Our example question will be: How do host-country institutional quality and market size relate to inward FDI flows? This question is inspired by a stream of IB literature examining determinants of FDI – for instance, the idea that countries with better governance attract more foreign investment (supported by studies like Globerman & Shapiro, 2002 and recent findings in sub-Saharan Africa) and that larger or richer markets pull in more FDI. We will assemble a panel dataset for a set of countries, combining foreign investment data with institutional indicators and economic controls, then do a brief analysis (descriptive or a simple regression) to illustrate the process. The focus here is not on cutting-edge econometric modeling but on the workflow – how multiple data sources can be brought together seamlessly in R, with full reproducibility.
Step 1: Define the scope of data (countries, years, variables). Let’s say we are interested in a diverse set of countries, including both developed and developing economies. For example, we might include the United States, China, India, Brazil, South Africa, Germany, and Nigeria – a mix of large economies and emerging markets across different regions. We will cover the period 2005–2020 (15 years, giving a reasonable timeline that includes pre- and post-financial crisis, etc.).
The variables we want for each country-year are:
- FDI Inflows (% of GDP): as the dependent/outcome variable. We can get this from the World Bank WDI (indicator code BX.KLT.DINV.WD.GD.ZS as used earlier).
- Institutional Quality Indicator: we will use Government Effectiveness (GE) from the Worldwide Governance Indicators (WGI) as a proxy for institutional quality. GE is measured in units from approximately -2.5 (weak) to +2.5 (strong governance). (Alternatively, we could use Control of Corruption or an average of WGI indicators, but let’s pick one for simplicity.) We’ll obtain this from the World Bank as well (WGI data, which can be accessed via WDI using the code GV.GOVT.EF.ES or GE.EST).
- Market Size / Economic Development: we use GDP per capita (constant USD) to proxy the market attractiveness and development level (from WDI, indicator NY.GDP.PCAP.KD).
- Trade Openness: as a control, the ratio of trade to GDP (from WDI, indicator NE.TRD.GNFS.ZS – total exports+imports as % of GDP). This is a common control in FDI models, capturing how open the economy is to trade.
- (We could include other controls like natural resource endowment, human capital, etc., but we’ll limit to the above for this illustration.)
All these indicators are conveniently available via the World Bank API, meaning we can actually get them in one go with the `WDI` package. However, to illustrate multi-source integration, we might simulate a scenario where we get FDI from UNCTAD instead. But since WDI’s FDI data are reliable and likely sourced from similar origins (IMF/UNCTAD), we will use WDI for everything for now. In a real project, one could cross-verify WDI’s FDI figures against UNCTADstat if needed.
Step 2: Retrieve the data using R. We will use the `WDI()` function as earlier, but with the new set of indicators and the specified countries and years:
```r
library(WDI)

# Define the indicators of interest (the names become column names)
wb_indicators <- c(
  FDI_inflow = "BX.KLT.DINV.WD.GD.ZS",  # FDI net inflows (% of GDP)
  GovEffect  = "GV.GOVT.EF.ES",         # Government Effectiveness (estimate)
  GDPpc      = "NY.GDP.PCAP.KD",        # GDP per capita (constant USD)
  TradeOpen  = "NE.TRD.GNFS.ZS"         # Trade (% of GDP)
)
countries <- c("USA", "CHN", "IND", "BRA", "ZAF", "DEU", "NGA")  # 7 example countries

data_panel <- WDI(country = countries, indicator = wb_indicators,
                  start = 2005, end = 2020, extra = TRUE)
```
In this call, we used a named vector for the indicators – this will automatically name the columns in the returned data frame “FDI_inflow”, “GovEffect”, etc., instead of the raw codes. We also set `extra = TRUE` to get additional country metadata, which gives us columns like `iso2c`, `country`, `region`, and `income` (income level category). This can be useful for later filtering or grouping (for instance, to group results by region or to color-code plotted points by income level).
After running the above, `data_panel` contains our assembled dataset. We should check the dimensions: it should have 7 countries × 16 years = 112 rows (minus any missing data). WDI will return NA for GovEffect in 2020 if WGI hasn’t been updated through 2020 (suppose WGI is available up to 2019; in that case, 2020 would be NA for GovEffect). Also, note that some countries might have missing FDI%GDP in some years. We need to be mindful of missing values in any analysis or visualization.
Step 3: Data cleaning and merging (if needed). In this case, since we pulled everything in one go, the data are already merged by country-year. If we had fetched some variables from different sources (say, FDI from UNCTAD via a different route), we would need to merge them by country and year. In R, merging can be done with base R’s `merge()` or with dplyr’s `left_join()`. For example, if `fdi_df` came from UNCTAD and `inst_df` came from WDI, both containing country-year observations, we could do: `merged_df <- merge(fdi_df, inst_df, by = c("country", "year"))`. Ensuring a common country identifier is crucial – using ISO country codes is usually easiest to avoid spelling mismatches. The `WDI` data gave us `iso2c` and `iso3c`, which are ISO codes; UNCTAD might use country names that differ (e.g., “United States of America” vs “United States”). In a multi-source project, one might use the `countrycode` package to harmonize country names to ISO codes and so facilitate merges.
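A brief sketch of that harmonization step (`fdi_df` and `inst_df` are the hypothetical data frames from the previous paragraph):

```r
library(countrycode)

# Map free-form country names to ISO-3 codes, then merge on the codes
fdi_df$iso3c <- countrycode(fdi_df$country,
                            origin      = "country.name",
                            destination = "iso3c")
merged_df <- merge(fdi_df, inst_df, by = c("iso3c", "year"))
```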
For our `data_panel`, let’s remove the extra columns we don’t need for analysis and ensure the types are correct:
```r
library(dplyr)

data_panel <- data_panel %>%
  select(country, iso3c, year, FDI_inflow, GovEffect, GDPpc, TradeOpen) %>%
  arrange(country, year)

summary(data_panel)
```
This uses dplyr to select only the relevant columns and then arrange by country-year. A quick `summary()` will show descriptive stats and, importantly, how many NAs are in each column. Suppose we find that `GovEffect` has an NA for 2020 (because WGI’s last update was 2019). If so, for consistency we might drop 2020 or note it when interpreting results. For simplicity, let’s assume all data through 2020 except GovEffect 2020 are present.
Step 4: Descriptive analysis. We can now explore the data. Let’s compute some simple correlations or plot a relationship to see if it matches expectations. A quick check: do countries with higher governance scores have higher FDI/GDP on average?
```r
# Compute country averages of GovEffect and FDI_inflow over the period
country_means <- data_panel %>%
  group_by(country) %>%
  summarize(GovEffect_avg  = mean(GovEffect, na.rm = TRUE),
            FDI_inflow_avg = mean(FDI_inflow, na.rm = TRUE))
print(country_means)
```
This will list each country with their average governance and FDI%GDP. We might see, for example, the USA and Germany have high GovEffect and perhaps moderate FDI inflows relative to GDP, whereas Nigeria might have lower GovEffect and also perhaps lower FDI as % of GDP (or it could have high FDI% if it’s resource-seeking FDI – interestingly, governance is one factor among many). The point is to observe patterns. We could also create a scatterplot:
```r
library(ggplot2)

ggplot(country_means, aes(x = GovEffect_avg, y = FDI_inflow_avg, label = country)) +
  geom_point() +
  geom_text(vjust = -0.5) +
  labs(x = "Avg Government Effectiveness (WGI)",
       y = "Avg FDI Inflows (% of GDP)",
       title = "Governance vs FDI (country averages, 2005-2020)")
```
This would produce a scatterplot of our seven countries, which we could embed if needed. We might expect a positive slope (better governance, more FDI), which would align with prior studies that find a negative relationship between institutional weakness and FDI (i.e., strong institutions correlate with higher FDI inflows). With only seven observations it’s just an illustration, but one could expand this to all countries in the dataset to see a broader pattern.
Step 5: Simple regression analysis. As a final step, we can run a quick panel regression on our data to see if the hypothesized relationships hold (keeping in mind this is a small sample for demonstration):
```r
# Pooled OLS of FDI%GDP on GovEffect, log GDP per capita, and TradeOpen,
# with year dummies
model <- lm(FDI_inflow ~ GovEffect + log(GDPpc) + TradeOpen + factor(year),
            data = data_panel)
summary(model)
```
This regresses FDI_inflow on Government Effectiveness, log GDP per capita (GDPpc is typically highly skewed, so the log is conventionally used), trade openness, and year fixed effects (`factor(year)`) to account for global shocks in each year (like crisis years). The output summary gives the coefficient estimates. We might find, for example, a positive coefficient on GovEffect (suggesting better governance is associated with higher FDI/GDP, as expected) and perhaps a positive coefficient on GDP per capita (wealthier countries attract more FDI relative to their GDP – or it could be negative if richer countries’ FDI is lower as a share of GDP simply because GDP is large; interpretation needs care). Trade openness might also show a positive coefficient if open economies tend to also be open to FDI.
The exact numbers are less important here than the fact that we could run this analysis immediately after obtaining the data, because everything is in R. Moreover, if we decided to add more countries to the sample or update the data for 2021, we’d just change the `countries` vector or the end year and rerun – the entire data assembly and analysis updates in one go. This is a clear advantage over manual data prep, where each change could require hours of re-collecting and adjusting spreadsheets.
Step 6: Documenting and citing data sources. It’s good practice to record where each data series came from. In a Quarto document, we could add footnote citations when mentioning, e.g., “FDI inflow data are from the World Bank” or “Governance indicators are from the World Governance Indicators project”. By citing the World Bank or UNCTAD sources in text (as we’ve done with bracketed citations to evidence or definitions), we maintain transparency about data provenance. Quarto (or R Markdown) can also generate a references list at the end if we provide bibliography entries for these sources (for example, an APA reference for the World Bank database).
In fact, as part of reproducibility, one should also cite data (not just academic papers). The World Bank, IMF, etc., often provide suggested citations for their datasets. For instance, one might cite “World Bank (2025), World Development Indicators” in the reference list. This level of detail might go into an appendix or the data section of a paper. For our purposes, demonstrating that our code itself fetches from those official sources is already a strong transparency measure. Anyone can inspect the code and see, for example, that `BX.KLT.DINV.WD.GD.ZS` is a World Bank indicator (which they could verify on the World Bank’s metadata site).
This example exercise illustrates how an IB researcher can leverage multiple data sources in one coherent workflow. We combined data on institutional quality, economic size, and openness (all World Bank in this case, but easily extendable to other sources) to explain variation in FDI inflows, replicating qualitatively the approach of numerous IB studies. If this were a real research project, we would extend it: perhaps include many more countries, use panel data methods (fixed effects, etc.), address endogeneity, etc. But those analytical details aside, the data acquisition part – often one of the most labor-intensive parts of doing international comparisons – is made considerably easier and more reliable through programmatic means.
By saving this script and the resulting dataset (or better yet, by not even saving the dataset as a static file but always regenerating it when needed), we ensure that our analysis can be updated and checked. If tomorrow the World Bank revises India’s GDP figures, our analysis will incorporate that automatically on re-run. If a reviewer questions whether using Control of Corruption instead of Government Effectiveness yields similar results, we can swap the indicator code and re-run to find out. This agility in handling data ultimately leads to higher confidence in our findings and facilitates deeper exploration, since we spend less time on manual data wrangling and more on analysis and interpretation.
3.4 Best Practices for Reproducibility and Data Transparency in IB Research
In the final section of this chapter, we turn to the practices that wrap around the technical workflows – the habits and tools that ensure your entire research process is transparent and reproducible. As noted, IB as a field has recognized the need to improve credibility of findings by making data and analysis more open. Embracing programmatic data acquisition is one pillar of this, but it must be complemented by good project management and documentation. Here are several best practices to adopt:
1. Organize your project with a clear folder structure and use version control. It’s advisable to set up an R project (or Quarto project) for your research, which will designate a working directory. Within that, create subfolders such as `data/` (for raw data files or outputs), `scripts/` (for R scripts), `figures/` (for plots or result images), and `docs/` (for Quarto or R Markdown files, or final outputs). A typical structure might be:
```
project_name/
├── data/
│   ├── raw/        # original raw data dumps (if any)
│   └── processed/  # datasets after cleaning/merging
├── scripts/
├── figures/
└── docs/
```
When you retrieve data programmatically, you might not even need to save raw files to disk (you can keep data directly in memory). However, it’s often good to cache a copy of raw data as used at a particular time – especially if the source is a live API that might change. For example, after pulling data from WDI, you could save it as `data/raw/wdi_pull_2025-07-08.csv` to have a record (see the sketch below). If using Git, you might not put large data files under version control, but you could store small CSVs, or at least the code to get them.
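A minimal sketch of such a timestamped cache (assuming the `wdi_data` object from Section 3.2.1; the folder layout follows the structure above):

```r
# Cache today's raw API pull under data/raw/ with the date in the file name
dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)
write.csv(wdi_data,
          file.path("data/raw", paste0("wdi_pull_", Sys.Date(), ".csv")),
          row.names = FALSE)
```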
Using Git and GitHub (or other Git hosting) is highly recommended. Git tracks changes to your scripts and text, so you have a history of what you did. It integrates with Positron nicely (Positron, being VS Code-based, has a source control panel). By committing your analysis scripts and Quarto documents regularly, and pushing to GitHub (possibly in a private repo if data is confidential, or public if you can share openly), you create a paper trail of your research evolution. This practice helps in collaboration as well – co-authors can see each other’s changes, and if you need to roll back to an earlier analysis, Git makes it possible.
2. Document data provenance and transformations. Data provenance means recording where data came from and how they have been processed. In this chapter, we’ve cited sources whenever we discussed data (e.g., World Bank WDI, OECD API). In a real project, you should keep a README file or a data dictionary that, for each dataset used, notes the source (with a URL or citation), the date you accessed it, and any subset or filters applied. If you use Quarto, you can include this information in the text. For instance, a Quarto report might have a section “Data Sources” where you write: “FDI data were retrieved from UNCTADstat on July 1, 2025; institutional quality indicators are from the World Bank’s Worldwide Governance Indicators (2019 release),” etc. Providing these details meets journal requirements (many journals now expect a data availability statement). In fact, as per the AEA’s standards, one should precisely document how to access the original data and any conditions of access. Our approach of including code in a Quarto document inherently contributes to this: the code shows how data were accessed. However, if some data were obtained through a web portal that doesn’t have an API (e.g., manually downloaded archival data), then explicitly say so and store those files in `data/raw/` with a clear name.
Tracking transformations is equally important. If you performed data cleaning (say, recoding some variables, filtering out some observations), consider writing those steps in a script rather than doing them ad hoc. For example, if you drop outlier countries from a sample, let that be done via code (so that anyone rerunning sees that logic). The goal is that someone else could start from raw data and with your scripts arrive at the exact analytical dataset you used.
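One lightweight way to automate part of this record-keeping is to write a small provenance log next to each cached pull. A sketch, assuming the WDI package is installed (the fields below are illustrative, not a standard):

```r
# Record when, from where, and with which software a dataset was pulled
provenance <- data.frame(
  dataset     = "WDI pull (FDI, governance, GDP, trade)",
  source      = "World Bank API via the WDI package",
  accessed    = as.character(Sys.time()),
  r_version   = R.version.string,
  wdi_version = as.character(packageVersion("WDI"))
)
write.csv(provenance, "data/raw/provenance_log.csv", row.names = FALSE)
```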
3. Use literate programming tools (Quarto/R Markdown) for integration of analysis and narrative. Quarto (the successor of R Markdown) allows you to combine text, code, outputs, and references in one document. By writing your paper or chapter as a Quarto file (`.qmd`), you essentially create a dynamic document where the results (tables, figures) are generated directly from the data by the embedded code. This ensures consistency – no manually copy-pasted values that could be mistyped or become outdated. It also makes the review process easier: one could examine the Quarto source to see exactly how each number was calculated. Journals and publishers are increasingly open to, or even encourage, submission of such supporting documents (for replication materials). Quarto can output to PDF or Word as needed for submission, so the dynamic nature stays behind the scenes.
We highly encourage writing at least the analysis sections in Quarto; if you prefer to write the main text separately, you can still use Quarto to produce an appendix of results. In our example, we could include the regression results and a figure directly in this chapter text because we ran the R code inside the document. If the data updates, those results in the chapter update too – no need to manually edit the text to reflect a new coefficient.
Quarto also integrates nicely with citation management. You can keep a bibliography (`.bib` file) of references. If you use Zotero for reference management (as many researchers do for organizing papers and references), you can use the Zotero Better BibTeX plugin to maintain an updated `.bib` file for your project. Then Quarto will automatically format your in-text citations (the `[@key]` syntax) and generate a references list in the chosen style (APA, etc.). Positron further streamlines this by allowing you to insert citations from Zotero directly: in visual mode, you can pull up a Zotero search and pick a reference to insert. This is incredibly useful for academic writing, as you don’t have to manually type citations or worry about formatting. In the Positron visual editor, if you have Zotero running, you can press a hotkey (as configured, e.g. Shift+Alt+Z) to trigger the citation picker, search for an author or title, and hit Enter – it will insert the citation key into your Quarto doc, and later Quarto will compile it into a formatted citation and reference entry. By using these tools, you ensure that all sources – be they data or literature – are properly credited and traceable.
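For orientation, here is what the relevant pieces of a `.qmd` file look like (the file names and citation key are placeholders for whatever your Better BibTeX export produces):

```markdown
---
title: "FDI and Institutional Quality"
bibliography: references.bib   # exported from Zotero via Better BibTeX
csl: apa.csl                   # optional citation style file
---

Countries with stronger governance attract more FDI [@globerman2002].
```

When rendered, Quarto replaces `[@globerman2002]` with a formatted in-text citation and appends the full reference to the document's bibliography.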
4. Share your data and code when possible. After doing the hard work of assembling a clean dataset, consider sharing it (unless restricted). Journals like JIBS are moving toward encouraging authors to provide replication files. You can deposit data and code in a repository (e.g., Dataverse, Zenodo, or even as supplementary material with the journal). Since our approach used mostly public data, there’s no confidentiality issue; we could include the final analysis dataset as a CSV along with the code. But even if we don’t, anyone can run our script to regenerate it – that’s the beauty of programmatic access! Do note that some sources (like proprietary databases or certain surveys) might not allow redistribution of raw data; in such cases, you share the code and perhaps a synthetic example, and the instructions for the reader to obtain the original data themselves. Always check data usage policies. For all open datasets we used (World Bank, etc.), it’s generally free to use with citation.
5. Utilize Positron features to enhance workflow. Positron, as mentioned, has integrated support for many of these practices. The Git pane will show diffs of your code changes so you can be confident in what you’re about to commit. The IDE’s notebook interface can execute code step by step, which is great for debugging your data retrieval before finalizing the script. Also, Positron’s multi-language support means if there’s a particular analysis you need to do in Python (say, using a specific machine learning library), you can do that within the same environment and even the same Quarto document (Quarto allows Python and R chunks together). This can be useful if, for example, an API has an official Python client but not an R client – you could use that in a pinch within your otherwise R-based workflow.
To summarize, reproducibility in international business research is achieved by combining tools and habits: obtaining data through code, keeping that code organized and under version control, documenting every step, and writing up the findings in a way that tightly links to the analytic process (via Quarto with embedded code and citations). By following these practices, you not only make it easier for others to trust and verify your work, but you also make your own life easier in the long run – updating or extending the project will be far less daunting. As Aguinis et al. (2017) argue, taking proactive steps to improve reproducibility and transparency is critical for the credibility of IB research. Embracing programmatic data acquisition with R in Positron is a concrete way to answer that call, aligning our research practice with the emerging norms of open science.
References (APA style)

Adegboye, F. B., et al. (2020). Institutional quality, foreign direct investment, and economic development in sub-Saharan Africa. Humanities & Social Sciences Communications, 7, Article 38.

Aguinis, H., Cascio, W. F., & Ramani, R. S. (2017). Science’s reproducibility and replicability crisis: International business is not immune. Journal of International Business Studies, 48(6), 653–663.

American Economic Association. (2019). Data and code availability policy. Retrieved from AEAweb.org (accessed 2025).

Globerman, S., & Shapiro, D. (2002). Global foreign direct investment flows: The role of governance infrastructure. World Development, 30(11), 1899–1919.

OECD. (2025). OECD Data Explorer FAQ. Retrieved from OECD.org (accessed July 2025).

Radečić, D. (2024, July 4). Introducing Positron: A new, yet familiar IDE for R and Python. Appsilon Blog.

Tüzen, M. F. (2024, December 15). Extracting data from OECD databases in R: Using the oecd and rsdmx packages. R-Bloggers.

UNCTAD. (2021). World Investment Report 2021 (FDI data). United Nations Conference on Trade and Development.

Vuorre, M. (2025, June 6). How to add citations from Zotero to Quarto documents. Retrieved from vuorre.com.

World Bank. (2020). Worldwide Governance Indicators (2020 update). Retrieved from World Bank DataBank (accessed 2025).

World Bank. (2025). World Development Indicators. Retrieved via the WDI R package (version 2.7.9).