6  Data Management for International Business Statistics

Data management and manipulation are foundational skills for statistical analysis, especially in international business and economics where datasets can be large, complex, and sourced from around the world. In fact, data scientists often spend the majority of their time preparing and cleaning data before any analysis can be performed (Kandel et al., 2011). This chapter lays the groundwork for effective data manipulation in R, guiding you through essential techniques with an emphasis on examples relevant to international business contexts. We will cover:

  1. Using Data in R: How to import and handle data in R (using RStudio/Posit or related environments).

  2. Data Wrangling Basics: Core operations for manipulating data frames, including:

    • Adding or removing variables (columns).
    • Filtering and sorting observations (rows).
  3. Reshaping Data (Long vs. Wide): Converting between “tidy” long format and wide format data for different analytical needs.

  4. Merging Datasets: Joining data from different sources by keys, a common task when combining country-level indicators with firm-level data in international business research.

  5. Practical Examples: Illustrative exercises using international business and economics data (e.g., country GDP series, industry statistics across countries).

By the end of this chapter, you should be comfortable performing essential data wrangling tasks in R. This will prepare you to tackle complex datasets such as global economic indicators or multinational firm performance metrics. We assume you have R and RStudio (now developed by Posit) set up, with basic familiarity running R code. The examples apply equally whether you use the RStudio desktop IDE, Posit Cloud, Posit Workbench, or the newer Positron IDE – we will note any minor differences where relevant. As organizations increasingly rely on data-driven decision-making across borders (George, Haas & Pentland, 2014), mastering these skills will enable you to efficiently clean and prepare data for analysis in international business studies.

6.1 Git Setup for the Project

A recommended practice before diving into data manipulation is setting up version control for your project. This ensures that your data and code are backed up, trackable, and sharable – important factors for collaboration and reproducibility (Ram, 2013; Peng, 2011). In this course, we use Git (via GitHub) to distribute materials and to let you save your work. Reproducibility and transparency are increasingly emphasized in research, including international business scholarship (Aguinis, Cascio & Ramani, 2017), and using Git is one step toward those goals. Below are steps to set up your project repository via GitHub and RStudio/Posit Cloud:

  1. Fork the Repository: Log in to GitHub and navigate to the course repository (for example, a repository named DPR for “Data Programming with R”). Click the “Fork” button to create your own copy of the repository under your GitHub account. This fork acts as your personal version of the course materials (code and data). Using a fork means you can edit and save your work without affecting the original repository, while still pulling in updates from the instructor later.

  2. Clone in RStudio/Posit: In RStudio (or Posit Cloud), create a New Project from Version Control by providing the Git repository URL. Use the URL of your forked repository (you can find this via the green “Code” button on GitHub, copying the HTTPS link). Cloning will download the repository into a new RStudio project on your machine or cloud workspace. After cloning, you should see all the files (e.g., R scripts, data folders) in the Files pane of RStudio.

  3. Configure Git (optional but recommended): In RStudio’s Terminal tab, set your Git username and email for this project. This attribution is helpful for collaboration, so that your commits carry your identity. For example, run:

    git config --global user.email "your_email@example.com"  
    git config --global user.name "Your Name"

    Replace the email and name with those associated with your GitHub account. (The --global flag applies this setting for any Git repository on your system; you can also do it per project by omitting --global.)

  4. Pull Latest Changes: If the original course repository (the “upstream”) is updated (e.g. the instructor adds new data or corrections), you’ll want to integrate those changes into your fork. The quickest way is to pull directly from the original repository’s URL; you can also register it as a named upstream remote (a sketch of that approach follows this step). For example:

    git pull --rebase https://github.com/OriginalAuthor/DPR.git main

    This command fetches the latest commits from the original repository’s main branch and reapplies your changes on top of that (rebase). After a successful pull, your local project will have all the latest files and updates from the instructor. In collaborative data science teams, regularly pulling updates and managing version control is crucial for keeping everyone’s work in sync (Blischak, Davenport & Wilson, 2016).
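
    Alternatively, you can register the original repository as a named upstream remote once and then update from it whenever you like. A minimal sketch, using the same placeholder URL as above (replace it with the actual course repository):

    git remote add upstream https://github.com/OriginalAuthor/DPR.git
    git fetch upstream
    git rebase upstream/main

    The first command only needs to be run once per clone; after that, fetching and rebasing keeps your fork current with the instructor’s repository.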

Note: The company RStudio rebranded as Posit in 2022; the RStudio IDE itself keeps its name. You can use either the classic RStudio interface or Posit’s newer tools (such as Posit Cloud for a browser-based IDE, Posit Workbench for server deployments, or Positron, a newer IDE built on the same open-source foundation as VS Code). The Git integration process is similar in all cases. The instructions above assume the classic RStudio interface on Posit Cloud for simplicity, but you can adapt them to your setup. The key point is that using Git from the start helps maintain a reliable record of your data and analysis steps, aligning with best practices for reproducible research (Aguinis, Ramani & Alabduljader, 2018).

With the repository cloned and Git configured, you now have the project’s files ready in your RStudio/Posit environment. This includes any example datasets we’ll use and R scripts for exercises. You’re set up to begin importing and manipulating data in R, knowing that your work is version-controlled and backed up.

6.2 Using Data in R

Before diving into wrangling techniques, it’s important to understand how R organizes data in memory and how to get external data into R. In data science—especially for international business applications—you will frequently work with data frames, which are R’s primary structure for storing tabular data (like spreadsheets or database tables). Understanding data frames and related structures will ensure you can smoothly import and inspect your data prior to cleaning it.

  • Data Frames: In R, a data frame is essentially a table where each column is a vector (a series of values of the same type) representing a variable, and each row represents an observation (a single record). For example, one column might be Country (character type), another Year (numeric), and another GDP (numeric), with each row giving the values of those variables for a particular country and year. Data frames can contain mixed types (some columns numeric, others character or logical), which makes them very flexible for real-world data (where you often have a mix of numeric measurements and categorical labels or IDs). All columns must be the same length, so each row has an entry for each variable.
  • Tibbles: Modern R usage often prefers tibbles (from the tibble package in the tidyverse) as a refined version of data frames. Tibbles are essentially data frames with a few improvements that make them easier to work with. Notably, they do not convert strings to factors by default (historically, data.frame would turn character strings into factor variables unless prevented, which sometimes led to surprises). Tibbles also have a cleaner printing method that shows only the first 10 rows and as many columns as fit on screen, which is convenient for large datasets. In short, tibbles keep the behavior of data frames but avoid some of the “automatic” behaviors of base R that could confuse new users. (As the tibble documentation humorously notes, tibbles are “lazy” in that they do not change variable types or names without explicit instruction, and “surly” in that they will throw an error or warning rather than silently adjust things (Wickham et al., 2019).) For our purposes, you can treat data frames and tibbles interchangeably most of the time; the difference is just in convenience features.
  • Variables (Columns): Each column in a data frame or tibble is one variable or feature of the data. For example, a column Region might classify observations by world region (Asia, Europe, etc.), or a column Sector might indicate industry sector for a business. All values in a given column should represent the same type of information. In R, because each column is a vector, all values in that column have the same data type (all numeric, or all character strings, etc.). This is why, for instance, a column of GDP figures will be numeric, whereas a column of country names will be character strings.
  • Observations (Rows): Each row corresponds to one observation or record in the dataset. In an international economics dataset, a single row could be something like “Canada – 2020 – GDP = 1.64 trillion (USD)” meaning the observation of Canada’s GDP in the year 2020. In a firm-level dataset, a row might be a specific company in a specific year, or a single transaction or investment instance, depending on how the data is structured. It’s crucial that each row represents a distinct case without ambiguity. If multiple rows represent the same entity and time, then you have duplicated data or a level-of-analysis problem.
  • Missing Values: R uses the symbol NA (not available) to denote missing data. An NA can appear in any column where a value wasn’t recorded or is unknown. By default, most operations involving NA will yield NA – for example, calculating the mean of a vector that contains NA will result in NA unless you tell R to ignore missing values. Dealing with missing data is an important part of data cleaning (e.g., deciding whether to omit those observations or impute values). In this chapter, we will mainly point out the presence of NA and not delve deeply into imputation techniques. Just be aware that any summary statistic or model in R might need an argument like na.rm=TRUE (remove NAs) if you want to exclude missing values from the calculation.

When you load data into R, these structures will appear in RStudio’s Environment pane (usually in the upper right). You can click on a data frame in that pane to open a spreadsheet-like view of it, which is useful for an initial scan. You can also use functions like head(dataframe) to see the first few rows in the console, or str(dataframe) to see its structure (types of each column and a preview of values). Early inspection of data is good practice to ensure it loaded as expected (e.g., check that numeric columns are numeric and not mistakenly read as text, etc.).
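
To make these ideas concrete, here is a minimal sketch that builds a small tibble by hand and inspects it; the countries are real but the values are purely illustrative:

library(tibble)

# A tiny illustrative tibble of country-year observations
gdp_example <- tibble(
  Country = c("Canada", "France", "Japan"),
  Year    = c(2020, 2020, 2020),
  GDP     = c(1.64e12, 2.70e12, NA)   # one value deliberately left missing
)

head(gdp_example)                     # show the first rows
str(gdp_example)                      # structure: column types and a preview
mean(gdp_example$GDP)                 # returns NA because of the missing value
mean(gdp_example$GDP, na.rm = TRUE)   # ignores the NA and averages the rest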

Importing Data

Getting data into R is often the first step of any analysis. In international business research, data might come from CSV files downloaded from the World Bank or IMF, Excel spreadsheets shared by colleagues, databases of financial information, or even APIs (web services) providing real-time data. R has well-developed tools for all of these. Here we highlight some common import methods:

  • From a CSV or Text File: Comma-separated values (CSV) files are one of the most common formats for sharing data. Many open data portals (like the World Bank’s World Development Indicators, or UN trade data) allow downloading data as CSV. In R, the readr package’s read_csv() function is a fast and convenient way to import CSV files (Wickham & Hester, 2017). For example, gdp <- readr::read_csv("country_gdp.csv") will read a file and create a tibble gdp. This function automatically treats the first row as column headers (if they look like headers), guesses the data type of each column (you’ll see a message if it guesses, say, “column X parsed as col_integer”), and handles common issues like quoted text, different delimiters, etc. There are similar functions for other delimiters, like read_tsv() for tab-separated data. Base R also has read.csv() but we prefer read_csv() for its better performance and defaults.
  • From Excel Files: In business settings, data often comes in Excel (*.xls or *.xlsx) files. The readxl package (from the tidyverse) provides read_excel() to import Excel spreadsheets without needing Excel installed. It can read specific sheets by name or index. For example, library(readxl); data <- read_excel("GlobalMarkets.xlsx", sheet = "2019") would read the sheet named “2019” from an Excel file. Many government and industry reports publish statistical annexes as Excel files, so this is a useful tool. (There are also packages to directly interact with Google Sheets or other spreadsheet formats.)
  • From Google Sheets: Google Sheets are often used for collaboration, and sometimes datasets are shared via a Google Sheet link. The gsheet package provides a simple way to fetch data from a public Google Sheet URL. Using gsheet2tbl("https://docs.google...") will return a tibble of the sheet’s contents. This is essentially doing an on-the-fly import of a sheet. Keep in mind that the sheet needs to be accessible (either published or at least shared publicly). Another more robust option is googlesheets4, which can handle private sheets via OAuth authentication. In a classroom or small team setting, using gsheet is quick and easy for shared data (assuming no sensitive info).
  • From Databases: If your data resides in a database (SQL databases like MySQL, Postgres, or NoSQL stores, etc.), R can connect to those as well. The DBI package and various database-specific adapters allow you to run SQL queries and import the results into R data frames. For instance, if a company’s sales data is in a SQL database, one could use DBI::dbConnect() and then dbGetQuery() to retrieve the data. This is more advanced and requires knowing the database credentials and possibly some SQL, so we won’t cover details here, but be aware it’s possible (a minimal sketch follows this list). In international business research, one might use this to pull data from proprietary databases like Bloomberg or S&P Capital IQ (if they provide ODBC connections), or even from cloud storage.
  • From Web APIs: A lot of interesting data can be accessed through web APIs. For example, the World Bank has an API for its indicators, and organizations like WTO or OECD have API access for their statistics. R’s httr package (for HTTP requests) combined with jsonlite (to parse JSON data) allows programmatic access to such APIs. Additionally, many APIs have dedicated R packages (for example, wbstats for World Bank data, quantmod or fredr for financial and economic data from the U.S. Federal Reserve). Using APIs can automate data retrieval and ensure you’re getting the latest available data.
  • Built-in and Package Datasets: Lastly, R comes with some built-in sample datasets (like mtcars, iris, etc.), and many packages include datasets relevant to their topic. For example, a package on international trade might include a sample trade flows dataset. These are mainly for learning and examples, but can be convenient for practice. You load them simply by calling their name once the package is loaded.
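
To give a flavor of the database route mentioned above, here is a minimal, self-contained sketch using DBI with an in-memory SQLite database (requires the RSQLite package; the table name and values are invented for illustration):

library(DBI)

# Connect to a temporary in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Write a tiny illustrative table, then query it back with SQL
dbWriteTable(con, "sales", data.frame(Country = c("Canada", "France"),
                                      Sales   = c(120, 95)))
big_sales <- dbGetQuery(con, "SELECT Country, Sales FROM sales WHERE Sales > 100")

dbDisconnect(con)   # always close the connection when done

In practice you would replace the in-memory SQLite connection with the driver and credentials for your organization’s actual database, but the dbConnect()/dbGetQuery() pattern stays the same.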

In this chapter’s examples, we will demonstrate importing a CSV file and a Google Sheet, since those are common and straightforward. Imagine a scenario: you download a dataset of country GDPs over years from the World Bank as a CSV, and you have another dataset (perhaps a Google Sheet maintained by colleagues) that lists country regions or income groups. We will import both and then combine them.

Example: Importing a CSV File

Suppose we have a CSV file containing GDP data for several countries over multiple years – a typical panel dataset in international economics. Perhaps you obtained this from the World Bank or another source and saved it as chapter6_data.csv in a data/ folder of your project. To read this into R:

# Ensure readr is installed and loaded
library(readr)
# Import the dataset
gdpData <- read_csv("data/chapter6_data.csv")

Breaking down this code:

  • We load the readr package with library(readr), making sure we have access to read_csv(). (If you hadn’t installed readr yet, you’d first do install.packages("readr"), but in the Posit Cloud environment or a pre-set course project, it’s likely already installed as part of the tidyverse.)
  • The read_csv("data/chapter6_data.csv") function call reads the file and returns a tibble. We assign it to a variable gdpData. The string "data/chapter6_data.csv" is a relative file path: it assumes that our working directory is set to the project folder that contains the data subfolder. In RStudio projects, the working directory is automatically the project folder, so this should work if the CSV is in data/. You can always check your working directory with getwd() or adjust it with setwd().
  • After running this, you should see gdpData appear in your Environment pane. We can then inspect it. Running head(gdpData) will show the first six rows; dim(gdpData) might show, for example, 200 rows and 8 columns (depending on how many countries and years of data). Let’s say the columns in gdpData are Country, CountryCode, and columns for GDP in 2015, 2016, …, 2020. The first few rows might look like:
Country    CountryCode   2015    2016    2017   ...  2020  
Canada     CAN           1550    1520    1600   ...  1640  
France     FRA           2850    2900    2950   ...  2700  
...        ...            ...     ...     ...   ...   ...

Here, countries are rows and years are columns (wide format). If the file had a header row (which it likely did), read_csv used that for column names. It also likely inferred that Country is character data, CountryCode is character, and the year columns are numeric. If there were any issues parsing (say a non-numeric character in the GDP column), read_csv would display a message about parsing problems. You can address such issues by cleaning the source file or specifying column types explicitly (using col_types argument in read_csv).
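
If read_csv() does guess a type incorrectly, you can declare the types yourself via col_types. A sketch, assuming the column layout described above (Country, CountryCode, and year columns):

library(readr)

gdpData <- read_csv(
  "data/chapter6_data.csv",
  col_types = cols(
    Country     = col_character(),
    CountryCode = col_character(),
    .default    = col_double()    # read all remaining (year) columns as numeric
  )
)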

One quick tip: sometimes CSV files (especially if saved from Excel) might include a blank column or row, or an index column. If you see an unexpected column like X1 or one called “ï..”, that might be an artifact. You can remove it after import by doing gdpData <- dplyr::select(gdpData, -X1) (to drop the column X1) or using gdpData$X1 <- NULL as a base R way to remove a column. In our example, assume the data came in cleanly.

Example: Importing Data from a Google Sheet

Now, let’s say we have a supplemental dataset containing country metadata (perhaps continent, region, or latitude/longitude for mapping), which is stored in a Google Sheet. A collaborator from another country might have prepared this and shared the link with you. Instead of manually downloading it as CSV, we can fetch it directly. We’ll use the gsheet package for simplicity.

First, ensure the Google Sheet is accessible. Often, setting it to “anyone with the link can view” or using the Sheet’s “Publish to web” feature provides a URL that gsheet can use. Suppose the shareable URL is:

https://docs.google.com/spreadsheets/d/1nehKEBKTQx11LZuo5ZJFKTVS0p5y1ysMPSOSX_m8dS8/edit?usp=sharing

This looks like a Google Sheets link (the long string in the URL is the sheet’s unique ID). We can import it as follows:

library(gsheet)
locations <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1nehKEBKTQx11LZuo5ZJFKTVS0p5y1ysMPSOSX_m8dS8/edit?usp=sharing")

What this does: gsheet2tbl() takes the URL and returns a tibble of the sheet’s contents, which we assign to locations. Now, suppose locations has columns like Country, Latitude, Longitude, and maybe Region. It might look like:

Country    Latitude      Longitude      Region  
Canada     56.1304       -106.3468      Americas  
France     46.2276         2.2137      Europe  
...         ...            ...         ...

We can verify by running head(locations) to see the first few rows. One thing to note: Google Sheets don’t enforce data types, so everything might come in as text. If Latitude and Longitude are meant to be numeric, they could be character strings initially. We might need to convert them: for example, locations$Latitude <- as.numeric(locations$Latitude) (and similarly for Longitude) to get them as numeric type. Alternatively, within our analysis we might not need to explicitly convert if we just use them in plotting functions that can handle the conversion on the fly. Just be cautious: if you see them printed with quotes around the numbers, they’re still character.

Now we have two datasets in R:

  • gdpData: a wide-format dataset of country GDP over years.
  • locations: a dataset of country metadata (coordinates/region).

These will serve in upcoming sections to demonstrate data wrangling tasks. Our next step is to transform and combine these data frames for analysis.

6.3 R Grammar Essentials

Before performing transformations on data frames, it’s useful to cover some essentials of R’s syntax and the tidyverse approach to data manipulation. In particular, we will explain how to call functions from specific packages and how to use the pipe operator to write clear, readable code. These concepts form the backbone of an efficient workflow in R for data science.

Calling Functions and Packages in R

R is a language that is extended through packages. When you want to use a function, you need to either tell R which package it comes from or load that package into your session. There are two primary styles to do this:

  • Using package::function syntax: If you know exactly which package provides a function, you can call it directly without loading the whole package. For example, readr::read_csv() calls the read_csv function from the readr package. This is very explicit and helps avoid confusion in case multiple packages have functions with the same name. It’s common in examples or documentation to see code like dplyr::filter() to emphasize that we want the filter function from dplyr (and not, say, the filter function from the stats package). This approach is great for writing scripts or documentation where reproducibility is important – anyone reading the code knows exactly which package is needed for that function (Xie, 2015).

  • Using library(package) to attach the package: You can load a package into your session with library(packageName). For example, library(dplyr) will attach the dplyr package, making all its functions available by their short names (so you can just call filter(), summarize(), etc., without the dplyr:: prefix). This is convenient for interactive work and when you’ll be using many functions from one package. A typical R analysis script might start by loading several packages at once, like:

    library(dplyr); library(tidyr); library(readr)

    After this, all the functions from these packages are accessible. The downside is namespace conflicts: if two loaded packages have a function with the same name, the one loaded last will mask the other. R will print a message when a function is masked. For instance, after library(dplyr), you might see: “The following object is masked from ‘package:stats’: filter”. This means dplyr’s filter() is taking precedence over the default stats::filter() function. If you did need the other one, you would still have to use stats::filter() explicitly. Generally, this is not a big problem if you’re aware of it, and the tidyverse team tries to avoid name collisions in common usage.

Both approaches are valid. In this chapter (and textbook), we sometimes use library() at the start of a code block to assume that the reader has attached the necessary packages. In other places, we’ll use package::function for clarity. For reproducibility in research code (especially if sharing code with colleagues or students), it’s good practice to either include library() calls at the top or use the double-colon syntax to avoid ambiguity.
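
The two styles look like this in practice (a brief illustration; the CSV path is the hypothetical file used later in this chapter):

# Explicit namespace: no library() call needed, and no ambiguity about the source
gdpData <- readr::read_csv("data/chapter6_data.csv")

# Attaching the package: short names, but watch for masking messages
library(dplyr)
filter(mtcars, cyl == 6)           # dplyr's filter(), which masks stats::filter()
stats::filter(1:10, rep(1/3, 3))   # the masked stats version is still reachable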

The Pipe Operator (%>% and |>)

One of the most celebrated features of the tidyverse approach is the pipe operator. Introduced by the magrittr package and now available as a native R feature, the pipe lets you express a sequence of data transformations in a clear, linear fashion (Bache & Wickham, 2014). The original pipe is written as %>%. Starting with R 4.1, a built-in pipe |> was added to base R with similar functionality. We’ll use the %>% here as it’s widely seen in many tutorials and older code, but it’s good to know |> can be used in modern R as well.

The concept of the pipe is to take the output of one function and feed it as the input to the next function, chaining operations together. Instead of nesting function calls inside one another (which can become hard to read), the pipe allows you to write code that reads top-down, left-to-right, like a recipe for data manipulation.

In essence: X %>% f() means “take X and apply function f to it”. This is equivalent to f(X). When you string multiple pipes, X %>% f() %>% g() means g(f(X)) – the result of f(X) is passed to g(). You can add additional arguments to the functions as needed.

To illustrate, consider a simple numeric vector and some operations:

# Without pipes:
result <- round(mean(sin(my_vector)), 2)

# With pipes:
result <- my_vector %>% 
  sin() %>% 
  mean() %>% 
  round(2)

Both do the same thing: take my_vector, compute sin() of each element, then compute the mean(), then round to 2 decimal places. But the piped version reads like a clear sequence: “start with my_vector, take the sine, then the mean, then round to 2 digits.” This readability is a major advantage, especially as your sequence of data transformations grows longer.

Some details on using the pipe:

  • First-argument piping: By default, %>% takes the output on its left and inserts it as the first argument of the function on its right. For instance, Data %>% head(10) is equivalent to head(Data, 10) – the data becomes the first argument of head(). This works seamlessly for many dplyr functions because they expect the data frame as their first argument (e.g., filter(data, condition), summarize(data, ...), etc.).

  • Using the dot (.) placeholder: If you need the piped value to go somewhere other than the first argument, or multiple times, you use . as a placeholder inside the function call on the right. For example, my_vector %>% quantile(probs = c(0.25, 0.75), na.rm = TRUE) needs no placeholder, because quantile() expects the data as its first argument and the pipe inserts my_vector there automatically. But if you wanted to subtract the mean from your vector and then take the sum, you might do:

    my_vector %>% { . - mean(.) } %>% sum()

    Here, within the braces, . represents the piped value (the original vector), and we use it twice: once to compute mean(.) and once to subtract that from the vector itself. The result of that expression is then piped to sum(). This is a more advanced use of the pipe, but shows its flexibility. In most cases, you won’t need braces or multiple placeholders; simple left-to-right piping is enough.

  • Clarity and style: The pipe encourages a style where each step of the transformation is on a new line, often with a comment. This makes the code self-documenting to a degree. For example:

    cleaned_data <- raw_data %>%
      filter(!is.na(Sales)) %>%              # 1. remove observations with missing sales
      mutate(LogSales = log(Sales)) %>%      # 2. add a new column with log-transformed sales
      group_by(Country) %>%                 # 3. group by country
      summarize(AvgLogSales = mean(LogSales))  # 4. compute average log sales per country

    Anyone reading this can follow what’s happening at each step. This is much easier to read than a nested expression or a long sequence of intermediate variables like step1 <- ...; step2 <- ... (though using intermediate variables is also fine if it helps clarity – it’s mostly a matter of preference and complexity).

Numerous data science workflows in R rely on pipes to string together dplyr and tidyr functions. In fact, a study of data analysis notebooks found that the use of chaining with pipes is associated with more reproducible and readable code (Bryan et al., 2019). As a beginner, it might take a little practice to get used to reading and writing %>%, but soon it will become second nature to think “and then I do this… and then I do that…” with each step separated by a pipe.

A quick illustration: Suppose we have the built-in mtcars dataset (data about 1974 car models) and we want to find the top 5 cars by fuel efficiency (miles per gallon) among those with 6 cylinders. Without pipes, we might do:

top5 <- head( arrange( filter(mtcars, cyl == 6), desc(mpg) ), 5 )

This is hard to parse because you must start from the inside: filter, then arrange, then head. With pipes:

top5 <- mtcars %>%
  filter(cyl == 6) %>%       # keep only cars with 6 cylinders
  arrange(desc(mpg)) %>%     # sort by mpg in descending order
  head(5)                    # take the first 5 rows of the sorted data

This accomplishes the same task, but reading it is straightforward: filter, then arrange, then slice the first 5. Many R users find that using pipes makes their code closer to how one would describe the task in words, which is particularly helpful when collaborating or revisiting your own code after some time.

Finally, note that the base R pipe |> works similarly in most situations, but its placeholder is more restricted than the %>% dot. With %>% you can write x %>% f(y, .) to pipe x into the second argument of f(); the base pipe only gained a placeholder (_) in R 4.2, and it must be supplied to a named argument, e.g. x |> f(y = _). The base pipe may become more standard over time, but the %>% pipe from magrittr/tidyverse is still more prevalent in documentation and older code, so it’s important to recognize and be comfortable with both.
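
A brief side-by-side, assuming R 4.2 or later for the base pipe placeholder:

library(dplyr)   # provides %>% (magrittr would also work)

# magrittr pipe: the dot can be placed in any argument position
mtcars %>% lm(mpg ~ wt, data = .)

# base pipe: the underscore placeholder must be passed to a named argument
mtcars |> lm(mpg ~ wt, data = _)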

R Terminology and Key Concepts

Before we move on to specific data manipulation tasks, let’s clarify a few key concepts and terms that we will use throughout this chapter (and the book). These include the idea of tidy data and different types of data structures like time series and panel data, which often arise in international business statistics.

Tidy Data

In R’s data science ecosystem, the term tidy data has a very specific meaning. Tidy data is a concept introduced by Hadley Wickham to describe a standard way of structuring a dataset for analysis (Wickham, 2014). According to the tidy data principles, a dataset is tidy if:

  1. Each variable is a column. (Each column holds exactly one type of information/measurement across all observations.)
  2. Each observation is a row. (Each row corresponds to one entity at one point in time or one instance, depending on context.)
  3. Each value is a single cell. (There are no merged cells or multiple pieces of information crammed in one cell.)

This might sound obvious, but many datasets in the wild are not tidy. For example, a typical Excel report might list countries as rows and have separate columns for each year of data; in such a layout, “Year” is not a dedicated column but embedded in the column names. That’s a wide format. A tidy (long format) version of the same data would have a column “Year” and a column “GDP”, and each year’s GDP is a separate row entry for the country. While the wide format may be convenient for presentation or certain kinds of quick comparisons, the long (tidy) format is usually much better for analysis and for using R’s tools (Wickham, 2014). In fact, many of R’s powerful packages (dplyr, ggplot2, etc.) are designed assuming data is in tidy format. For instance, ggplot2 expects data in long form: if you want to plot GDP over time for multiple countries, your data frame should have a Year column and a GDP column (with multiple rows per country), rather than having separate GDP columns for each year.

Why insist on tidy data? Because a consistent structure means you can apply a common set of operations to many different datasets. Wickham demonstrated that by tidying data, a wide range of messy real-world data problems can be handled with a small set of tools (like the ones in tidyr and dplyr) instead of custom processing for each new dataset (Wickham, 2014). Tidy data is to data analysis what normalization is to relational databases – a design that minimizes redundancy and maximizes clarity. In an academic context, using tidy data formats makes your analysis more transparent and easier to extend. For example, if your dataset is tidy, adding a new year of data is as simple as appending more rows, rather than adding a new column and then modifying a bunch of code to accommodate that new column.

Let’s cement this with a concrete example. Suppose we have GDP data for three countries (Canada, France, Japan) for the years 2018 and 2019:

  • Untidy (wide format):
Country   GDP_2018   GDP_2019
Canada    1.71e+12   1.74e+12
France    2.78e+12   2.71e+12
Japan     5.00e+12   5.08e+12

Here, GDP for each year is in a separate column. There’s no single “Year” column. The variable “GDP” has been spread across two columns.

  • Tidy (long format):
Country   Year   GDP
Canada    2018   1.71e+12
Canada    2019   1.74e+12
France    2018   2.78e+12
France    2019   2.71e+12
Japan     2018   5.00e+12
Japan     2019   5.08e+12

Now we have a column Year and a column GDP. Each row is a single country-year observation. The data is redundant in the sense that “Canada” is written twice, but it’s consistent: every row has the same structure. In this form, if we want to filter for 2019 data only, we just filter Year == 2019. If we want to compute the GDP growth from 2018 to 2019 for each country, we can group by Country and do a calculation with those rows. In the wide format, those tasks would require handling column names or manual calculations.
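
As a small sketch of these operations on the tidy version (entered here by hand with the same illustrative values):

library(dplyr)

gdp_tidy <- tibble::tribble(
  ~Country, ~Year, ~GDP,
  "Canada", 2018, 1.71e12,
  "Canada", 2019, 1.74e12,
  "France", 2018, 2.78e12,
  "France", 2019, 2.71e12,
  "Japan",  2018, 5.00e12,
  "Japan",  2019, 5.08e12
)

gdp_tidy %>% filter(Year == 2019)   # keep only the 2019 observations

gdp_tidy %>%                        # GDP growth from 2018 to 2019, per country
  arrange(Country, Year) %>%
  group_by(Country) %>%
  summarize(Growth = (last(GDP) - first(GDP)) / first(GDP))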

Throughout this chapter, whenever we encounter data that isn’t tidy, we will convert it to tidy form using R tools (you’ll learn how in the Reshaping Data section). As a practitioner, whenever data is provided to you (especially from various international databases), think about how to tidy it for analysis. For instance, many macroeconomic datasets come in Excel with separate sheets or columns for years; your job is to turn them into a standardized structure that you can then join, filter, and analyze systematically.

Data Frames vs. Matrices vs. Lists (and why it matters)

We’ve talked about data frames (and tibbles) as the go-to structure for data. For completeness, note that R also has matrices (which are 2D structures but only one type of data, like all numeric) and lists (which are collections of objects that can be of any type or size, essentially Python-like lists). A data frame is actually a special kind of list: it’s a list where each element is a column (vector) of equal length. Most of the time, you’ll stick to data frames for data wrangling. However, knowing that a data frame is a list of columns can clarify why certain things happen (for instance, why adding a column via assignment works the way it does).

Also, R has factors, which are a special type for categorical data. In older R workflows (and base R functions), character data often got loaded as factors. A factor looks like a vector of integers internally, with an associated set of “levels” (the categories). They are useful for statistical modeling or to ensure only predefined values are used. In our international business data context, you might use a factor for something like Region (to have a fixed list of regions) or IncomeLevel (low, middle, high). In the tidyverse approach, strings are left as strings unless you explicitly convert to factor, giving you more control. We won’t delve deeply into factors here, but just be aware if you see Factor w/ 4 levels in a str() output, it means that column is stored as a factor. You can convert factors to characters with as.character() if needed.
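
A quick illustration of these points (the values are invented):

df <- data.frame(Country = c("Canada", "France"), GDP = c(1.64e12, 2.70e12))
is.list(df)           # TRUE: a data frame is a list of equal-length column vectors
df[["Country"]]       # extract one column, list-style

region <- factor(c("Americas", "Europe", "Europe"))
levels(region)        # the fixed set of categories
as.character(region)  # convert back to plain character strings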

Time Series, Cross-Sectional, and Panel Data

Many datasets in international business and economics have a temporal dimension and an entity dimension. It’s useful to classify data by these dimensions:

  • Time Series Data: A time series is data for a single entity (a country, a company, etc.) over time, typically with regular intervals (years, quarters, months). For example, the yearly GDP of Japan from 1990 to 2020 is a time series (with 31 observations). If you have just one column of GDP and one column of Year, for a single country, that’s a time series. Time series data is often analyzed with techniques like trend analysis, time series modeling (ARIMA, etc.), and visualized with line charts.
  • Cross-Sectional Data: A cross-section is data of multiple entities at a single point in time. For example, GDP of all countries in the year 2020 is cross-sectional – you have many countries, but all for the same year. In international business, an example cross-section could be a survey of many firms in a single year, or a comparison of GDP across countries for one specific year. Cross-sectional analysis looks at variation across entities, not over time (since time is fixed).
  • Panel Data (Longitudinal Data): Panel data combines both – it’s multiple entities, each observed at multiple time periods. For instance, GDP of 100 countries from 1990 to 2020 is panel data (100 entities × 31 time points = 3100 observations). Panel data is extremely common in international business research because we often look at how entities evolve over time and how that varies across entities. For example, one might study how different regulatory environments (country-level factor) affect firm performance over time by analyzing a panel of firms across several countries (e.g., firm-year panel data). Panel data is also called longitudinal or cross-sectional time series. In economics, classic examples include Penn World Table data (Feenstra, Inklaar & Timmer, 2015) which gives multiple indicators for many countries over many years, or firm-level financial panels like Compustat data over years.

Why distinguish these types? Because certain data manipulations and analyses are specific to one structure or another. For instance, lagging a variable (using the previous time period’s value) only makes sense if you have time series or panel data and your data is sorted by time within each entity. If your data are panel, you might need to group by entity and arrange by year before creating lagged variables. Similarly, calculating growth rates or changes year-over-year is a panel/time-series operation. On the other hand, if you have a pure cross-section, you might be more interested in cross-sectional comparisons (like rankings, percentiles, correlations across entities at the same time).
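
For example, creating a lagged value and a year-over-year growth rate might look like the following sketch, assuming a hypothetical long-format panel gdp_long with columns Country, Year, and GDP (the kind of data frame we will create in the reshaping section):

library(dplyr)

gdp_growth <- gdp_long %>%
  group_by(Country) %>%                  # work within each country's time series
  arrange(Year, .by_group = TRUE) %>%    # make sure years are in order within each country
  mutate(GDP_lag = lag(GDP),             # previous year's GDP
         Growth  = (GDP - GDP_lag) / GDP_lag) %>%
  ungroup()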

In terms of data structure in R, any of these can be stored in a data frame. A panel dataset in tidy long format would typically have columns for the entity (e.g., Country or Firm), time (Year), and then various measured variables (GDP, population, etc. for that country-year). The uniqueness of a row might be a composite key: (Country, Year) together define an observation. Recognizing that structure in your data helps in operations like merging (we might merge on both country and year, for example, to combine two panel datasets), filtering (we might filter for Year >= 2010 for a recent subset), and grouping (we might group by Country to do operations within each country’s time series).

Example (International Business context): Suppose we have a dataset of multinational enterprises (MNEs) with their number of foreign subsidiaries over the years 2010-2020, and another dataset of country-level indices like ease of doing business or corruption index over the same period. The firm data is panel (firm-year panel), the country indices data is also panel (country-year). If we want to analyze how country conditions affect firms’ expansion, we might merge the two datasets on country and year, resulting in a combined panel where each row is a firm-country-year (if firms are tied to a home country, say). Even this simple scenario shows how panel structures drive the need for merging and for careful identification of observations (here the “entity” might actually be hierarchical: firm within country).
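
Merging is covered in detail later in the chapter, but as a preview, joining two panels on a composite key might look like this (firm_panel and country_indices are hypothetical names for the data frames described above):

library(dplyr)

combined <- firm_panel %>%
  left_join(country_indices, by = c("Country", "Year"))   # match on country AND year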

Understanding your data structure guides your wrangling: e.g., you wouldn’t use a time-series specific function on cross-sectional data. It also guides visualization; time series often use line plots, cross-sections might use bar plots or scatter plots, and panels might require facetting (small multiples) or more advanced techniques to display.

(For a visual sense: imagine three charts – one line chart showing GDP of one country over years (time series), one bar chart comparing GDP of several countries in one year (cross-section), and a grid of line charts each showing GDP for a different country (panel data visualization). In code, the panel data might be one data frame that you slice or group for each country’s line.)

Example Data Context: International Business

To connect these concepts, let’s briefly consider how an international business researcher might encounter these data types:

  • If you study a single country’s economic indicator over time (e.g., the trend of export volume of China over 20 years), you have a time series. You’d likely tidy that into two columns: Year and ExportVolume.
  • If you compare many countries’ export volumes but only for the year 2020, that’s a cross-sectional snapshot. You’d have Country and ExportVolume columns, one row per country (Year is constant so you might even omit it).
  • If you examine export volumes for many countries over many years, that’s panel data (Country-Year as the combined key). You’d have columns: Country, Year, ExportVolume (and potentially other variables like GDP, exchange rate, etc., also varying by country-year).

Each scenario might require different wrangling steps (e.g., calculating growth rate makes sense for time series or panel, not for a single cross-section). We will mainly work with panel-like data in this chapter (multiple countries over multiple years), since that’s very common in international business statistics. The principles you learn, however, apply to time series or cross-sections when those are relevant.

6.4 Data Wrangling 1/4: Importing Data (Practice)

Now that we have covered how to import data in theory, let’s actually practice it with our example datasets. We set up earlier that we have a CSV file (country GDP data over years) and a Google Sheet (country locations). Let’s walk through importing these in our R session and ensure everything is ready for subsequent transformations.

1. Import the CSV file (Country GDP data): The file is chapter6_data.csv, located in chapter6/data/chapter6_data.csv within our project. It contains GDP values for a set of countries across several years (we’ll assume it covers, say, 2010–2020 for a dozen countries, for illustration). We use read_csv to load it:

library(readr)  # load the readr package for CSV reading

gdp <- read_csv("chapter6/data/chapter6_data.csv")

After running this, check the result:

  • dim(gdp) might output something like 12 13 (if 12 countries and 13 columns: Country, Code, and 11 years of data).
  • colnames(gdp) would list the columns. We expect something like: Country, Code, 2010, 2011, …, 2020. If the CSV had a header row with year names, those will appear as column names (note: column names are always character strings in R, so a name like “2010” is text even though it looks numeric; when referring to such a column in code, wrap it in backticks, e.g. `2010`).
  • head(gdp) will show the first few rows. We might see the country names and some GDP figures. For example, a row could be Canada, CAN, 1610, 1640, ... (numbers in whatever units, maybe billions of USD).

One thing to look out for: sometimes country names or codes might have extra whitespace or odd characters from the file. We should ensure consistency because when we later join with the locations data, the country names need to match exactly. If we see, for example, “United States” (with an extra space) or “Korea, Rep.” vs “South Korea” mismatches, we’d need to clean those. For now, let’s assume the country names are standardized (perhaps using ISO country names or something).
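
If such mismatches do occur, a couple of light-touch cleaning steps usually suffice. A sketch, assuming the stringr package is available (the recoded name is just an example of the kind of fix you might need):

library(dplyr)
library(stringr)

gdp <- gdp %>%
  mutate(Country = str_trim(Country),                                # drop stray whitespace
         Country = recode(Country, "Korea, Rep." = "South Korea"))   # harmonize one naming variant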

If an unnecessary index column appeared (as mentioned, like an X1 column of row numbers or IDs), we can remove it. For example, if colnames(gdp) showed an X1, we’d do gdp <- dplyr::select(gdp, -X1). In our case, let’s say it did not appear because the CSV was well-formed (no explicit row names).

So, gdp is now a data frame in R containing our country-year GDP data in a wide format (years as separate columns). We will soon reshape this.

2. Import the Google Sheet (Country locations data): We have the URL for the Google Sheet with country coordinates (and possibly region info). We’ll use gsheet2tbl as planned:

library(gsheet)  # load the gsheet package

locations <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1nehKEBKTQx11LZuo5ZJFKTVS0p5y1ysMPSOSX_m8dS8/edit?usp=sharing")

After this, we should inspect locations:

  • str(locations) might show that each column is of type character, which is common for Google Sheets import. Let’s say it shows Country, Latitude, Longitude, Region all as chr (character). If Latitude/Longitude are numeric values stored as text, we’ll convert them:

    locations$Latitude <- as.numeric(locations$Latitude)
    locations$Longitude <- as.numeric(locations$Longitude)

    Now str(locations) should show Latitude and Longitude as numeric (num).

  • We might also check how many rows locations has compared to gdp. Ideally, it should have the same number of countries (assuming our locations sheet included all countries present in the GDP data). If nrow(locations) is slightly different, maybe the sheet had some extra countries or missing one. That could be addressed by filtering or updating the sheet. For our example, assume it matches.
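
A quick way to check for mismatched country names between the two tables (a sketch using base R’s setdiff() and dplyr’s anti_join()):

setdiff(gdp$Country, locations$Country)   # countries in gdp but missing from locations
setdiff(locations$Country, gdp$Country)   # countries in locations but not in gdp

dplyr::anti_join(gdp, locations, by = "Country")   # rows of gdp with no match in locations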

A quick example of what locations might contain:

Country       Latitude    Longitude    Region
Canada        56.1304     -106.3468    North America
France        46.2276       2.2137     Europe
Japan         36.2048     138.2529     Asia
...           ...          ...         ...

Now we have:

  • gdp: country GDP data (wide format, numeric columns for each year).
  • locations: country geographic (and possibly regional) data.

Having both in R, we are ready to perform transformations. Often, the next step is to clean or select the parts of the data we need and then merge them. We know we eventually want to join these two data sets (to combine GDP with location by country), but before merging, it might be wise to reshape gdp into a tidy format.

6.5 Data Wrangling 2/4: Working with Variables and Observations

Data wrangling involves a lot of operations that add, remove, or modify variables (columns) and observations (rows). We will now walk through some of the fundamental operations using our example data. These operations are typically done using functions from dplyr, which is the tidyverse package dedicated to data frame manipulation (Wickham et al., 2023). The core dplyr functions are often referred to as “verbs” because they correspond to actions you want to perform on your data:

  • select() – choose or drop columns.
  • filter() – choose rows based on conditions.
  • mutate() – add new columns or modify existing ones.
  • arrange() – reorder rows.
  • rename() – rename columns.
  • summarize() – aggregate/summarize data (usually with grouping).

For now, we’ll focus on select, filter, mutate (for adding columns), arrange, and rename, which cover our needs.

Adding or Modifying Columns (Mutate)

Adding new variables: Often, you will need to create new variables from existing ones. This could be a simple arithmetic transformation (e.g., converting GDP from millions to billions, or creating a growth rate), a conditional flag, or a combination of multiple columns.

In dplyr, the function mutate() is used to add new columns. You provide mutate() with a data frame and one or more expressions of the form newColumnName = some calculation. For example:

Let’s add a column to our gdp data frame to mark whether each country is part of the G7 (Group of Seven major economies). We have a vector of G7 country names to compare against:

library(dplyr)  # ensure dplyr is loaded for mutate

g7_countries <- c("Canada", "France", "Germany", "Italy", "Japan", "United Kingdom", "United States")

gdp <- gdp %>%
  mutate(G7 = if_else(Country %in% g7_countries, TRUE, FALSE))

Here we used the pipe to take gdp and then add a new column G7. The expression uses if_else() (from dplyr) which is similar to the base R ifelse() but type-safe. It checks the condition Country %in% g7_countries. The %in% operator returns a logical vector of the same length as Country, with TRUE if the country is found in the g7_countries vector. So, for each row (country) in gdp, if the Country is one of the G7, we assign TRUE, otherwise FALSE. The result is a new logical column G7 in the data frame. We could have also written this as mutate(G7 = Country %in% g7_countries) since %in% already yields TRUE/FALSE.

If we check the data, we’d see for example Canada, France, etc. have G7 = TRUE, whereas other countries like China or India (if present) would have G7 = FALSE. This kind of flag can be useful for filtering or grouping later (e.g., summarizing stats for G7 vs non-G7 countries).

Another example: adding a computed numeric variable. Suppose our GDP is in millions of USD and we want to add a column for GDP in billions of USD (just for readability). We could do:

gdp <- gdp %>%
  mutate(GDP_2020_billions = `2020` / 1000)

This creates a new column GDP_2020_billions by taking the 2020 GDP column and dividing by 1000 (since 1000 million = 1 billion). Note: because the column name “2020” starts with a number, we have to reference it with backticks in code (`2020`) to differentiate it from a number. After this, the new column appears at the end of the data frame by default. We can add multiple new columns in one mutate call if needed, separated by commas, and you can even use one new column in the creation of another (mutate computes them sequentially).

Modifying existing variables: Mutate can also be used to transform an existing column in place. If you name an existing column on the left side of =, it will overwrite that column. For instance, if we wanted to overwrite the GDP 2020 column to be in billions instead of millions:

gdp <- gdp %>%
  mutate(`2020` = `2020` / 1000)

This divides each value in the 2020 column by 1000, updating the column. One has to be cautious when overwriting data – you might lose the original values unless you kept a copy. It’s often safer to create a new column (as we did with GDP_2020_billions) unless you are sure you don’t need the original scale.

Another modification example: say the Country column has country names in uppercase and we prefer title case (First letter capitalized). We could do mutate(Country = str_to_title(Country)) if we had loaded the stringr package (or use tools::toTitleCase). These kinds of text manipulations are part of cleaning. In an academic dataset context, you might unify naming conventions by such transformations (for example, ensuring “Viet Nam” and “Vietnam” are made consistent, etc., possibly with conditional recoding rather than just case change).

Renaming columns: A related operation is renaming. If the original data has non-informative or inconvenient column names (like “Code” which might be ambiguous), we can rename them. Dplyr has a function rename(), or we could use rename_with() to apply a function to many names, but the simplest is:

gdp <- gdp %>%
  rename(CountryCode = Code)

This will change the column name Code to CountryCode. The syntax inside rename is newName = oldName. After this, our gdp has columns Country, CountryCode, 2010, 2011, …, 2020, G7 (and possibly GDP_2020_billions if we added that earlier). Clear column names are important for clarity – for example, if we had a column named “X1” or “Value”, renaming it to something like “GDP” or “Population” (whatever it actually is) will save confusion later.

Removing columns: Sometimes data comes with extra variables we don’t need. For instance, maybe the dataset had a column for “Indicator Code” that’s just a repeat or an ID we won’t use. Removing it can declutter the data frame. We already saw one way: select(gdp, -UnneededColumn) returns the data without that column. Another way is gdp <- gdp %>% select(-UnneededColumn). Or in base R, gdp$UnneededColumn <- NULL will drop it. It’s good practice to drop bulky columns that you won’t use (like free-form text notes) to keep the data frame lean, but keep anything you might analyze. In our case, we probably keep Country and CountryCode and all year columns; we might drop nothing. But if there was, say, a column “Series Name” that is just “GDP (current US$)” repeated in every row (some data sources include such descriptor columns), that could be removed.

Filtering Rows (Subsetting Observations)

After structuring the columns, we often want to focus on a subset of observations. Filtering means selecting only those rows that meet certain criteria. For example, we might want to analyze only a particular region or only years after 2015, etc. In R, the dplyr function for this is filter().

Let’s do some filtering with our data:

Suppose we want to zoom in on a few specific countries of interest, say the United States, Canada, France, Japan, and Belgium. These might be a mix of large economies and a smaller one (Belgium) for contrast. We can filter the gdp data frame to include only those countries:

gdp_subset <- gdp %>%
  filter(Country %in% c("United States", "Canada", "France", "Japan", "Belgium"))

Now, gdp_subset will contain only the rows where the Country is one of the five listed. We used %in% again, this time inside filter to create a condition. We could equivalently write:

filter(Country == "United States" | Country == "Canada" | Country == "France" | Country == "Japan" | Country == "Belgium")

but %in% is much shorter and clearer for checking membership in a set.

If gdp originally had 12 countries, gdp_subset now has 5 rows (each row being a country in wide format). All columns remain the same.

Another common filtering task is based on numeric values. For example, maybe we only want countries with a GDP in 2020 above a certain threshold. If GDP is measured in millions, above 1e6 would mean > $1 trillion (since 1,000,000 million = 1 trillion). We could do:

gdp_trillion_2020 <- gdp %>%
  filter(`2020` > 1000000)

This will keep only those rows (countries) whose 2020 GDP column is greater than 1000000 (i.e., > $1 trillion, if units are millions). We would expect major economies like the US, China, Japan, Germany, etc., to pass this filter, whereas smaller economies would be filtered out.

Remember, in wide format, filtering on a year column means we are filtering on a specific snapshot. In long format (after we reshape, coming up next), we might filter by Year as a column value instead. But since currently gdp is wide, to filter by year we refer to the specific column for that year.

Multiple conditions can be applied together. If we wanted, say, countries with GDP > $1T and population over 50 million (assuming we had a Population column measured in millions), we could use a comma or & inside filter:

filter(`2020` > 1000000, Population > 50)

This implies an AND (both conditions must hold). If we wanted OR (either condition), we’d use | inside one filter call:

filter(`2020` > 1000000 | Population > 50)

However, careful with OR: it would include cases that meet one or the other (so a populous country with smaller GDP might slip in, and a small population country with big GDP also would).

Filtering in panel data context: If our data were in long format, we might filter by year or country easily. For example, after reshaping (which we will do soon), we could do:

gdp_long %>% filter(Year == 2020)

to get all countries for 2020, equivalent to the cross-section of 2020. Or filter(Country == "Canada", Year > 2009) to get Canada’s data for the last decade. In wide format, to get 2020 only, we might instead select the 2020 column (and perhaps Country) rather than filter (since year isn’t a row variable yet). We’ll see after pivoting how filter becomes even more useful.

For now, think of filter as “how do I subset my dataset to only the cases I care about?” In an international business research scenario, this could be:

  • Filtering a firm-level dataset to a specific industry.
  • Filtering country data to a particular region or income group.
  • Filtering time series to a window of years (e.g., post-2000 data only).
  • Removing observations with missing or anomalous values (e.g., filter out rows where a key variable is NA or zero).

Sorting Rows (Arranging)

Sorting, or ordering the rows of a data frame, can be useful for inspections, making reports, or just to see the highest or lowest values. In dplyr, arrange() is the function that sorts the data.

By default, arrange(data, var1, var2, ...) will sort by var1 ascending (smallest to largest). If there are ties, it will use var2 to break ties, and so on. To sort in descending order, you wrap the variable name in desc().

Let’s say after filtering to our five countries in gdp_subset, we want to sort them by 2020 GDP from largest to smallest:

gdp_subset_sorted <- gdp_subset %>%
  arrange(desc(`2020`))

This will reorder the rows of gdp_subset so that the country with the highest 2020 GDP comes first. For our five countries (United States, Canada, France, Japan, Belgium), the USA would likely come first, having the largest GDP among them, followed by Japan, France, Canada, and Belgium. If two countries had exactly the same 2020 GDP (highly unlikely in such macro data), their relative order would follow the original row order unless we added another sort key. We could specify a secondary sort, for example:

arrange(desc(`2020`), Country)

which would sort by 2020 descending, and within equal 2020 values, sort alphabetically by Country. In practice, for numeric data like GDP, ties at such precision are rare.

If we wanted ascending order (smallest GDP first), we could write arrange(`2020`) (with backticks around the column name, and without desc()). That would put Belgium first in our subset and the USA last.

Another example: If our data is in long format with Year as a column and we want to sort by Year then GDP, we would do:

gdp_long %>% arrange(Year, desc(GDP))

This sorts the entire dataset first by year (earliest to latest), and within each year, by GDP highest to lowest. The result would start at 2010 with the largest GDP in 2010 at the top, down to smallest in 2010, then move to 2011, etc. This could be useful if you want to see, for each year, a ranking of countries – although more typically one might group and slice for that task (to get top N each year, for example).
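
For completeness, here is a minimal sketch of that group-and-slice approach, using dplyr’s slice_max() (available since dplyr 1.0.0) on the gdp_long data described above:

# Top 3 countries by GDP within each year
gdp_long %>%
  group_by(Year) %>%
  slice_max(GDP, n = 3) %>%   # keep the 3 largest GDP values per year
  ungroup() %>%
  arrange(Year, desc(GDP))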

Why sorting matters: While sorting doesn’t change any values or which rows are present, it’s helpful for quickly answering questions like “which are the top 10 countries by X?” or producing a sorted table for a report. In data frames, row order usually doesn’t affect computation (unless you’re specifically doing rank-based operations or lag/lead, which depend on order), but it affects how you view the data. A common pattern is to filter or compute something and then arrange the results to find extremes.

For example, you might filter your panel data to Year == 2020, then arrange by GDP descending to get a ranked list of countries by GDP in 2020. Or if you had firm profitability data, you might arrange by profit to see the most profitable vs. least profitable firms.

Example: Combining Filtering and Sorting

Let’s continue with our running example to illustrate a typical use case:

Suppose we are interested in examining recent GDP data (2015–2020) for a handful of countries and then comparing their 2020 GDP. We already filtered down to five countries and have data for all years (assuming our gdp originally had data from 2010 to 2020).

We can further select only the columns we need (say Country, CountryCode, 2015–2020) and then arrange by 2020. Doing this stepwise with pipes:

gdp_analysis <- gdp %>%
  filter(Country %in% c("United States", "Canada", "France", "Japan", "Belgium")) %>%
  select(Country, CountryCode, `2015`:`2020`) %>%    # select only country info and last 6 years of GDP
  arrange(desc(`2020`))

Let’s break that down:

  • We filter to the five countries of interest.
  • We select only the columns Country, CountryCode, and the range of columns from 2015 to 2020. (`2015`:`2020` is tidyselect syntax that grabs all columns between 2015 and 2020 inclusive, assuming they appear in that order in the data frame. This is a convenient way to select a sequence of year columns.)
  • We arrange the resulting data by 2020 in descending order.

After this, gdp_analysis is a tibble that might look like:

Country        CountryCode   2015   2016   2017   2018   2019       2020
United States  USA              …      …      …      …      …   21000000
Japan          JPN              …      …      …      …      …    5080000
France         FRA              …      …      …      …      …    2710000
Canada         CAN              …      …      …      …      …    1640000
Belgium        BEL              …      …      …      …      …     520000

(Note: The numbers are just illustrative and, as before, expressed in millions of USD; for instance, US GDP in 2020 was around $21 trillion, Japan ~$5 trillion, France ~$2.7 trillion, Canada ~$1.64 trillion, and Belgium ~$0.52 trillion, all in nominal USD.)

Now we have a nice subset of data ready to be reshaped or merged as needed. We could use this for a quick analysis: for example, compute growth rates from 2015 to 2020 for these countries by adding a new column, or plot their time series. But a more flexible approach is to reshape it into a long format, which we do next.
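
As a quick sketch of the first idea, a growth-rate column can be added with mutate() while the data are still in wide form (the column name Growth_2015_2020 is just an illustrative choice):

gdp_analysis <- gdp_analysis %>%
  mutate(Growth_2015_2020 = (`2020` - `2015`) / `2015` * 100)   # percentage change from 2015 to 2020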

6.6 Data Wrangling 3/4: Reshaping Data (Long vs. Wide Formats)

As discussed, tidy data often means converting data to a long format. Our gdp_analysis (and the larger gdp dataset) is in a classic wide format: one row per country, multiple columns for years. Many R functions and analyses prefer a long format. We will now use the tidyr package (part of the tidyverse) to reshape the data.

The main functions for reshaping are:

  • pivot_longer() – which takes multiple columns and pivots them into fewer columns (typically turning many wide columns into a pair of key/value columns, like Year and GDP).
  • pivot_wider() – which does the opposite, spreading a key/value pair of columns out into multiple columns.

In older R literature, you might see melt and cast (from the reshape2 package) or gather and spread (earlier tidyr versions). Those have been superseded by the more flexible pivot_longer and pivot_wider (Wickham & Henry, 2020).

Let’s pivot our gdp_analysis data (5 countries, 2015-2020) to long format. Currently, it has columns for 2015…2020. We want to turn those into two columns: one for Year and one for GDP value.

library(tidyr)  # load tidyr for pivot_longer

gdp_long <- gdp_analysis %>%
  pivot_longer(
    cols = `2015`:`2020`,
    names_to = "Year",
    values_to = "GDP"
  )

We specify the range of columns to pivot (2015 through 2020) as cols. We tell it to create a new column called “Year” from the names of those columns, and a new column called “GDP” from their values. After this operation, gdp_long will have the columns Country, CountryCode, Year, and GDP (had we kept other columns, such as a G7 indicator, their values would simply be repeated for each year of the country). Each row of gdp_long is a single country-year observation.

What does it look like? It should have 5 countries × 6 years = 30 rows. For example, the first few rows might be:

Country        CountryCode   Year   GDP
United States  USA           2015   1.800e+13
United States  USA           2016   1.850e+13
United States  USA           2017   1.950e+13
United States  USA           2018   2.050e+13
United States  USA           2019   2.147e+13
United States  USA           2020   2.100e+13
Japan          JPN           2015   4.380e+12
Japan          JPN           2016   4.500e+12

(Again, the numbers are approximate and here shown in raw US dollars rather than millions; “2.147e+13” is scientific notation for 2.147×10^13, i.e., $21.47 trillion. R may display large numbers in scientific notation by default.)

A few things to note about pivot_longer:

  • The original columns 2015-2020 are removed, and we got two new columns (Year, GDP). By default, the Year column will be of type character, because the original column names were character strings “2015”. We might want Year as numeric. We can convert it after pivoting: gdp_long$Year <- as.integer(gdp_long$Year) or integrate that into the pipe as another step: mutate(Year = as.integer(Year)).
  • The Country and CountryCode columns were not listed in cols, so pivot_longer left them as-is and just repeated their values for each year. That’s what we want: those are identifier columns that remain associated with each observation.
  • If there were other data columns (say we also had a Population column for each year in the same dataset as wide columns), we might pivot them too, possibly in one go if they share the same shape. There are more advanced uses of pivot_longer for multiple measures, but in many cases it’s simplest to pivot each measure separately and then join, or pivot one measure at a time.
  • pivot_longer has arguments like names_prefix or names_sep for handling column name formatting. For example, if our columns were named “GDP_2015”, “GDP_2016”, etc., we could strip the “GDP_” prefix with names_prefix = "GDP_". In our case the names were just the years, so no such handling was needed; a sketch of the prefixed case follows this list.
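
As mentioned in the last point, here is a sketch of the prefixed-column case. The data frame gdp_wide and its columns GDP_2015 through GDP_2020 are hypothetical; the sketch also uses names_transform (tidyr 1.1.0 and later) to store Year as an integer in one step instead of converting it afterwards:

gdp_long2 <- gdp_wide %>%
  pivot_longer(
    cols = starts_with("GDP_"),                  # all columns named GDP_2015 ... GDP_2020
    names_to = "Year",
    names_prefix = "GDP_",                       # strip the "GDP_" prefix from the column names
    names_transform = list(Year = as.integer),   # make Year an integer rather than character
    values_to = "GDP"
  )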

Now that gdp_long is tidy, we can easily filter by Year or Country using the filter() function. For instance, to get just 2020 data:

gdp_2020 <- gdp_long %>% filter(Year == 2020)

This would yield 5 rows (one for each of our five countries, for the year 2020). In fact, that should match the data we had in the 2020 column originally.

Conversely, if we wanted just Canada’s time series (assuming Canada is in those five):

canada_ts <- gdp_long %>% filter(Country == "Canada")

This would give Canada’s GDP for 2015–2020 as 6 rows.

Converting back to wide (if needed): While analysis often prefers tidy data, sometimes you might need to go back to wide format. For example, maybe after some analysis you want to present a table of values, or your collaborator is expecting an Excel sheet with years in columns. In such cases, pivot_wider() is your friend. If we were to take gdp_long and pivot it wider to get the original format:

gdp_wide_again <- gdp_long %>%
  pivot_wider(
    names_from = Year,
    values_from = GDP
  )

Here Year can be character or numeric: if it is numeric, pivot_wider simply converts it to character to form the column names. What does matter is that each Country-Year combination appears only once; duplicates would prompt pivot_wider to warn and ask for a values_fn to aggregate them. Since each Country-Year is unique in gdp_long, this round trip works, giving back a data frame with Country, CountryCode, and columns 2015…2020.

Why bother reshaping? Because many operations are simpler in one format or the other. Long format is generally better for:

  • Plotting (ggplot2 works best with one observation per row).
  • Modeling (e.g., running a regression on panel data: you’d have a column for year and could include year effects or trends easily).
  • Calculating summary statistics grouped by a category (e.g., mean GDP by year, or by country, which is straightforward when each observation is a single year’s GDP).

Wide format can be convenient for:

  • Certain types of comparisons (like subtracting one year’s column from another without grouping).
  • Presenting data in spreadsheets or reports.
  • Running matrix operations or time-series specific functions that expect a matrix (though those are more advanced scenarios, typically finance or signal processing tasks).

In our workflow, we will likely keep data in long format for merging and analysis. Indeed, to combine our gdp_long with the locations data, long format is appropriate because locations has one row per country (with no year). We’ll need to merge country info onto each country-year observation.

Now that we have gdp_long tidy, let’s do that merge.

6.7 Data Wrangling 4/4: Merging Datasets (Joins)

It’s rare in international business research to have all the data you need in one table. More commonly, you’ll have to merge or join data from different sources – for example, adding country-level attributes to firm-level data, or adding exchange rates to a trade dataset. In our case, we have GDP data in one table and location/region data in another. We need to combine them.

In R, the dplyr package provides a family of join functions that mirror SQL joins: inner_join, left_join, right_join, full_join, as well as semi_join and anti_join for filtering by matches. The most commonly used are inner_join and left_join. An inner join keeps only the rows that have a match in both tables, potentially dropping some data if there are non-matches. A left join keeps all rows of the left table and brings in matching info from the right table (using NA for missing matches). Generally, if you have a primary dataset (the one you don’t want to lose observations from) and a secondary dataset that just adds more info, you use a left join.

In our example, gdp_long is the primary data (we want to keep all those country-year GDP entries), and locations is the supplementary info (with coordinates for each country). We should use a left join, keeping all GDP observations and adding latitude/longitude to each.

Our key for joining is the country name (column “Country” in both data frames). It’s important that they match exactly. We have already noted the need to ensure consistency (e.g., if one dataset said “United States” and another “United States of America”, that would be a problem to resolve beforehand by editing one of them or using a key like country code instead). Assuming they match, here’s the join:

merged_data <- gdp_long %>%
  left_join(locations, by = "Country")

We pipe gdp_long into left_join, and specify by = "Country" to indicate the common key. Since both have a column named “Country”, this is straightforward. If our locations used a different name (say “CountryName”), we’d do by = c("Country" = "CountryName").

After this:

  • merged_data will have all columns from gdp_long (Country, CountryCode, Year, GDP) plus the columns from locations (Latitude, Longitude, Region, etc.).
  • The number of rows in merged_data will be the same as gdp_long (30 rows in our subset example, or if we did the full data, it would be number of countries * number of years). Each country-year now has location info attached.
  • If a country in gdp_long wasn’t found in locations, the new columns would be NA for that row. Conversely, if locations had a country not present in gdp_long, that country simply doesn’t appear in the joined result (left join doesn’t add extra rows for unmatched right-side entries).
  • If there were any duplicate country names in locations (which there shouldn’t be, ideally each country once), the join would result in multiple matches and increase rows – but in a well-designed reference table, the key is unique.

We should verify with a quick check: merged_data %>% filter(is.na(Latitude)) should return 0 rows if all countries matched successfully. If not, it reveals which countries failed to match. For instance, if “Côte d’Ivoire” were spelled differently in one of the tables, it would show up here. Such mismatches need addressing, either by correcting the spelling or by joining on an alternate key like ISO country codes, if available.
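
A convenient way to list the offending names directly is anti_join(), one of the filtering joins mentioned earlier. This sketch returns the country names in gdp_long that have no match in locations:

gdp_long %>%
  anti_join(locations, by = "Country") %>%   # rows with no matching Country in locations
  distinct(Country)                          # one row per unmatched country name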

In our subset example, likely everything matched (the five example countries are straightforward). So merged_data might look like:

Country        CountryCode   Year   GDP        Latitude   Longitude   Region
United States  USA           2015   1.80e+13    37.0902   -95.7129    North America
United States  USA           2016   1.85e+13    37.0902   -95.7129    North America
…              …             …      …           …          …          …
Belgium        BEL           2020   5.20e+11    50.5039     4.4699    Europe

Notice how the location info repeats for each year of a given country, which is expected because that info is constant for the country. This is fine; redundancy in a tidy dataset (like country attributes repeated for each year) is normal and often necessary for analysis (so that each observation is self-contained, or easily grouped).

Using other joins: For completeness, if we had done an inner join instead (inner_join(gdp_long, locations, by = "Country")), we would drop any country-year without a location match; in our case that likely drops nothing unless a country is missing from the locations table. A full join (full_join) would keep all countries from both tables, possibly adding rows for countries that have location data but no GDP data, or vice versa (in a panel context, that could mean NA GDP values for a country not originally in the GDP table). Full joins are useful when each side may contain entries the other lacks, but in our scenario the left join is the safe choice.

There is also right_join(), which keeps all rows of the right table (here, all countries in the locations table, dropping GDP data for countries not listed there). In practice, left joins are used most often for adding reference information, and inner joins for combining two datasets when you only want their intersection (for example, combining two surveys and analyzing only respondents who appear in both).

Potential issues: When joining, one must ensure the keys are consistent and ideally unique on each side for a one-to-one or one-to-many merge. Our case is one-to-many (one country in locations to many country-year in GDP data). That’s fine. If, say, locations had duplicates (maybe two entries for “France” by accident), the join would create duplicate rows in the output for each matching pair (so France’s every year would appear twice). That’s a data quality issue to fix by removing duplicates before joining.

Another issue is case sensitivity: “china” vs “China” would not match. R’s joins are case-sensitive. So cleaning strings (to common case, or trimming whitespace, etc.) is often needed.
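
A light-touch cleaning step before joining might look like the following sketch; the specific mismatch being recoded (“United States of America”) is hypothetical:

locations <- locations %>%
  mutate(
    Country = trimws(Country),   # remove stray leading/trailing whitespace
    Country = recode(Country,
                     "United States of America" = "United States")   # harmonize a known naming difference
  )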

Merging by multiple keys: If we had a more complex key (e.g., firm and year), we’d do by = c("FirmID","Year") if both data frames share those column names, or specify the equivalent if named differently.

Finalizing the Example Workflow

At this point, we have our data combined in merged_data. For the five-country example, it has 30 rows and columns: Country, CountryCode, Year, GDP, Latitude, Longitude, Region. We can now use this in various ways:

  • Plotting: e.g., make a line plot of GDP over time for each country (with lines perhaps colored by Region or by G7 membership); a minimal sketch follows this list.
  • Mapping: using Latitude and Longitude, perhaps plot points or use a map package to show these countries on a world map, maybe with point size proportional to GDP.
  • Analysis: calculate which region has the highest average GDP, or see if there’s a correlation between latitude (a proxy for geography) and GDP, etc. (One might recall the latitude argument in economic development – the idea that geography influences development – though with 5 countries this is just an illustration.)
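
A minimal version of that line plot could be sketched as follows, assuming the ggplot2 package is installed (visualization is covered in more detail in later chapters):

library(ggplot2)

merged_data %>%
  mutate(Year = as.integer(Year)) %>%               # ensure Year is numeric for the x-axis
  ggplot(aes(x = Year, y = GDP, colour = Country)) +
  geom_line() +
  labs(x = "Year", y = "GDP", title = "GDP of selected countries, 2015-2020")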

To illustrate a quick analysis: We could use merged_data to compute the average GDP of these countries over 2015-2020:

merged_data %>%
  group_by(Country) %>%
  summarize(AverageGDP = mean(GDP))

This would yield a table of the five countries with their mean GDP over the period. Or if we group by Region instead:

merged_data %>%
  group_by(Region, Year) %>%
  summarize(TotalGDP = sum(GDP))

We’d get total GDP by region by year (though with 5 countries, “North America” region only has USA and Canada, “Europe” has France/Belgium, “Asia” has Japan – not a complete regional total, but an example of grouping).

The key takeaway is that after proper wrangling (tidying and joining), these kinds of questions become straightforward to answer with a few lines of dplyr code, leveraging the tidy data format. We can always pivot wider or output results to CSV/Excel if needed for reporting.

A note on concatenating data (row binding): While our example focused on joining columns from two data frames, another common task is stacking data frames vertically (adding more rows). For example, if you had GDP data in two files, one for 2010-2015 and one for 2016-2020, you might read both and then use bind_rows(gdp1, gdp2) to combine them into one longer dataset (ensuring they have the same columns). This isn’t a “join” by keys, but rather an append. It’s important to make sure columns align (same names and types) for this to work well. In international data, you sometimes get data split by region or year, and row-binding is necessary to put it together.
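
A sketch of that append pattern, using hypothetical file names:

library(readr)  # for read_csv()

gdp_2010_2015 <- read_csv("gdp_2010_2015.csv")   # hypothetical file covering 2010-2015
gdp_2016_2020 <- read_csv("gdp_2016_2020.csv")   # hypothetical file covering 2016-2020

gdp_all <- bind_rows(gdp_2010_2015, gdp_2016_2020)   # stack rows; columns are matched by name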

Now that we’ve mastered importing, filtering, transforming, reshaping, and merging, we have all the tools needed to prepare raw data for analysis. In practice, you might iterate through these steps multiple times: importing new data, cleaning it, merging with existing data, reshaping for a particular analysis, and so on. Good data management is an iterative and thoughtful process, often consuming a significant portion of the project timeline (Kandel et al., 2011). However, the payoff is that once the data is in a tidy, merged form, the analysis and visualization steps flow much more easily.

6.8 Conclusion

In this chapter, we covered the essentials of data management and wrangling in R, using examples grounded in international business and economics. To recap:

  • Setting up and Importing Data: We started by emphasizing reproducibility and project setup (using Git for version control, which aligns with the increasing push for transparency in research data (Aguinis et al., 2018)). We demonstrated how to import data from common sources like CSV files and Google Sheets. In international business, data often comes from disparate sources (Papadopoulos, 2012), so being able to load data in various formats is crucial.
  • Understanding Data Structures: We reviewed R data frames (and tibbles) and the concept of tidy data. A well-structured (tidy) dataset has each observation as a row and each variable as a column (Wickham, 2014). This structure is especially helpful when dealing with panel data typical in global statistics (Feenstra, Inklaar & Timmer, 2015).
  • Core Data Wrangling Operations: Using dplyr’s grammar, we practiced selecting relevant columns, filtering rows to focus on subsets (like particular countries or years), creating new variables (e.g., growth rates or categorical flags), and sorting data for interpretation. These operations let us clean and subset data to the exact slice needed for a given question. For example, one might filter a dataset to the post-2010 period to analyze the last decade of globalization effects, or create a new variable indicating whether a firm is state-owned or not, before proceeding with analysis.
  • Reshaping Data: We showed how to reshape data from wide to long format using pivot_longer(), turning a dataset with columns for each year into a tidy structure with one column for year and one for the value. This step is often necessary when preparing data for analysis or merging, as many analytical techniques (and ggplot2 for visualization) expect tidy data. We saw that wide-to-long conversion makes it easier to filter by time or perform calculations across years (like computing trends).
  • Merging Datasets: Perhaps most critically for international business research, we practiced merging datasets using joins. It is common to merge macro-level data (country attributes) with micro-level data (firm or transaction data), or to combine multiple indicators. We illustrated a left join to add location coordinates to country-year GDP data. The ability to join on keys allows you to enrich your dataset – for instance, adding World Bank indicators to your firm-level panel via country-year keys, or adding exchange rates to your trade flow data via date keys. We also noted the importance of consistent keys and careful matching to avoid mis-merging.

By walking through an example of GDP data and country info, we mirrored tasks a researcher might do, such as preparing a dataset of countries with both economic data and geographic or institutional data. In real studies, you might be merging many more variables (and possibly dealing with more complexity, like one-to-many merges or needing to aggregate before merging). The principles remain the same.

A few broader lessons from this chapter:

  • Plan and document your steps: Data management can get complex, so it’s wise to script your wrangling steps clearly (using comments and a logical flow, as the pipe encourages). This documentation not only helps others follow your process but also helps you remember your own logic. It contributes to reproducible research, where someone else (or you in the future) can re-run your analysis on the same raw data and get the same results (Peng, 2011).
  • Verify at each stage: We showed examples of checking data after each major operation (using head(), str(), filtering for NAs after a join, etc.). This iterative checking is crucial to ensure things went as expected. For instance, after reshaping, verify that the number of rows matches the product of original entities and years; after merging, check that you didn’t inadvertently drop observations.
  • Be mindful of data quality issues: Although our example data was relatively clean, real-world data often has inconsistencies (e.g., different naming conventions, missing values, outliers, etc.). You may need to incorporate steps to handle these, such as string cleaning, deduplicating entries, or replacing missing values with appropriate estimates. These steps go hand-in-hand with the techniques we discussed.
  • Use the right format for the task: We converted wide to long for analysis, but sometimes a wide format (or a summary table) might be needed for presentation. Don’t hesitate to reshape data as needed; tidyr makes it straightforward. Also, consider the structure required by any statistical models you plan to use – for example, some time series models might need a ts object (which can be created from a single series), whereas panel regression packages might require you to identify panel and time indices (with data in long format).
  • The importance of data management in IB research: It’s worth noting that as data sources proliferate (big data, unstructured data, etc.), researchers and analysts spend considerable effort on data wrangling. Studies have found that data scientists spend up to 80% of their time on data preparation (Kandel et al., 2011). In international business and economics, where one might merge data from the World Bank, IMF, UN, and firm-level databases, solid data management skills are indispensable. Proper data handling can also improve the credibility of research findings – for example, by enabling robustness checks with different data sources (Papadopoulos, 2012) or by facilitating replication (Aguinis, Cascio & Ramani, 2017).

Having prepared our dataset, we can now move to analysis. In subsequent chapters, we will use the wrangled data to perform exploratory data analysis, visualization, and statistical modeling. For example, we might explore correlations between GDP and other indicators, or fit a regression model to explain a business outcome with macroeconomic variables. Those analytical tasks will rely on the tidy, merged data we can now produce.

In conclusion, this chapter provided you with a toolkit for data wrangling in R: importing data, manipulating it with dplyr verbs, reshaping it with tidyr, and merging multiple sources. These are the building blocks for any empirical project in international business. With practice, you will be able to approach any new dataset methodically: import it, inspect it, clean it, reshape it to the form you need, and combine it with other data as required. Mastery of these skills not only saves time in the long run but also enhances the reliability and reproducibility of your analyses – traits highly valued in both business and academia (Aguinis et al., 2018).

As a next step, you might try applying these techniques to a real dataset of your choice – for instance, download a country-level dataset (like the World Bank’s World Development Indicators) and practice filtering for regions or years, or take two indicators (GDP and population) and merge them to compute GDP per capita. Such exercises will reinforce the concepts and prepare you for more complex data challenges ahead.
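
As a sketch of that last suggestion, assuming two hypothetical tidy tables gdp_long and pop_long, each with Country, Year, and a value column (GDP and Population, in consistent units):

gdp_per_capita <- gdp_long %>%
  inner_join(pop_long, by = c("Country", "Year")) %>%   # keep country-years present in both tables
  mutate(GDP_per_capita = GDP / Population)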

References

Aguinis, H., Cascio, W. F., & Ramani, R. S. (2017). Science’s reproducibility and replicability crisis: International business is not immune. Journal of International Business Studies, 48(6), 653–663. https://doi.org/10.1057/s41267-017-0081-0

Aguinis, H., Ramani, R. S., & Alabduljader, N. (2018). What you see is what you get? Enhancing methodological transparency in international business research. Journal of World Business, 53(5), 811–814. https://doi.org/10.1016/j.jwb.2018.06.003

Bache, S. M., & Wickham, H. (2014). magrittr: A forward‐pipe operator for R (R package version 1.5) [Computer software]. https://CRAN.R-project.org/package=magrittr

Bryan, J., Wickham, H., & Hester, J. (2019). Happy Git and GitHub for the useR (Version 0.1) [Online book]. https://happygitwithr.com/

Feenstra, R. C., Inklaar, R., & Timmer, M. P. (2015). The next generation of the Penn World Table. American Economic Review, 105(10), 3150–3182. https://doi.org/10.1257/aer.20130954

George, G., Haas, M. R., & Pentland, A. (2014). Big data and management: From the editors. Academy of Management Journal, 57(2), 321–326. https://doi.org/10.5465/amj.2014.4002

Kandel, S., Paepcke, A., Hellerstein, J. M., & Heer, J. (2011). Wrangler: Interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 3363–3372). ACM. https://doi.org/10.1145/1978942.1979444

Papadopoulos, N. (2012). International market selection and segmentation: Perspectives and challenges. International Marketing Review, 29(3), 325–350. https://doi.org/10.1108/02651331211242680

Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226–1227. https://doi.org/10.1126/science.1213847

Ram, K. (2013). Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine, 8, Article 7. https://doi.org/10.1186/1751-0473-8-7

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10

Wickham, H., & Henry, L. (2020). tidyr: Tidy messy data (R package version 1.1.2) [Computer software]. https://CRAN.R-project.org/package=tidyr

Wickham, H., & Hester, J. (2017). readr: Read rectangular text data (R package version 1.1.1) [Computer software]. https://CRAN.R-project.org/package=readr

Wickham, H., François, R., Henry, L., & Müller, K. (2019). tibble: Simple data frames (R package version 2.1.3) [Computer software]. https://CRAN.R-project.org/package=tibble

Wickham, H., François, R., Henry, L., & Müller, K. (2023). dplyr: A grammar of data manipulation (R package version 1.1.4) [Computer software]. https://CRAN.R-project.org/package=dplyr

Xie, Y. (2015). Dynamic documents with R and knitr (2nd ed.). CRC Press.