Chapter 9 Data Wrangling 4/4

9.1 Introduction

What makes R a compiling programming language is its facility to wrangle data on the fly. In this chapter, you will learn the basics of data manipulation. Based on the knowledge acquired in the previous chapters, you will transform datasets in order to prepare them for the chapter 10 which is all about data visualization!

We will use the United Nations Industrial Development Organization (UNIDO) dataset to illustrate this session.

But you can think about yourself as a hacker and merge a hacked Caisse Desjardins file with all the consumers’s accounts with the hacked social security numbers file from Equifax. You will see that you can do it with 2 lines of code. Once merged, this new dataset allows you to know a lot of things!

So, when you think you give some information to one website and some other information to another website, make no mistake, if both companies belong to the same holding or if they have a partnership of some sort, they can merge these two sources of information and suddenly know a lot about you, more than maybe you do about yourself! This is no science-fiction, it takes 2 lines of code. So, let’s do it!

At the end of the chapter, you should be able to:

  1. merge two datasets;

You will go from a database of 655’350 points to a graphic made of 6 observations.

9.2 Merging datasets

In this chapter, you will merge several datasets together. Let’s consider two datasets with the following settings:

First dataset, named dataCanada131, with:

  • the number of employees: tableCode == 4
  • in the textile sector: isicCode == 131
  • from Canada: countryCode == 124
  • after 2008: year > 2008

Second dataset, named dataCanada181, with:

  • the number of employees: tableCode == 4
  • in the printing sector: isicCode == 181
  • from Canada: countryCode == 124
  • after 2008: year > 2008

This is the code to obtain the two dataframes:

# Loading the gsheet package
library(gsheet)
# Read data from the URL
dataUnido <- gsheet::gsheet2tbl("https://docs.google.com/spreadsheets/d/1uLaXke-KPN28-ESPPoihk8TiXVWp5xuNGHW7w7yqLCc/edit#gid=416085055")
dataUnido

# Erasing non important variables
dataUnido$isicCodeCombinaison <- NULL
dataUnido$tableDefinitionCode <- NULL
dataUnido$sourceCode <- NULL
dataUnido$updateYear <- NULL
dataUnido$unit <- NULL

# Loading the dplyr package
library(tidyverse)

# Dataset for dataCanada
dataCanada131 <- filter(dataUnido, countryCode == 76)
dataCanada131 <- filter(dataCanada131, isicCode == 131)
dataCanada131 <- filter(dataCanada131, tableCode == 4)
dataCanada131 <- filter(dataCanada131, year > 2008)

dataCanada131 <- dataCanada131 %>%
  pivot_wider(names_from = isicCode, values_from = value)

head(dataCanada131)

# Dataset for dataCanada181
dataCanada181 <- filter(dataUnido, countryCode == 76)
dataCanada181 <- filter(dataCanada181, isicCode == 181)
dataCanada181 <- filter(dataCanada181, tableCode == 4)
dataCanada181 <- filter(dataCanada181, year > 2008)

dataCanada181 <- dataCanada181 %>%
  pivot_wider(names_from = isicCode, values_from = value)

head(dataCanada181)

In order to merge your datasets, you need to use the left_join() function from the dplyr package. The first 2 arguments are the name of the datasets, whereas the third argument is the named of the common columns (here year, tableCode and countryCode).

# Merging 2 datasets
dataCanadaFull <- left_join(dataCanada131, dataCanada181, c("year","tableCode","countryCode"))

# First 6 lines
head(dataCanadaFull)

Now that the dataset are merged, we need to put it in a wider format in order to visualize it.

# Transform dataCanadaFull in long data format
dataCanadaFullLong <- dataCanadaFull %>% 
  pivot_longer(!c(year, tableCode, countryCode), names_to = "isicCode", values_to = "value")

Your data are now prepared for the next lab about visualization! Let’s save the data in a csv format.

readr::write_csv(dataCanadaFullLong, "./data/dataCanadaFullLong.csv")

TL;DR

# Loading the gsheet package
library(gsheet)
# Read data from the URL
dataUnido <- gsheet::gsheet2tbl("https://docs.google.com/spreadsheets/d/1uLaXke-KPN28-ESPPoihk8TiXVWp5xuNGHW7w7yqLCc/edit#gid=416085055")
dataUnido

# Dataset for dataCanada
dataCanada131 <- filter(dataUnido, countryCode == 76)
dataCanada131 <- filter(dataCanada131, isicCode == 131)
dataCanada131 <- filter(dataCanada131, tableCode == 4)
dataCanada131 <- filter(dataCanada131, year > 2008)

dataCanada131 <- dataCanada131 %>%
  pivot_wider(names_from = isicCode, values_from = value)

head(dataCanada131)

# Dataset for dataCanada181
dataCanada181 <- filter(dataUnido, countryCode == 76)
dataCanada181 <- filter(dataCanada181, isicCode == 181)
dataCanada181 <- filter(dataCanada181, tableCode == 4)
dataCanada181 <- filter(dataCanada181, year > 2008)

dataCanada181 <- dataCanada181 %>%
  pivot_wider(names_from = isicCode, values_from = value)

head(dataCanada181)

# Merging 2 datasets
dataCanadaFull <- left_join(dataCanada131, dataCanada181, c("year","tableCode","countryCode"))

# First 6 lines
head(dataCanadaFull)

# Transform dataCanadaFull in long data format
dataCanadaFullLong <- dataCanadaFull %>% 
  pivot_longer(!c(year, tableCode, countryCode), names_to = "isicCode", values_to = "value")

readr::write_csv(dataCanadaFullLong, "./data/lab2/dataCanadaFullLong.csv")

Code learned in this chapter

Command Detail
gsheet2tbl() Download a table
filter() Find rows where conditions are true
pivot_wider() “Widens” data
head() Returns the first rows
left_join() Join two tables together
write_csv() Save a data frame to a csv file

Getting your hands dirty

It’s time to practice! This exercise begins in Chapter 6 and continues through Chapter 9. This exercise is therefore divided into 4 parts. For this exercise, you’ll work with a csv file available on Github in the chapter6 folder.

Before starting the last part of this exercise, let’s remember the first three parts:

  • Step 1 : Import via a csv

Import the csv file called chapter6data.csv.

library(readr)
gdp <- 
  • Step 2 : Import via a gsheet

Import a dataset containing longitude and latitude from this gsheet: https://docs.google.com/spreadsheets/d/1nehKEBKTQx11LZuo5ZJFKTVS0p5y1ysMPSOSX_m8dS8/edit?usp=sharing

library(gsheet)
locations <- 
  • Step 3 : Delete the column

Delete the column X1.

  • Step 4 : Filter the data

Filter the data to only keep the following countries: “United States”, “Canada”, “Japan”, “Belgium” and “France”.

library(dplyr)
gdp2 <- 
  • Step 5 : “Lengthens” the data

You need to “lengthens” (modify from wide to long) the dataframe “gdp2” to get three column: “country”, “year”, “gdp”.

library(tidyr)
gdp3 <-

Now, let’s begin with the last part of this exercise:

  • Step 6 : Merging datasets

Join the data frame locations to the data frame gdp3.

gdp4 <-