Chapter 8 Data Wrangling 3/4

8.1 Introduction

What makes R a compelling programming language is its facility to wrangle data on the fly. In this chapter, you will learn the basics of data manipulation. Building on the knowledge acquired in the previous chapters, you will transform datasets to prepare them for Chapter 10, which is all about data visualization!

We will use the United Nations Industrial Development Organization (UNIDO) dataset to illustrate this session.

At the end of the chapter, you should be able to:

  1. transform your dataframe from long to wide form, and from wide to long form.

You will go from a database of 655’350 points to a graphic made of 6 observations.

8.2 Long and wide form

We can observe two types of layouts in a dataset:

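  • wide form: each variable has its own column, so each row gathers all the values for one unit of observation (longitudinal data)
  • long form: a single column holds all the values, and another column identifies which variable each value belongs to (panel data)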
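To make the two layouts concrete, here is a minimal sketch that stores the same information in both forms. The dataframes longToy and wideToy and all their values are hypothetical and are not taken from the UNIDO dataset.

```r
# Hypothetical toy data (invented values, not from the UNIDO dataset)

# Long form: one row per (year, variable) pair
longToy <- data.frame(
  year     = c(2008, 2008, 2009, 2009),
  variable = c("establishments", "employees", "establishments", "employees"),
  value    = c(10, 250, 12, 300)
)

# Wide form: one row per year, one column per variable
wideToy <- data.frame(
  year           = c(2008, 2009),
  establishments = c(10, 12),
  employees      = c(250, 300)
)

dim(longToy)  # 4 rows, 3 columns
dim(wideToy)  # 2 rows, 3 columns
```

Both objects carry the same four measurements; only the arrangement differs.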

8.2.1 From Long to Wide

Currently, our dataset dataSorted (obtained in the previous chapter) is in long form. To switch its layout, you will use the pivot_wider() function from the tidyr package.

Here, we want to obtain the number of establishments and the number of employees per isicCode for each year.

# Loading tidyr
library(tidyr)

# Using pivot_wider() to transform a long dataframe into a wide dataframe
wideData <- dataSorted %>%
  pivot_wider(names_from = isicCode, values_from = value)

# First 6 lines
head(wideData)

# Dimension of the dataset
dim(wideData)

The dataframe wideData is composed of 6 rows and 165 columns.
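To see how names_from and values_from shape the result, the same call can be tried on a very small case. This is only a sketch: toyLong and its values are invented for illustration, and only the column names isicCode and value mirror those of dataSorted.

```r
library(tidyr)

# Hypothetical toy data: invented values, column names borrowed from dataSorted
toyLong <- data.frame(
  year     = c(2008, 2008, 2009, 2009),
  isicCode = c("15", "16", "15", "16"),
  value    = c(100, 40, 110, 45)
)

# Each distinct isicCode becomes a column, filled with the matching value
toyWide <- toyLong %>%
  pivot_wider(names_from = isicCode, values_from = value)

toyWide
dim(toyWide)  # 2 rows (one per year), 3 columns (year, 15, 16)
```

The number of rows drops to one per year, while the number of columns grows with the number of distinct codes.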

8.2.2 From Wide to Long

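We can do the opposite, i.e. convert data from wide to long format, using the pivot_longer() function, which also belongs to the tidyr package. Note the exclamation mark in front of c(year, tableCode, countryCode): it negates the selection, so every column except these three is gathered into the isicCode and value columns.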

# Loading tidyr (pivot_longer() belongs to tidyr, not reshape2)
library(tidyr)

# Using pivot_longer() to transform from wide to long data
longData <- wideData %>% 
  pivot_longer(!c(year, tableCode, countryCode), names_to = "isicCode", values_to = "value")

# Dimension of the dataframe
dim(longData)

# First 6 lines
head(longData)
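The negative selection with the exclamation mark can be compared with an explicit selection on a small case. The sketch below assumes a hypothetical wide dataframe toyWide with invented values; both calls should return the same long dataframe.

```r
library(tidyr)

# Hypothetical toy data in wide form (invented values)
toyWide <- data.frame(
  year = c(2008, 2009),
  `15` = c(100, 110),
  `16` = c(40, 45),
  check.names = FALSE  # keep "15" and "16" as column names
)

# Negative selection: pivot every column except year
viaNegation <- toyWide %>%
  pivot_longer(!year, names_to = "isicCode", values_to = "value")

# Positive selection: name the columns to pivot explicitly
viaNames <- toyWide %>%
  pivot_longer(c(`15`, `16`), names_to = "isicCode", values_to = "value")

# Compare the two results
all.equal(viaNegation, viaNames)

dim(viaNegation)  # 4 rows, 3 columns
```

Negative selection is convenient when the columns to pivot are numerous, as is the case with the isicCode columns of wideData.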

In order to visualize your data in R, it is important to present your dataframe in the long format.

TL;DR

# Loading tidyr
library(tidyr)

# Using pivot_wider() to transform a long dataframe into a wide dataframe
wideData <- dataSorted %>%
  pivot_wider(names_from = isicCode, values_from = value)

# First 6 lines
head(wideData)

# Dimension of the dataset
dim(wideData)

# Loading tidyr (pivot_longer() belongs to tidyr, not reshape2)
library(tidyr)

# Using pivot_longer() to transform from wide to long data
longData <- wideData %>% 
  pivot_longer(!c(year, tableCode, countryCode), names_to = "isicCode", values_to = "value")

# Dimension of the dataframe
dim(longData)

# First 6 lines
head(longData)

Code learned in this chapter

Command         Detail
pivot_wider()   “Widens” data
head()          Returns the first rows
dim()           Retrieves or sets the dimension of an object
pivot_longer()  “Lengthens” data
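As a quick reminder of what head() and dim() return, here is a short sketch on a hypothetical dataframe called toy, created only for this illustration.

```r
# Hypothetical toy dataframe with 10 rows and 2 columns
toy <- data.frame(id = 1:10, value = (1:10)^2)

head(toy)         # first 6 rows by default
head(toy, n = 3)  # first 3 rows

dim(toy)          # 10 rows, 2 columns
```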

Getting your hands dirty

It’s time to practice! This exercise begins in Chapter 6 and continues through Chapter 9, so it is divided into 4 parts. You’ll work with a CSV file available on GitHub in the chapter6 folder.

Before starting the third part of this exercise, let’s recall the first two parts:

  • Step 1 : Import via a csv

Import the csv file called chapter6data.csv.

library(readr)
gdp <-

  • Step 2 : Import via a gsheet

Import a dataset containing longitude and latitude from this gsheet: https://docs.google.com/spreadsheets/d/1nehKEBKTQx11LZuo5ZJFKTVS0p5y1ysMPSOSX_m8dS8/edit?usp=sharing

library(gsheet)
locations <-

  • Step 3 : Delete the column

Delete the column X1.

  • Step 4 : Filter the data

Filter the data to only keep the following countries: “United States”, “Canada”, “Japan”, “Belgium” and “France”.

library(dplyr)
gdp2 <- 

Now, let’s begin with the third part of this exercise:

  • Step 5 : “Lengthen” the data

You need to “lengthen” (convert from wide to long) the dataframe gdp2 to get three columns: “country”, “year” and “gdp”.

library(tidyr)
gdp3 <-