Chapter 6 Data Wrangling 1/4
6.1 Introduction
Before you can analyse the data, you actually need to collect it. And in Margo’s and David’s course, you’ve talked a lot about spreadsheets, databases, and data warehouses. So I don’t want to spend any time on that.
What I want to talk about are data sources outside your organisation. These could be public repositories of data. For example, data.gov has an excellent collection of government data sets, sec.gov, Kaggle competitions, etc. It could also be data that you collect through an API, an application programming interface, such as the Google Map API, the Twitter API, or the Facebook API. Or it could be blogs that you get through RSS, a rich site summary, or general websites that you collect through web scrapping.
So let me go in a little bit more detail. So, first of all, what are the typical data formats that you will encounter when you go and collect your data? On the one hand, you will have your spreadsheets either as separate comma values, CSV files, or sometimes as separate tab values, TSV files. More commonly, you will find a markup language, the HTML HyperText Markup Language, which is the language of the web so that every website uses it.
You will also find a lot of data in JSON, which is JavaScript Object Notation. And sometimes, especially when it comes to scientific data, you can see it in a hierarchical data format, or in HDF format. There are also many ad hoc formats available, for example, if you have network data. So it could have graph edge lists, voting records, could be in a kind of non-standard format, fixed with files, etc.
What makes R a compiling programming language is its facility to wrangle data on the fly. In this chapter, you will learn the basics of data manipulation. Based on the knowledge acquired in the previous chapters, you will transform datasets in order to prepare them for the chapter 10 which is all about data visualization!
We will use the United Nations Industrial Development Organization (UNIDO) dataset to illustrate this session.
At the end of the chapter, you should be able to:
- import data from a .csv file as well as from a Google Sheets document;
- know what a dataframe is;
You will go from a database of 655’350 points to a graphic made of 6 observations.
6.2 Importing data
6.2.1 From a .csv file
First, upload your document in the lower right-hand panel of your console by pressing the upload button in the Files tab. The document will appear in the list of documents that you can use for your project.
Second, you need to load your data into your environment (top right-hand panel). You need to create a new variable that will hold all information from your .csv file. This way, you will be able to interact with your data. For a .csv file, you need to write the following command inside a R chunk:
# Loading the dataset into a variable named 'gdpCountryData'
gdpCountryData <- readr::read_csv("./data/gdpCountry.csv")
With:
gdpCountryData
the name of your new variable that will hold all information from a dataset;gdpCountry.csv
corresponding to the name of the files you previously imported;
Now, you can interact with your dataset by using the name of your variable. We will go in details during the following sections of this chapter.
6.2.2 From a statistical software
We will use the haven
package.
6.2.3 From a Google sheet
A second way to import data is to go through Google Drive, by using Google Sheets. RStudio is able to connect to any spreadsheet in a Google Drive account, which helps any work involving multiple people.
You need to create a new Google Sheets document. After naming your spreadsheet, you need to configure the sharing options. In order to do so, click on the Share button in the corner of your spreadsheet. A panel will appear and you need to select the following setting:
- Anyone with the link can edit
Copy the link you will obtain after this step. The link will be somewhat like this:
First, you need to load a package in your document, the gsheet
package and then use the gsheet2tbl()
function. You have to copy the link between parenthesis and quotation marks as follow:
# Loading the gsheet package
library(gsheet)
# Read data from the URL
dataUnido <- gsheet::gsheet2tbl("https://docs.google.com/spreadsheets/d/1uLaXke-KPN28-ESPPoihk8TiXVWp5xuNGHW7w7yqLCc/edit#gid=416085055")
dataUnido
To load the variables names of the UNIDO data stored in a Gsheet:
variablesUnido <- gsheet::gsheet2tbl("https://docs.google.com/spreadsheets/d/1JYiydgI9QEOKAStmtmrBTmGuGQ0bx1gnnarO4Db0MD8/edit#gid=919352283")
variablesUnido
Now you should be able to fullfil the first task of this chapter, i.e. loading data from a .csv file or from a Google Sheets document.
We will continue to work with this data from UNIDO in the following sections.
6.2.4 For almost anything at the C++ speed
We will use here data.table
package.
TL;DR
# Loading the dataset into a variable named gdpCountryData
gdpCountryData <- readr::read_csv("./data/gdpCountry.csv")
# Loading the gsheet package
library(gsheet)
# Read data from the URL
dataUnido <- gsheet::gsheet2tbl("https://docs.google.com/spreadsheets/d/1uLaXke-KPN28-ESPPoihk8TiXVWp5xuNGHW7w7yqLCc/edit#gid=416085055")
dataUnido
variablesUnido <- gsheet::gsheet2tbl("https://docs.google.com/spreadsheets/d/1JYiydgI9QEOKAStmtmrBTmGuGQ0bx1gnnarO4Db0MD8/edit#gid=919352283")
variablesUnido
Code learned in this chapter
Command | Detail |
---|---|
gsheet2tbl() | Download a table |
read_csv() | Read comma separated values (csv) |
Getting your hands dirty
It’s time to practice! This exercise begins in Chapter 6 and continues through Chapter 9. This exercise is therefore divided into 4 parts. For this exercise, you’ll work with a csv file available on Github in the chapter6 folder.
Let’s begin with the first part of this exercise!
- Step 1 : Import via a csv
Import the csv file called chapter6data.csv
.
- Step 2 : Import via a gsheet
Import a dataset containing longitude and latitude from this gsheet: https://docs.google.com/spreadsheets/d/1nehKEBKTQx11LZuo5ZJFKTVS0p5y1ysMPSOSX_m8dS8/edit?usp=sharing