[R Course] Data Wrangling with R: Computing Summary Statistics

Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.

Thierry Warin https://warin.ca/aboutme.html (HEC Montréal and CIRANO (Canada))https://www.hec.ca/en/profs/thierry.warin.html
05-28-2019

Table of Contents


Set up

Library

In the course, we will introduced you a new way to refer a function to its package. In fact, we will use the code line Library(package) to mention the needed package and use any function of that package anywhere under that line without having to refer to it everytime. It will be easier as we will use the piping syntax in this session.

This package will be used for this course:


library(dplyr)

Piping

In this course we will use the “piping” syntax to compute summary statistics.

A pipe is represented by this sign: %>%.

The point of the pipe is to help you write code in a way that is easier to read and understand. In fact, “piping” makes your code more readable. Pipes are a powerful tool for clearly expressing a sequence of multiple operations.

To see why the pipe is so useful, let’s take a look at the most basic example that use the operator before you go into the more advanced usages of the operator in the following sections.

log(x) can be rewritten as x %>% log()

In short, this means that functions that take one argument, function(argument), can be rewritten as follows: argument %>% function().

Data

First, load the data on which summary statistics will be done.


library(gsheet)

dataFull <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1uLaXke-KPN28-ESPPoihk8TiXVWp5xuNGHW7w7yqLCc/edit#gid=329712902")
rank company country industrial.sector RD.usd sales.usd year
1 VOLKSWAGEN Germany Automobiles Parts 14712470286 NA 2015
2 DAIMLER Germany Automobiles Parts 6335781792 145635513797 2015
3 ROBERT BOSCH Germany Automobiles Parts 5653984389 54892540624 2015
4 SANOFI France Pharmaceuticals Biotechnology 5396067608 37868911705 2015
5 BMW Germany Automobiles Parts 5120208790 90159856973 2015
6 SIEMENS Germany Electronic Electrical Equipment 4908268479 80649456021 2015

Here some data on R&D and Sales in USD of companies per country and industrial sector from 2004 to 2015.

Functions

To compute summary statistics on the data, the aggregate function is used. The function allow you to split the data into subsets, compute summary statistics for each such as the mean and the sum, and returns the result in a convenient form. We will also use another way to compute summary statistics by piping the following functions: group_by(), summarize(), mean(), sum() and ungroup().

One ~ one

Sum

To calculate the sum of R&D per Country:


sum_RD_Country <- aggregate(RD.usd ~ country, data = dataFull, sum)

Detailed arguments of the function:
RD.usd: The value on which to average
country: To obtain the average RD per country
dataFull: Data used
sum: To calculate the sum

country RD.usd
Australia 31701141962
Austria 15967871461
Belgium 50598161005
Bermuda 25757237590
Brazil 27585939295
Bulgaria 13681470

Mean

To calculate the mean of R&D per Country:


mean_RD_Country <- aggregate(RD.usd ~ country, data = dataFull, mean)

Detailed arguments of the function:
RD.usd: The value on which to average
country: To obtain the average RD per country
dataFull: Data used
mean: To calculate the average

country RD.usd
Australia 320213555
Austria 44110142
Belgium 109996002
Bermuda 429287293
Brazil 452228513
Bulgaria 6840735

One ~ many

Sum

To calculate the sum of R&D per Country and Year:


sum_RD_CountryYear <- aggregate(RD.usd ~ country + year, data = dataFull, sum)

Detailed arguments of the function:
RD.usd: The value on which to average
country + year : To obtain the sum of R&D per country and per year
dataFull: Data used
sum: To calculate the sum

country year RD.usd
Australia 2004 1326329413
Austria 2004 815106978
Belgium 2004 3918844424
Bermuda 2004 2855248522
Brazil 2004 936977342
Canada 2004 7392941158

Mean

To calculate the mean of R&D per Country and Year:


mean_RD_CountryYear <- aggregate(RD.usd ~ country + year, data = dataFull, mean)

Detailed arguments of the function:
RD.usd: The value on which to average
country + year : To obtain the sum of R&D per country per year
dataFull: Data used
mean: To calculate the average

country year RD.usd
Australia 2004 331582353
Austria 2004 42900367
Belgium 2004 122463888
Bermuda 2004 475874754
Brazil 2004 234244336
Canada 2004 568687781

Many ~ one

Sum

To calculate the sum of R&D and Sales per Country, using the cbind() function:


sum_RDSales_Country <- aggregate(cbind(RD.usd, sales.usd) ~ country, data = dataFull, sum)

Detailed arguments of the function:
cbind(): To put the columns RD.usd and sales.usd next to each other, otherwise the aggregate function will sum the two columns.
RD.usd, sales.usd: Values on which to average.
country: To obtain the average of RD per country.
dataFull: Data used
sum: To calculate the sum

country RD.usd sales.usd
Australia 28738126700 8.908038e+11
Austria 6877398846 6.357453e+11
Belgium 21422944979 1.013106e+12
Bermuda 25757237590 6.820938e+11
Brazil 25515764646 1.906141e+12
Canada 51566211570 1.101636e+12

To calculate the sum of R&D and Sales per Country, without the cbind() function:


sum2_RDSales_Country <- aggregate(RD.usd + sales.usd ~ country, data = dataFull, sum)

Detailed arguments of the function:
RD.usd + sales.usd: Values on which to average.
country: To obtain the average of RD per country.
dataFull: Data used
sum: To calculate the sum

country RD.usd + sales.usd
Australia 9.195419e+11
Austria 6.426227e+11
Belgium 1.034529e+12
Bermuda 7.078511e+11
Brazil 1.931657e+12
Canada 1.153202e+12

You can see the difference with the operation above. The columns RD.usd and sales.usd are put in the same column and sum together.

Mean

To calculate the mean of R&D and Sales per Country, using the cbind() function:


mean_RDSales_Country <- aggregate(cbind(RD.usd, sales.usd) ~ country, data = dataFull, mean)

Detailed arguments of the function:
cbind(): To put the columns RD.usd and sales.usd next to each other, otherwise the aggregate function will do the mean of the two columns
RD.usd, sales.usd: Values on which to average
country: To obtain the average of RD per country
dataFull: Data used
mean: To calculate the average

country RD.usd sales.usd
Australia 312370942 9682650226
Austria 54152747 5005868443
Belgium 132240401 6253742481
Bermuda 429287293 11368230610
Brazil 439926977 32864497385
Canada 280251150 5987153638

To calculate the mean of R&D and Sales per Country, without the cbind() function:


mean2_RDSales_Country <- aggregate(RD.usd + sales.usd ~ country, data = dataFull, mean)

Detailed arguments of the function:
RD.usd + sales.usd: Values on which to average
country: To obtain the average of RD per country
dataFull: Data used
mean: To calculate the average

country RD.usd + sales.usd
Australia 9995021168
Austria 5060021189
Belgium 6385982882
Bermuda 11797517903
Brazil 33304424361
Canada 6267404788

You can see the difference with the operation above. The columns RD.usd and sales.usd are put in the same column and the average is done on both of them together.

Many ~ many

Sum

To calculate the sum of R&D and Sales per Country and Year:


sum_RDSales_CountryYear <- aggregate(cbind(RD.usd, sales.usd) ~ country + year, data = dataFull, sum)

Detailed arguments of the function:
cbind(RD.usd, sales.usd): Values on which to average
country + year: To obtain the average RD per country and per year
dataFull: Data used
sum: To calculate the sum

country year RD.usd sales.usd
Australia 2004 1326329413 59001083381
Bermuda 2004 2855248522 109528219651
Brazil 2004 936977342 82666092199
Canada 2004 7392941158 117503606183
China 2004 1187048998 191124385020
Croatia 2004 272229903 2703306251

Mean

To calculate the mean of R&D and Sales per Country and Year:


mean_RDSales_CountryYear <- aggregate(cbind(RD.usd, sales.usd) ~ country + year, data = dataFull, mean)

Detailed arguments of the function:
cbind(RD.usd, sales.usd): Values on which to average
country + year: To obtain the average RD per country and per year
dataFull: Data used
mean: To calculate the average

country year RD.usd sales.usd
Australia 2004 331582353 14750270845
Bermuda 2004 475874754 18254703275
Brazil 2004 234244336 20666523050
Canada 2004 568687781 9038738937
China 2004 296762249 47781096255
Croatia 2004 136114952 1351653125

Dot notation

To perform this last operation, do a subset of dataFull by keeping the columns 2 to 5 (company, country, industrial.sector and RD.usd).


datasubset <- dataFull[2:5]
company country industrial.sector RD.usd
VOLKSWAGEN Germany Automobiles Parts 14712470286
DAIMLER Germany Automobiles Parts 6335781792
ROBERT BOSCH Germany Automobiles Parts 5653984389
SANOFI France Pharmaceuticals Biotechnology 5396067608
BMW Germany Automobiles Parts 5120208790
SIEMENS Germany Electronic Electrical Equipment 4908268479


Sum

To calculate the mean of R&D per Country, Company and Industrial Sector:


sum_RD_CountryCompanyIndustry <- aggregate(RD.usd ~ ., data = datasubset, sum)

Detailed arguments of the function:
RD.usd: The value on which to average
.: To obtain the sum of R&D on every column
dataFull: Data used
sum: To calculate the sum

company country industrial.sector RD.usd
FACC Austria Aerospace Defence 12567800
SABCA Belgium Aerospace Defence 127281394
EMBRAER Brazil Aerospace Defence 2684133551
BOMBARDIER Canada Aerospace Defence 6920050267
CAE Canada Aerospace Defence 309496291
AERO VODOCHODY Czech Republic Aerospace Defence 17266478

Mean

To calculate the mean of R&D per Country, Company and Industrial Sector:


mean_RD_CountryCompanyIndustry<- aggregate(RD.usd ~ ., data = datasubset, mean)

Detailed arguments of the function:
RD.usd: The value on which to average
.: To obtain the average of R&D on every column
dataFull: Data used
mean: To calculate the average

company country industrial.sector RD.usd
FACC Austria Aerospace Defence 12567800
SABCA Belgium Aerospace Defence 15910174
EMBRAER Brazil Aerospace Defence 206471812
BOMBARDIER Canada Aerospace Defence 532311559
CAE Canada Aerospace Defence 103165430
AERO VODOCHODY Czech Republic Aerospace Defence 8633239

Group one by one

Sum

To calculate the sum of R&D per Country:


sum_RD_Country <- dataFull %>% 
  group_by(country) %>%  
  summarize(RDusd_sum = sum(RD.usd, na.rm=TRUE)) %>%
  ungroup()

Detailed arguments of the function:
dataFull: Data used
group_by: Convert into a grouped tbl
country: The column on which to group_by
summarize: To reduce multiple values down to a single value
sum: To calculate the sum
RD.usd: The value on which to sum
ungroup: Remove grouping

country RDusd_sum
Australia 31701141962
Austria 15967871461
Belgium 50598161005
Bermuda 25757237590
Brazil 27585939295
Bulgaria 13681470

Mean

To calculate the mean of R&D per Country:


mean_RD_Country <- dataFull %>% 
  group_by(country) %>%  
  summarize(RDusd_mean = mean(RD.usd, na.rm=TRUE)) %>%
  ungroup()

Detailed arguments of the function:
dataFull: Data used
group_by: Convert into a grouped tbl
country: The column on which to group_by
summarize: To reduce multiple values down to a single value
mean: To calculate the average
RD.usd: The value on which to average
ungroup: Remove grouping

country RDusd_mean
Australia 320213555
Austria 44110142
Belgium 109996002
Bermuda 429287293
Brazil 452228513
Bulgaria 6840735

Group one by many

Sum

To calculate the sum of R&D per Country and Year:


sum_RD_CountryYear <- dataFull %>% 
  group_by(country, year) %>%  
  summarize(RDusd_sum = sum(RD.usd, na.rm=TRUE)) %>%
  ungroup()

Detailed arguments of the function:
dataFull: Data used
group_by: Convert into a grouped tbl
country and year: Columns on which to group_by
summarize: To reduce multiple values down to a single value
sum: To calculate the sum
RD.usd: The value on which to sum
ungroup: Remove grouping

country year RDusd_sum
Australia 2004 1326329413
Australia 2005 409801916
Australia 2006 756237929
Australia 2007 1046598742
Australia 2008 1960352411
Australia 2009 1936156478

Mean

To calculate the mean of R&D per Country and Year:


mean_RD_CountryYear <- dataFull %>% 
  group_by(country, year) %>%  
  summarize(RDusd_mean = mean(RD.usd, na.rm=TRUE)) %>%
  ungroup()

Detailed arguments of the function:
dataFull: Data used
group_by: Convert into a grouped tbl
country and year: Columns on which to group_by
summarize: To reduce multiple values down to a single value
mean: To calculate the average
RD.usd: The value on which to average
ungroup: Remove grouping

country year RDusd_mean
Australia 2004 331582353
Australia 2005 409801916
Australia 2006 189059482
Australia 2007 149514106
Australia 2008 245044051
Australia 2009 242019560

Group many by one

Sum

To calculate the sum of R&D and Sales per Country:


sum_RDSales_Country <- dataFull %>% 
  group_by(country) %>%  
  summarize(RDusd_sum = sum(RD.usd, na.rm=TRUE),
            salesusd_sum = sum(sales.usd, na.rm=TRUE)) %>%
  ungroup()

Detailed arguments of the function:
dataFull: Data used
group_by: Convert into a grouped tbl
country: The column on which to group_by
summarize: To reduce multiple values down to a single value
sum: To calculate the sum
RD.usd and sales.usd: Values on which to sum
ungroup: Remove grouping

country RDusd_sum salesusd_sum
Australia 31701141962 8.908038e+11
Austria 15967871461 6.357453e+11
Belgium 50598161005 1.013106e+12
Bermuda 25757237590 6.820938e+11
Brazil 27585939295 1.906141e+12
Bulgaria 13681470 0.000000e+00

Mean

To calculate the mean of R&D and Sales per Country:


mean_RD_Country <- dataFull %>% 
  group_by(country) %>%  
  summarize(RDusd_mean = mean(RD.usd, na.rm=TRUE),
            salesusd_mean = mean(sales.usd, na.rm=TRUE)) %>%
  ungroup()

Detailed arguments of the function:
dataFull: Data used
group_by: Convert into a grouped tbl
country: The column on which to group_by
summarize: To reduce multiple values down to a single value
mean: To calculate the average
RD.usd and sales.usd: Values on which to average
ungroup: Remove grouping

country RDusd_mean salesusd_mean
Australia 320213555 9682650226
Austria 44110142 5005868443
Belgium 109996002 6253742481
Bermuda 429287293 11368230610
Brazil 452228513 32864497385
Bulgaria 6840735 NaN

Summary statistics

To calculate summary statistics per country:


summary_RD_Country <- dataFull %>% 
  group_by(country) %>%  
  summarize(RD.usd_mean = mean(RD.usd, na.rm = TRUE), 
            RD.usd_median = median(RD.usd, na.rm = TRUE),
            RD.usd_min = min(RD.usd, na.rm = TRUE),
            RD.usd_max = max(RD.usd, na.rm = TRUE),
            RD.usd_q0.25 = quantile(RD.usd, 0.25, na.rm = TRUE)) %>%
  ungroup()
country RD.usd_mean RD.usd_median RD.usd_min RD.usd_max RD.usd_q0.25
Australia 320213555 153538583 40988873 1404025144 100332737
Austria 44110142 26177086 4383595 177854217 11918427
Belgium 109996002 28034132 4082317 1302354760 13447074
Bermuda 429287293 286303239 43881142 1144833914 108475235
Brazil 452228513 216834284 41363693 1632838868 90044131
Bulgaria 6840735 6840735 6715507 6965964 6778121

Group many by many

Sum

To calculate the sum of R&D and Sales per Country and Year:


sum_RDSales_CountryYear <- dataFull %>% 
  group_by(country, year) %>%  
  summarize(RDusd_sum = sum(RD.usd, na.rm=TRUE),
            salesusd_sum = sum(sales.usd, na.rm=TRUE)) %>%
  ungroup()

Detailed arguments of the function:
dataFull: Data used
group_by: Convert into a grouped tbl
country and year: Columns on which to group_by
summarize: To reduce multiple values down to a single value
sum: To calculate the sum
RD.usd and sales.usd: Values on which to sum
ungroup: Remove grouping

country year RDusd_sum salesusd_sum
Australia 2004 1326329413 59001083381
Australia 2005 409801916 18907890659
Australia 2006 756237929 25341999677
Australia 2007 1046598742 66264859636
Australia 2008 1960352411 54140271458
Australia 2009 1936156478 48945981160

Mean

To calculate the mean of R&D and Sales per Country and Year:


mean_RDSales_CountryYear <- dataFull %>% 
  group_by(country, year) %>%  
  summarize(RDusd_mean = mean(RD.usd, na.rm=TRUE),
            salesusd_mean = mean(sales.usd, na.rm=TRUE)) %>%
  ungroup()

Detailed arguments of the function:
dataFull: Data used
group_by: Convert into a grouped tbl
country and year: Columns on which to group_by
summarize: To reduce multiple values down to a single value
mean: To calculate the average
RD.usd and sales.usd: Values on which to average
ungroup: Remove grouping

country year RDusd_mean salesusd_mean
Australia 2004 331582353 14750270845
Australia 2005 409801916 18907890659
Australia 2006 189059482 6335499919
Australia 2007 149514106 9466408519
Australia 2008 245044051 6767533932
Australia 2009 242019560 6118247645


Citation

For attribution, please cite this work as

Warin (2019, May 28). Thierry Warin: [R Course] Data Wrangling with R: Computing Summary Statistics. Retrieved from https://warin.ca/posts/datawranglingwithr-computing/

BibTeX citation

@misc{warin2019[r,
  author = {Warin, Thierry},
  title = {Thierry Warin: [R Course] Data Wrangling with R: Computing Summary Statistics},
  url = {https://warin.ca/posts/datawranglingwithr-computing/},
  year = {2019}
}