Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.
In the course, we will introduced you a new way to refer a function to its package. In fact, we will use the code line Library(package)
to mention the needed package and use any function of that package anywhere under that line without having to refer to it everytime. It will be easier as we will use the piping syntax in this session.
This package will be used for this course:
In this course we will use the “piping” syntax to compute summary statistics.
A pipe is represented by this sign: %>%.
The point of the pipe is to help you write code in a way that is easier to read and understand. In fact, “piping” makes your code more readable. Pipes are a powerful tool for clearly expressing a sequence of multiple operations.
To see why the pipe is so useful, let’s take a look at the most basic example that use the operator before you go into the more advanced usages of the operator in the following sections.
log(x)
can be rewritten as x %>% log()
In short, this means that functions that take one argument, function(argument)
, can be rewritten as follows: argument %>% function()
.
First, load the data on which summary statistics will be done.
library(gsheet)
dataFull <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1uLaXke-KPN28-ESPPoihk8TiXVWp5xuNGHW7w7yqLCc/edit#gid=329712902")
rank | company | country | industrial.sector | RD.usd | sales.usd | year |
---|---|---|---|---|---|---|
1 | VOLKSWAGEN | Germany | Automobiles Parts | 14712470286 | NA | 2015 |
2 | DAIMLER | Germany | Automobiles Parts | 6335781792 | 145635513797 | 2015 |
3 | ROBERT BOSCH | Germany | Automobiles Parts | 5653984389 | 54892540624 | 2015 |
4 | SANOFI | France | Pharmaceuticals Biotechnology | 5396067608 | 37868911705 | 2015 |
5 | BMW | Germany | Automobiles Parts | 5120208790 | 90159856973 | 2015 |
6 | SIEMENS | Germany | Electronic Electrical Equipment | 4908268479 | 80649456021 | 2015 |
Here some data on R&D and Sales in USD of companies per country and industrial sector from 2004 to 2015.
To compute summary statistics on the data, the aggregate
function is used. The function allow you to split the data into subsets, compute summary statistics for each such as the mean and the sum, and returns the result in a convenient form. We will also use another way to compute summary statistics by piping the following functions: group_by()
, summarize()
, mean()
, sum()
and ungroup()
.
To calculate the sum of R&D per Country:
sum_RD_Country <- aggregate(RD.usd ~ country, data = dataFull, sum)
Detailed arguments of the function:
RD.usd
: The value on which to average
country
: To obtain the average RD per country
dataFull
: Data used
sum
: To calculate the sum
country | RD.usd |
---|---|
Australia | 31701141962 |
Austria | 15967871461 |
Belgium | 50598161005 |
Bermuda | 25757237590 |
Brazil | 27585939295 |
Bulgaria | 13681470 |
To calculate the mean of R&D per Country:
mean_RD_Country <- aggregate(RD.usd ~ country, data = dataFull, mean)
Detailed arguments of the function:
RD.usd
: The value on which to average
country
: To obtain the average RD per country
dataFull
: Data used
mean
: To calculate the average
country | RD.usd |
---|---|
Australia | 320213555 |
Austria | 44110142 |
Belgium | 109996002 |
Bermuda | 429287293 |
Brazil | 452228513 |
Bulgaria | 6840735 |
To calculate the sum of R&D per Country and Year:
sum_RD_CountryYear <- aggregate(RD.usd ~ country + year, data = dataFull, sum)
Detailed arguments of the function:
RD.usd
: The value on which to average
country + year
: To obtain the sum of R&D per country and per year
dataFull
: Data used
sum
: To calculate the sum
country | year | RD.usd |
---|---|---|
Australia | 2004 | 1326329413 |
Austria | 2004 | 815106978 |
Belgium | 2004 | 3918844424 |
Bermuda | 2004 | 2855248522 |
Brazil | 2004 | 936977342 |
Canada | 2004 | 7392941158 |
To calculate the mean of R&D per Country and Year:
mean_RD_CountryYear <- aggregate(RD.usd ~ country + year, data = dataFull, mean)
Detailed arguments of the function:
RD.usd
: The value on which to average
country + year
: To obtain the sum of R&D per country per year
dataFull
: Data used
mean
: To calculate the average
country | year | RD.usd |
---|---|---|
Australia | 2004 | 331582353 |
Austria | 2004 | 42900367 |
Belgium | 2004 | 122463888 |
Bermuda | 2004 | 475874754 |
Brazil | 2004 | 234244336 |
Canada | 2004 | 568687781 |
To calculate the sum of R&D and Sales per Country, using the cbind() function:
Detailed arguments of the function:
cbind()
: To put the columns RD.usd and sales.usd next to each other, otherwise the aggregate function will sum the two columns.
RD.usd, sales.usd
: Values on which to average.
country
: To obtain the average of RD per country.
dataFull
: Data used
sum
: To calculate the sum
country | RD.usd | sales.usd |
---|---|---|
Australia | 28738126700 | 8.908038e+11 |
Austria | 6877398846 | 6.357453e+11 |
Belgium | 21422944979 | 1.013106e+12 |
Bermuda | 25757237590 | 6.820938e+11 |
Brazil | 25515764646 | 1.906141e+12 |
Canada | 51566211570 | 1.101636e+12 |
To calculate the sum of R&D and Sales per Country, without the cbind() function:
sum2_RDSales_Country <- aggregate(RD.usd + sales.usd ~ country, data = dataFull, sum)
Detailed arguments of the function:
RD.usd + sales.usd
: Values on which to average.
country
: To obtain the average of RD per country.
dataFull
: Data used
sum
: To calculate the sum
country | RD.usd + sales.usd |
---|---|
Australia | 9.195419e+11 |
Austria | 6.426227e+11 |
Belgium | 1.034529e+12 |
Bermuda | 7.078511e+11 |
Brazil | 1.931657e+12 |
Canada | 1.153202e+12 |
You can see the difference with the operation above. The columns RD.usd and sales.usd are put in the same column and sum together.
To calculate the mean of R&D and Sales per Country, using the cbind() function:
Detailed arguments of the function:
cbind()
: To put the columns RD.usd and sales.usd next to each other, otherwise the aggregate function will do the mean of the two columns
RD.usd, sales.usd
: Values on which to average
country
: To obtain the average of RD per country
dataFull
: Data used
mean
: To calculate the average
country | RD.usd | sales.usd |
---|---|---|
Australia | 312370942 | 9682650226 |
Austria | 54152747 | 5005868443 |
Belgium | 132240401 | 6253742481 |
Bermuda | 429287293 | 11368230610 |
Brazil | 439926977 | 32864497385 |
Canada | 280251150 | 5987153638 |
To calculate the mean of R&D and Sales per Country, without the cbind() function:
mean2_RDSales_Country <- aggregate(RD.usd + sales.usd ~ country, data = dataFull, mean)
Detailed arguments of the function:
RD.usd + sales.usd
: Values on which to average
country
: To obtain the average of RD per country
dataFull
: Data used
mean
: To calculate the average
country | RD.usd + sales.usd |
---|---|
Australia | 9995021168 |
Austria | 5060021189 |
Belgium | 6385982882 |
Bermuda | 11797517903 |
Brazil | 33304424361 |
Canada | 6267404788 |
You can see the difference with the operation above. The columns RD.usd and sales.usd are put in the same column and the average is done on both of them together.
To calculate the sum of R&D and Sales per Country and Year:
Detailed arguments of the function:
cbind(RD.usd, sales.usd)
: Values on which to average
country + year
: To obtain the average RD per country and per year
dataFull
: Data used
sum
: To calculate the sum
country | year | RD.usd | sales.usd |
---|---|---|---|
Australia | 2004 | 1326329413 | 59001083381 |
Bermuda | 2004 | 2855248522 | 109528219651 |
Brazil | 2004 | 936977342 | 82666092199 |
Canada | 2004 | 7392941158 | 117503606183 |
China | 2004 | 1187048998 | 191124385020 |
Croatia | 2004 | 272229903 | 2703306251 |
To calculate the mean of R&D and Sales per Country and Year:
Detailed arguments of the function:
cbind(RD.usd, sales.usd)
: Values on which to average
country + year
: To obtain the average RD per country and per year
dataFull
: Data used
mean
: To calculate the average
country | year | RD.usd | sales.usd |
---|---|---|---|
Australia | 2004 | 331582353 | 14750270845 |
Bermuda | 2004 | 475874754 | 18254703275 |
Brazil | 2004 | 234244336 | 20666523050 |
Canada | 2004 | 568687781 | 9038738937 |
China | 2004 | 296762249 | 47781096255 |
Croatia | 2004 | 136114952 | 1351653125 |
To perform this last operation, do a subset of dataFull by keeping the columns 2 to 5 (company, country, industrial.sector and RD.usd).
datasubset <- dataFull[2:5]
company | country | industrial.sector | RD.usd |
---|---|---|---|
VOLKSWAGEN | Germany | Automobiles Parts | 14712470286 |
DAIMLER | Germany | Automobiles Parts | 6335781792 |
ROBERT BOSCH | Germany | Automobiles Parts | 5653984389 |
SANOFI | France | Pharmaceuticals Biotechnology | 5396067608 |
BMW | Germany | Automobiles Parts | 5120208790 |
SIEMENS | Germany | Electronic Electrical Equipment | 4908268479 |
To calculate the mean of R&D per Country, Company and Industrial Sector:
sum_RD_CountryCompanyIndustry <- aggregate(RD.usd ~ ., data = datasubset, sum)
Detailed arguments of the function:
RD.usd
: The value on which to average
.
: To obtain the sum of R&D on every column
dataFull
: Data used
sum
: To calculate the sum
company | country | industrial.sector | RD.usd |
---|---|---|---|
FACC | Austria | Aerospace Defence | 12567800 |
SABCA | Belgium | Aerospace Defence | 127281394 |
EMBRAER | Brazil | Aerospace Defence | 2684133551 |
BOMBARDIER | Canada | Aerospace Defence | 6920050267 |
CAE | Canada | Aerospace Defence | 309496291 |
AERO VODOCHODY | Czech Republic | Aerospace Defence | 17266478 |
To calculate the mean of R&D per Country, Company and Industrial Sector:
mean_RD_CountryCompanyIndustry<- aggregate(RD.usd ~ ., data = datasubset, mean)
Detailed arguments of the function:
RD.usd
: The value on which to average
.
: To obtain the average of R&D on every column
dataFull
: Data used
mean
: To calculate the average
company | country | industrial.sector | RD.usd |
---|---|---|---|
FACC | Austria | Aerospace Defence | 12567800 |
SABCA | Belgium | Aerospace Defence | 15910174 |
EMBRAER | Brazil | Aerospace Defence | 206471812 |
BOMBARDIER | Canada | Aerospace Defence | 532311559 |
CAE | Canada | Aerospace Defence | 103165430 |
AERO VODOCHODY | Czech Republic | Aerospace Defence | 8633239 |
To calculate the sum of R&D per Country:
Detailed arguments of the function:
dataFull
: Data used
group_by
: Convert into a grouped tbl
country
: The column on which to group_by
summarize
: To reduce multiple values down to a single value
sum
: To calculate the sum
RD.usd
: The value on which to sum
ungroup
: Remove grouping
country | RDusd_sum |
---|---|
Australia | 31701141962 |
Austria | 15967871461 |
Belgium | 50598161005 |
Bermuda | 25757237590 |
Brazil | 27585939295 |
Bulgaria | 13681470 |
To calculate the mean of R&D per Country:
Detailed arguments of the function:
dataFull
: Data used
group_by
: Convert into a grouped tbl
country
: The column on which to group_by
summarize
: To reduce multiple values down to a single value
mean
: To calculate the average
RD.usd
: The value on which to average
ungroup
: Remove grouping
country | RDusd_mean |
---|---|
Australia | 320213555 |
Austria | 44110142 |
Belgium | 109996002 |
Bermuda | 429287293 |
Brazil | 452228513 |
Bulgaria | 6840735 |
To calculate the sum of R&D per Country and Year:
Detailed arguments of the function:
dataFull
: Data used
group_by
: Convert into a grouped tbl
country
and year
: Columns on which to group_by
summarize
: To reduce multiple values down to a single value
sum
: To calculate the sum
RD.usd
: The value on which to sum
ungroup
: Remove grouping
country | year | RDusd_sum |
---|---|---|
Australia | 2004 | 1326329413 |
Australia | 2005 | 409801916 |
Australia | 2006 | 756237929 |
Australia | 2007 | 1046598742 |
Australia | 2008 | 1960352411 |
Australia | 2009 | 1936156478 |
To calculate the mean of R&D per Country and Year:
Detailed arguments of the function:
dataFull
: Data used
group_by
: Convert into a grouped tbl
country
and year
: Columns on which to group_by
summarize
: To reduce multiple values down to a single value
mean
: To calculate the average
RD.usd
: The value on which to average
ungroup
: Remove grouping
country | year | RDusd_mean |
---|---|---|
Australia | 2004 | 331582353 |
Australia | 2005 | 409801916 |
Australia | 2006 | 189059482 |
Australia | 2007 | 149514106 |
Australia | 2008 | 245044051 |
Australia | 2009 | 242019560 |
To calculate the sum of R&D and Sales per Country:
Detailed arguments of the function:
dataFull
: Data used
group_by
: Convert into a grouped tbl
country
: The column on which to group_by
summarize
: To reduce multiple values down to a single value
sum
: To calculate the sum
RD.usd
and sales.usd
: Values on which to sum
ungroup
: Remove grouping
country | RDusd_sum | salesusd_sum |
---|---|---|
Australia | 31701141962 | 8.908038e+11 |
Austria | 15967871461 | 6.357453e+11 |
Belgium | 50598161005 | 1.013106e+12 |
Bermuda | 25757237590 | 6.820938e+11 |
Brazil | 27585939295 | 1.906141e+12 |
Bulgaria | 13681470 | 0.000000e+00 |
To calculate the mean of R&D and Sales per Country:
Detailed arguments of the function:
dataFull
: Data used
group_by
: Convert into a grouped tbl
country
: The column on which to group_by
summarize
: To reduce multiple values down to a single value
mean
: To calculate the average
RD.usd
and sales.usd
: Values on which to average
ungroup
: Remove grouping
country | RDusd_mean | salesusd_mean |
---|---|---|
Australia | 320213555 | 9682650226 |
Austria | 44110142 | 5005868443 |
Belgium | 109996002 | 6253742481 |
Bermuda | 429287293 | 11368230610 |
Brazil | 452228513 | 32864497385 |
Bulgaria | 6840735 | NaN |
To calculate summary statistics per country:
summary_RD_Country <- dataFull %>%
group_by(country) %>%
summarize(RD.usd_mean = mean(RD.usd, na.rm = TRUE),
RD.usd_median = median(RD.usd, na.rm = TRUE),
RD.usd_min = min(RD.usd, na.rm = TRUE),
RD.usd_max = max(RD.usd, na.rm = TRUE),
RD.usd_q0.25 = quantile(RD.usd, 0.25, na.rm = TRUE)) %>%
ungroup()
country | RD.usd_mean | RD.usd_median | RD.usd_min | RD.usd_max | RD.usd_q0.25 |
---|---|---|---|---|---|
Australia | 320213555 | 153538583 | 40988873 | 1404025144 | 100332737 |
Austria | 44110142 | 26177086 | 4383595 | 177854217 | 11918427 |
Belgium | 109996002 | 28034132 | 4082317 | 1302354760 | 13447074 |
Bermuda | 429287293 | 286303239 | 43881142 | 1144833914 | 108475235 |
Brazil | 452228513 | 216834284 | 41363693 | 1632838868 | 90044131 |
Bulgaria | 6840735 | 6840735 | 6715507 | 6965964 | 6778121 |
To calculate the sum of R&D and Sales per Country and Year:
Detailed arguments of the function:
dataFull
: Data used
group_by
: Convert into a grouped tbl
country
and year
: Columns on which to group_by
summarize
: To reduce multiple values down to a single value
sum
: To calculate the sum
RD.usd
and sales.usd
: Values on which to sum
ungroup
: Remove grouping
country | year | RDusd_sum | salesusd_sum |
---|---|---|---|
Australia | 2004 | 1326329413 | 59001083381 |
Australia | 2005 | 409801916 | 18907890659 |
Australia | 2006 | 756237929 | 25341999677 |
Australia | 2007 | 1046598742 | 66264859636 |
Australia | 2008 | 1960352411 | 54140271458 |
Australia | 2009 | 1936156478 | 48945981160 |
To calculate the mean of R&D and Sales per Country and Year:
Detailed arguments of the function:
dataFull
: Data used
group_by
: Convert into a grouped tbl
country
and year
: Columns on which to group_by
summarize
: To reduce multiple values down to a single value
mean
: To calculate the average
RD.usd
and sales.usd
: Values on which to average
ungroup
: Remove grouping
country | year | RDusd_mean | salesusd_mean |
---|---|---|---|
Australia | 2004 | 331582353 | 14750270845 |
Australia | 2005 | 409801916 | 18907890659 |
Australia | 2006 | 189059482 | 6335499919 |
Australia | 2007 | 149514106 | 9466408519 |
Australia | 2008 | 245044051 | 6767533932 |
Australia | 2009 | 242019560 | 6118247645 |
For attribution, please cite this work as
Warin (2019, May 28). Thierry Warin, PhD: [R Course] Data Wrangling with R: Computing Summary Statistics. Retrieved from https://warin.ca/posts/rcourse-datawranglingwithr-computing/
BibTeX citation
@misc{warin2019[r, author = {Warin, Thierry}, title = {Thierry Warin, PhD: [R Course] Data Wrangling with R: Computing Summary Statistics}, url = {https://warin.ca/posts/rcourse-datawranglingwithr-computing/}, year = {2019} }