class: center, middle, inverse, title-slide

# Northern Corridor: A Data Science Perspective

### Thierry Warin, HEC Montréal & CIRANO, Sept. 15, 2020

---

# Things to remember

- one article on the current relevance of going through the Northern corridor [international trade]
- one article on a systematic review of the literature on global transportation by ships [international trade]
- one article on a gravity model using our data from the corridoR package [international trade]
- one article on the corridoR package [computer science]

---

# Things to remember

- corridoR: an R package to compute distances
- an algorithmic systematic review of the literature

---

# 1 - The corridoR package

- Why?
    - to simulate the opening of the Northern corridor and find out which courses would benefit from it.
- Can we do it without the package?
    - yes, we did it once, and we do not want to do it again ;-)!

---

# 1 - The corridoR package

- We compiled a dataset of more than 26,000 courses going through the Panama Canal (source of the raw data: Marine Traffic)
- Assumption: ships go through either the Panama Canal or the Northern corridor
- Issue #1: to compute distances, we need ports' latitudes and longitudes
    - We added latitudes and longitudes to the ports for these 26,000 courses
- Issue #2: we need to model the earth with land, rivers and lakes
    - We created a shapefile by merging two shapefiles, one for the countries and one for the rivers and lakes centerlines.

---

# 1 - The corridoR package

- Issue #3: on the final shapefile, lots of courses were closed
    - We reopened these courses manually and corrected them by hand
- Issue #4: how to calculate the distances?
    - We computed distances for courses going through the Panama Canal and distances for courses going through the Northern corridor.
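---

# 1 - The corridoR package

A simple baseline for Issue #4 is the great-circle (haversine) distance, summed over the waypoints of a course. The sketch below is only an illustration and not the corridoR implementation: it assumes a spherical earth, ignores land masses (which is why the shapefile was needed), and uses hypothetical waypoint coordinates. The names `haversine_km` and `route_length_km` are illustrative and are not part of the package.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a perfectly spherical earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before the sqrt/asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def route_length_km(waypoints):
    """Sum the great-circle legs along an ordered list of (lat, lon) waypoints."""
    return sum(haversine_km(*p, *q) for p, q in zip(waypoints, waypoints[1:]))


# Hypothetical waypoints for one origin-destination pair (not real port data).
via_south = [(50.0, 0.0), (10.0, -80.0), (35.0, 140.0)]
via_north = [(50.0, 0.0), (75.0, -100.0), (35.0, 140.0)]

print(f"southern course: {route_length_km(via_south):.0f} km")
print(f"northern course: {route_length_km(via_north):.0f} km")
```

Comparing the two totals for each origin-destination pair is one way to flag the courses that a new corridor would shorten. A realistic computation also has to route around land.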
---

# 1 - The corridoR package

- After all these steps, we decided to create a package that does all these computations for the researcher (<https://github.com/warint/corridoR>)
- We created two functions to easily integrate distances into a researcher's workflow (gravity models, etc.):
    - `corridor_country()`: find the country ISO code
    - `corridor_data(country = "", port = "")`: find a maritime course and distances for a country or a port

---

# 1 - The corridoR package

- In terms of publications:
    - Package: to be submitted in the fall to CRAN
    - Article: to be submitted in the fall to SoftwareX or Data in Brief.

---

# 2 - An Algorithmic Systematic Literature Review

- 1,452 documents
- average years from publication: 8.98
- average citations per document: 12.24
- average citations per year per document: 1.345
- number of references: 36,750

---

# 2 - An Algorithmic Systematic Literature Review

- articles: 1,029
- data papers: 1
- book reviews: 86
- proceedings papers: 275

---

# 2 - An Algorithmic Systematic Literature Review

- authors: 4,102
- author appearances: 5,483
- authors of single-authored documents: 292
- authors of multi-authored documents: 3,810

---

# 2 - An Algorithmic Systematic Literature Review

- single-authored documents: 339
- documents per author: 0.354
- authors per document: 2.83
- co-authors per document: 3.78

---

# 2 - An Algorithmic Systematic Literature Review

<img src="images/figure1.png" width="50%" height="50%" style="display: block; margin: auto;" />

---

# 2 - Most frequently cited first authors

<img src="images/figure2.png" width="50%" height="50%" style="display: block; margin: auto;" />

---

# 2 - Most frequently cited manuscripts

<img src="images/figure3.png" width="50%" height="50%" style="display: block; margin: auto;" />

---

# 2 - Authors’ Dominance ranking

The Dominance Factor is a ratio indicating the fraction of multi-authored articles in which a scholar appears as the first author.

<img src="images/figure4.png" width="50%" height="50%"
style="display: block; margin: auto;" />

---

# 2 - Lotka’s Law coefficient estimation

- Lotka’s law describes the frequency of publication by authors in any given field as an inverse square law: the number of authors publishing a certain number of articles is a fixed ratio to the number of authors publishing a single article. This assumption implies that the theoretical beta coefficient of Lotka’s law is equal to 2 (Lotka A.J., 1926).
- Using the `lotka` function, it is possible to estimate the beta coefficient of our bibliographic collection and to assess, through a statistical test, how similar this empirical distribution is to the theoretical one.

---

# 2 - Lotka’s Law coefficient estimation

<img src="images/figure5.png" width="50%" height="50%" style="display: block; margin: auto;" />

---

# 2 - Author Collaboration Network

<img src="images/figure6.png" width="50%" height="50%" style="display: block; margin: auto;" />

---

# The summary statistics of the network

These statistics describe the structural properties of a network:

- `Size`: the number of vertices composing the network;
- `Density`: the proportion of present edges among all possible edges in the network;
- `Transitivity`: the ratio of triangles to connected triples;
- `Diameter`: the longest geodesic distance (length of the shortest path between two nodes) in the network;
- `Degree distribution`: the cumulative distribution of vertex degrees;
- `Degree centralization`: the normalized degree of the overall network;

---

# The summary statistics of the network

- `Closeness centralization`: the normalized inverse of the vertex average geodesic distance to others in the network;
- `Eigenvector centralization`: the first eigenvector of the graph matrix;
- `Betweenness centralization`: the normalized number of geodesics that pass through the vertex;
- `Average path length`: the mean of the shortest distance between each pair of vertices in the network.
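---

# The summary statistics of the network

Several of these statistics can be computed by hand on a small graph. The sketch below is a minimal illustration on a hypothetical four-author collaboration network (authors A to D are made up). It is not the code used to produce the figures in this deck, and the function names are illustrative.

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Undirected adjacency sets built from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    """Present edges divided by all possible edges."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) / 2
    return m / (n * (n - 1) / 2)


def transitivity(adj):
    """Closed connected triples over all connected triples.

    Each triangle closes three triples, one centred on each of its vertices.
    """
    closed = triples = 0
    for nb in adj.values():
        for a, b in combinations(nb, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


def bfs_distances(adj, source):
    """Shortest path lengths (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def path_stats(adj):
    """Return (diameter, average path length) over connected vertex pairs."""
    lengths = [d for s in adj
               for t, d in bfs_distances(adj, s).items() if t != s]
    return max(lengths), sum(lengths) / len(lengths)


# Hypothetical co-authorship ties: A, B and C wrote together; D only with C.
graph = build_graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
diameter, avg_path = path_stats(graph)
print("size:", len(graph))
print("density:", round(density(graph), 3))
print("transitivity:", round(transitivity(graph), 3))
print("diameter:", diameter)
print("average path length:", round(avg_path, 3))
```

On this toy graph, 4 of the 6 possible edges are present and 3 of the 5 connected triples are closed, so density is about 0.667 and transitivity is 0.6.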
---

# 2 - Author Collaboration Network

<img src="images/figure7.png" width="50%" height="50%" style="display: block; margin: auto;" />

---

# 2 - Universities Collaboration Network

<img src="images/figure8.png" width="50%" height="50%" style="display: block; margin: auto;" />

---

# 2 - Co-Citation Network

In the co-citation network, a reference is included when it is cited by two articles published in the same journal. The co-citation network therefore captures the common references to the concept of uncertainty in articles published by a journal.

<img src="images/figure9.png" width="50%" height="50%" style="display: block; margin: auto;" />

---

# 2 - Journal (Source) co-citation analysis

<img src="images/figure10.png" width="50%" height="50%" style="display: block; margin: auto;" />

---

# 2 - Co-occurrences network (Keywords)

<img src="images/figure11.png" width="50%" height="50%" style="display: block; margin: auto;" />

---

# 2 - Author's Keywords Usage Evolution Over Time

<img src="images/figure13.png" width="50%" height="50%" style="display: block; margin: auto;" />

---

# Things to remember

- one article on the current relevance of going through the Northern corridor [international trade]
- one article on a systematic review of the literature on global transportation by ships [international trade]
- one article on a gravity model using our data from the corridoR package [international trade]
- one article on the corridoR package [computer science]

---

# Thank you!

- Thanks, all
- Thanks to Martin Paquette (Research Professional at CIRANO), Marine Leroi (Research Professional at CIRANO) and Thibault Sénégas (nüance-R)