bibliometrix: An R-Tool for Bibliometric Analysis

Bibliometrix

Bibliometrix is developed by Massimo Aria and Corrado Cuccurullo and is an open-source tool for quantitative research in scientometrics and bibliometrics that includes all the main bibliometric methods of analysis.

The bibliometrix package provides routines for importing bibliographic data from the SCOPUS, Clarivate Analytics' Web of Science, PubMed and Cochrane databases, for performing bibliometric analysis, and for building data matrices for co-citation, coupling, scientific collaboration and co-word analysis.

A shiny app is also available through this R package. See the section Biblioshiny.

This procedure is based on the bibliometrix vignette and the article written by the package authors (Aria and Cuccurullo, 2017).

Import & Convert

Import a bibfile and convert it into a dataframe.

library("bibliometrix")
library(dplyr)

D <- readFiles("savedrecs.bib")

M <- convert2df(D, dbsource = "isi", format = "bibtex")

D is a large character vector. The argument of readFiles is the name (or names) of the files downloaded from SCOPUS, Clarivate Analytics WoS, or the Cochrane CDSR website.

readFiles combines all the text files into a single large character vector. Furthermore, the text is converted to UTF-8.

convert2df creates a bibliographic data frame with cases corresponding to manuscripts and variables corresponding to the Field Tags in the original export file.

Each manuscript contains several elements, such as authors' names, title, keywords and other information. All these elements constitute the bibliographic attributes of a document, also called metadata. Data frame columns are named using the standard Clarivate Analytics WoS Field Tag codes.

Field Tag Description

Field Tag   Description
AU          Authors
TI          Document Title
SO          Publication Name (or Source)
JI          ISO Source Abbreviation
DT          Document Type
DE          Authors' Keywords
ID          Keywords associated by SCOPUS or ISI database
AB          Abstract
C1          Author Address
RP          Reprint Address
TC          Times Cited
PY          Year
SC          Subject Category
UT          Unique Article Identifier
DB          Bibliographic Database
DI          Digital Object Identifier (DOI)
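
To make the layout concrete, here is a minimal sketch of such a data frame in base R. The records are invented for illustration and only a few field tags are included; a real data frame produced by convert2df has many more columns.

```r
# Toy bibliographic data frame using WoS-style field tags (invented records)
M_toy <- data.frame(
  AU = c("ARIA M;CUCCURULLO C", "SMITH J", "SMITH J;ARIA M"),
  TI = c("PAPER ONE", "PAPER TWO", "PAPER THREE"),
  SO = c("JOURNAL A", "JOURNAL B", "JOURNAL A"),
  PY = c(2017, 2018, 2019),
  TC = c(25, 3, 10),
  stringsAsFactors = FALSE
)

# Multi-valued fields such as AU hold several items in one string,
# so they are split on the separator before counting
author_list <- strsplit(M_toy$AU, ";", fixed = TRUE)
n_authors <- lengths(author_list)

# Number of articles per author across the whole collection
author_counts <- sort(table(unlist(author_list)), decreasing = TRUE)
print(author_counts)
```

This is why most functions in the package take a sep argument: each cell of a multi-valued column has to be split before anything can be counted.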

Bibliometric Analysis

results <- biblioAnalysis(M, sep = ";")
S <- summary(object = results, k = 10, pause = FALSE)
S

biblioAnalysis calculates main bibliometric measures and returns an object of class "bibliometrix".

summary displays the main information about the bibliographic data frame and several tables, such as annual scientific production, top manuscripts per number of citations, most productive authors, most productive countries, total citations per country, most relevant sources (journals) and most relevant keywords. It accepts two additional arguments.

k is a formatting value that sets the number of rows of each table. With k = 10 you see the top 10 authors, the top 10 sources, etc.

pause is a logical value (TRUE or FALSE) that enables or disables pausing between tables while the output scrolls.

# plot the results
plot(x = results, k = 10, pause = FALSE)

plot() allows you to plot the results created by the function biblioAnalysis().


Analysis of Cited References and Author's influence

The function citations generates the frequency table of the most cited references or the most cited first authors (of references).

For each manuscript, cited references are in a single string stored in the column “CR” of the data frame.

For a correct extraction, you need to identify the separator used between references in the ISI or SCOPUS export. Usually, the separator is ";" or ".  " (a dot followed by two spaces).

# To see what the sep is
M$CR[1]

# To obtain the most frequent cited manuscripts:
CR <- citations(M, field = "article", sep = ";")
cbind(CR$Cited[1:10])

# To obtain the most frequent cited first authors:
CR <- citations(M, field = "author", sep = ";")
cbind(CR$Cited[1:10])

citations: generates the frequency table of the most cited references or the most cited first authors

localCitations: generates the frequency table of the most locally cited authors. Local citations measure how many times an author (or a document) included in this collection has been cited by other authors also in the collection.
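
The counting idea behind such a frequency table can be sketched in base R. This is only an illustration on invented reference strings, assuming ";" as the separator and assuming that the first author is the text before the first comma; it is not the package's implementation.

```r
# Toy CR column: one string of cited references per manuscript (invented)
CR_toy <- c(
  "LOTKA AJ, 1926, J WASH ACAD SCI;KESSLER MM, 1963, AM DOC",
  "KESSLER MM, 1963, AM DOC;GARFIELD E, 2004, J INF SCI",
  "KESSLER MM, 1963, AM DOC"
)

# Split each string on the separator and trim stray whitespace
refs <- trimws(unlist(strsplit(CR_toy, ";", fixed = TRUE)))

# Frequency table of cited references, most cited first
cited <- sort(table(refs), decreasing = TRUE)

# First author of each reference: the text before the first comma
first_authors <- sub(",.*$", "", refs)
cited_authors <- sort(table(first_authors), decreasing = TRUE)

print(cited[1])
print(cited_authors)
```

If the wrong separator is used, each CR string stays in one piece and every "reference" appears exactly once, which is a quick way to spot the problem.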

Authors’ Dominance ranking

The function dominance calculates the authors’ dominance ranking as proposed by Kumar and Jan (2013).

The Dominance Factor is a ratio indicating the fraction of multi-authored articles in which a scholar appears as the first author.

DF <- dominance(results, k = 10)
DF

dominance calculates the authors’ dominance ranking. The package documentation attributes the measure to Kumar & Kumar (2008).

results is an object of class bibliometrix obtained by biblioAnalysis.

k is the number of authors to consider in the analysis.
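
As a rough illustration of the ratio itself, the sketch below computes a Dominance Factor on invented author lists. The helper dominance_factor is hypothetical and written only for this example; it assumes the first author is the first name in each list.

```r
# Toy author lists, first author listed first (invented records)
AU_toy <- c("ARIA M;CUCCURULLO C", "CUCCURULLO C;ARIA M",
            "ARIA M;SMITH J", "SMITH J")

authors_by_paper <- strsplit(AU_toy, ";", fixed = TRUE)

# Keep only multi-authored articles
multi <- authors_by_paper[lengths(authors_by_paper) > 1]

# Hypothetical helper: first-authored multi-authored articles divided by
# all multi-authored articles of the given author
dominance_factor <- function(author, papers) {
  n_multi <- sum(vapply(papers, function(p) author %in% p, logical(1)))
  n_first <- sum(vapply(papers, function(p) p[1] == author, logical(1)))
  n_first / n_multi
}

df_aria <- dominance_factor("ARIA M", multi)
df_cucc <- dominance_factor("CUCCURULLO C", multi)
print(c(ARIA = df_aria, CUCCURULLO = df_cucc))
```

In this toy collection ARIA M is first author in two of three multi-authored articles, and CUCCURULLO C in one of two.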

Authors’ h-index

The h-index is an author-level metric that attempts to measure both the productivity and citation impact of the publications of a scientist or scholar.

The index is based on the set of the scientist’s most cited papers and the number of citations that they have received in other publications.

indices <- Hindex(M, field = "author", elements="LASSERRE F", sep = ";", years = 10)

# Lasserre's impact indices:
indices$H

# Lasserre's citations
indices$CitationList

function: Hindex calculates the authors’ H-index or the sources’ H-index and its variants (g-index and m-index) in a bibliographic collection.

arguments:

  • M is a bibliographic data frame
  • field is a character element that defines the unit of analysis in terms of authors (field = "author") or sources (field = "source");
  • elements is a character vector containing the authors’ names (or the sources’ names) for which you want to calculate the H-index. The argument has the form c("SURNAME1 N","SURNAME2 N",…). For each author, surname and initials are separated by one blank space. For example, for the authors ARIA MASSIMO and CUCCURULLO CORRADO, the argument is elements = c("ARIA M", "CUCCURULLO C").
# To calculate the h-index of the first 10 most productive authors (in this collection):
authors = gsub(","," ", names(results$Authors)[1:10])

indices <- Hindex(M, field = "author", elements=authors, sep = ";", years = 50)

indices$H

In this case, the elements argument of Hindex() is a vector containing the 10 most productive authors, taken from names(results$Authors)[1:10]; gsub() replaces any commas in the names with spaces so that they match the "SURNAME N" form. To use the 15 most productive authors instead, change [1:10] to [1:15].
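
For readers who want to see what the indices measure, here is a small base R sketch of the standard h-index and g-index definitions applied to an invented vector of citation counts. The function names h_index and g_index are hypothetical and are not part of bibliometrix.

```r
# h-index: the largest h such that h papers have at least h citations each
h_index <- function(citations) {
  sorted <- sort(citations, decreasing = TRUE)
  sum(sorted >= seq_along(sorted))
}

# g-index: the largest g such that the top g papers together have
# at least g^2 citations (capped here at the number of papers)
g_index <- function(citations) {
  sorted <- sort(citations, decreasing = TRUE)
  sum(cumsum(sorted) >= seq_along(sorted)^2)
}

# Invented citation counts for one author's six papers
tc <- c(10, 8, 5, 4, 3, 0)

h <- h_index(tc)
g <- g_index(tc)
print(c(h = h, g = g))
```

The m-index reported alongside these is the h-index divided by the number of years since the author's first publication.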

Top-Authors’ Productivity over the Time

topAU <- authorProdOverTime(M, k = 10, graph = TRUE)

function: authorProdOverTime calculates and plots the authors’ production (in terms of number of publications and total citations per year) over time.

arguments:

  • M is a bibliographic data frame.
  • k is the number of top authors to include.
  • graph is a logical. If graph=TRUE, the function plots the author production over time graph.

Lotka’s Law coefficient estimation

Lotka’s law describes the frequency of publication by authors in any given field as an inverse square law, where the number of authors publishing a certain number of articles is a fixed ratio to the number of authors publishing a single article. This assumption implies that the theoretical beta coefficient of Lotka’s law is equal to 2 (Lotka A.J., 1926).

Using the lotka function, it is possible to estimate the Beta coefficient of our bibliographic collection and to assess, through a statistical test, the similarity of this empirical distribution with the theoretical one.

L <- lotka(results)

# Author Productivity. Empirical Distribution
L$AuthorProd

# Beta coefficient estimate
L$Beta

# Constant
L$C

# Goodness of fit
L$R2

# P-value of K-S two sample test
L$p.value

lotka estimates Lotka’s law coefficients for scientific productivity.

The table L$AuthorProd shows the observed distribution of scientific productivity in our example.

The estimated Beta coefficient is 2.77 with a goodness of fit equal to 0.94. The Kolmogorov-Smirnov two-sample test gives a p-value of 0.02, which at the conventional 0.05 level indicates a significant difference between the observed and the theoretical Lotka distributions.

Compare the two distributions using plot function:

# Observed distribution
Observed=L$AuthorProd[,3]

# Theoretical distribution with Beta = 2
Theoretical=10^(log10(L$C)-2*log10(L$AuthorProd[,1]))

plot(L$AuthorProd[,1],Theoretical,type="l",col="red",ylim=c(0, 1), xlab="Articles",ylab="Freq. of Authors",main="Scientific Productivity")

lines(L$AuthorProd[,1],Observed,col="blue")

legend(x="topright",c("Theoretical (B=2)","Observed"),col=c("red","blue"),lty = c(1,1),cex=0.6,bty="n")
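
One common way to estimate the coefficient is an ordinary least squares fit on the log-log scale. The sketch below does this on invented author counts that follow an inverse square law approximately; whether lotka() uses exactly this procedure is an assumption.

```r
# Invented productivity data: number of authors who wrote 1, 2, ... 5 articles
n_articles <- 1:5
n_authors  <- c(3600, 880, 410, 230, 140)

# Observed relative frequency of authors per productivity level
freq <- n_authors / sum(n_authors)

# Lotka's law: freq = C / n^Beta, i.e. log10(freq) = log10(C) - Beta * log10(n)
fit <- lm(log10(freq) ~ log10(n_articles))

beta_hat <- -unname(coef(fit)[2])    # estimated Beta
C_hat    <- 10^unname(coef(fit)[1])  # estimated constant
r2       <- summary(fit)$r.squared   # goodness of fit

print(round(c(Beta = beta_hat, C = C_hat, R2 = r2), 3))
```

An estimate close to 2 means the collection is consistent with the theoretical inverse square law.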

Bibliographic Network Matrices

Manuscript's attributes are connected to each other through the manuscript itself: author(s) to journal, keywords to publication date, etc.

These connections of different attributes generate bipartite networks that can be represented as rectangular matrices (Manuscripts x Attributes).

Furthermore, scientific publications regularly contain references to other scientific works. This generates a further network, namely, co-citation or coupling network.

These networks are analyzed in order to capture meaningful properties of the underlying research system, and in particular to determine the influence of bibliometric units such as scholars and journals.

Bipartite networks

cocMatrix is a general function to compute a bipartite network selecting one of the metadata attributes.

For example, to create a network Manuscript x Publication Source you have to use the field tag "SO":

A <- cocMatrix(M, Field = "SO", sep = ";")
sort(Matrix::colSums(A), decreasing = TRUE)[1:5]

A is a rectangular binary matrix, representing a bipartite network where rows and columns are manuscripts and sources respectively.

The generic element a_ij is 1 if manuscript i has been published in source j, and 0 otherwise.

The j-th column sum a_j is the number of manuscripts published in source j.

Sorting the column sums of A in decreasing order shows the most relevant publication sources.
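
The structure of such a matrix can be sketched in base R on invented data. This toy version uses a dense matrix and a single-valued field, whereas the calls above apply Matrix::colSums to the result of cocMatrix.

```r
# Toy SO column: the source of each manuscript (invented)
SO_toy <- c("JOURNAL A", "JOURNAL B", "JOURNAL A", "JOURNAL C")

sources <- sort(unique(SO_toy))

# Manuscripts x Sources binary matrix, initialised with zeros
A_toy <- matrix(0L, nrow = length(SO_toy), ncol = length(sources),
                dimnames = list(paste0("doc", seq_along(SO_toy)), sources))

# Set a_ij = 1 where manuscript i was published in source j
A_toy[cbind(seq_along(SO_toy), match(SO_toy, sources))] <- 1L

# Column sums give the number of manuscripts per source
col_sums <- sort(colSums(A_toy), decreasing = TRUE)
print(col_sums)
```

For multi-valued fields such as AU or DE, a row can contain several ones, one for each item attached to the manuscript.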

# Other networks possible

# Citation network
A <- cocMatrix(M, Field = "CR", sep = ".  ")
sort(Matrix::colSums(A), decreasing = TRUE)[1:5]

# Author network
A <- cocMatrix(M, Field = "AU", sep = ";")
sort(Matrix::colSums(A), decreasing = TRUE)[1:5]

# Author keyword network
A <- cocMatrix(M, Field = "DE", sep = ";")
sort(Matrix::colSums(A), decreasing = TRUE)[1:5]

# Keyword Plus network
A <- cocMatrix(M, Field = "ID", sep = ";")
sort(Matrix::colSums(A), decreasing = TRUE)[1:5]

Authors’ Countries is not a standard attribute of the bibliographic data frame. You need to extract this information from affiliation attribute using the function metaTagExtraction.

metaTagExtraction allows you to extract the following additional field tags:

  • Authors’ countries (Field = "AU_CO");
  • First Author’s countries (Field = "AU1_CO");
  • First author of each cited reference (Field = "CR_AU");
  • Publication source of each cited reference (Field = "CR_SO");
  • Authors’ affiliations (Field = "AU_UN").
# Country network
MM <- metaTagExtraction(M, Field = "AU_CO", sep = ";")
A <- cocMatrix(MM, Field = "AU_CO", sep = ";")
sort(Matrix::colSums(A), decreasing = TRUE)[1:5]

Bibliographic coupling

Two articles are said to be bibliographically coupled if at least one cited source appears in the bibliographies or reference lists of both articles (Kessler, 1963).

The function biblioNetwork calculates, starting from a bibliographic data frame, the most frequently used coupling networks: Authors, Sources, and Countries.

biblioNetwork uses two arguments to define the network to compute:

  • analysis argument can be "co-citation", "coupling", "collaboration", or "co-occurrences".
  • network argument can be "authors", "references", "sources", "countries", "universities", "keywords", "author_keywords", "titles" and "abstracts".
# The following code calculates a classical article coupling network:
NetMatrix <- biblioNetwork(M, analysis = "coupling", network = "references", sep = ".  ")
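
In matrix terms, if A is a Manuscripts x Cited References binary matrix, the article coupling network is the product of A with its transpose. A minimal base R sketch on an invented matrix:

```r
# Manuscripts x cited references (1 = the manuscript cites the reference)
A_toy <- matrix(c(1, 1, 0,
                  1, 1, 1,
                  0, 0, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("doc1", "doc2", "doc3"),
                                c("ref1", "ref2", "ref3")))

# Coupling matrix: element (i, j) is the number of references
# that manuscripts i and j have in common
coupling <- tcrossprod(A_toy)   # same as A_toy %*% t(A_toy)
print(coupling)
```

The diagonal holds the number of references of each manuscript, and the off-diagonal entries are the raw coupling strengths.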

Articles with only a few references, therefore, would tend to be more weakly bibliographically coupled, if coupling strength is measured simply according to the number of references that articles contain in common.

This suggests that it might be more practical to switch to a relative measure of bibliographic coupling.

The normalizeSimilarity function calculates Association strength, Inclusion, Jaccard or Salton similarity among vertices of a network. normalizeSimilarity can be called directly from the networkPlot() function using the argument normalize.
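
To show what a relative measure looks like, the sketch below applies the usual textbook definitions of the Salton and Jaccard indices to an invented coupling matrix. Whether normalizeSimilarity uses exactly these formulas is an assumption.

```r
# Invented coupling matrix: the diagonal holds each manuscript's number of
# references, the off-diagonal entries hold the shared references
C_toy <- matrix(c(2, 2, 0,
                  2, 3, 1,
                  0, 1, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("doc1", "doc2", "doc3"),
                                c("doc1", "doc2", "doc3")))

d <- diag(C_toy)

# Salton (cosine) similarity: c_ij / sqrt(c_ii * c_jj)
salton <- C_toy / sqrt(outer(d, d))

# Jaccard similarity: c_ij / (c_ii + c_jj - c_ij)
jaccard <- C_toy / (outer(d, d, "+") - C_toy)

print(round(salton, 3))
print(round(jaccard, 3))
```

Both measures rescale the raw counts to the interval from 0 to 1, so a manuscript with a short reference list is no longer penalised simply for citing less.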

NetMatrix <- biblioNetwork(M, analysis = "coupling", network = "authors", sep = ";")

net=networkPlot(NetMatrix, normalize = "salton", weighted=NULL, n = 100, 
                Title = "Authors' Coupling", type = "fruchterman", 
                size=5, size.cex=T, remove.multiple=TRUE, labelsize=0.8, label.n=10, label.cex=F)

Bibliographic co-citation

We talk about co-citation of two articles when both are cited in a third article. Thus, co-citation can be seen as the counterpart of bibliographic coupling.

Using the function biblioNetwork, you can calculate a classical reference co-citation network:

NetMatrix <- biblioNetwork(M, analysis = "co-citation", network = "references", sep = ".  ")
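
Co-citation uses the same Manuscripts x Cited References matrix A as coupling, but multiplies in the other order: the transpose of A times A. A minimal base R sketch on an invented matrix:

```r
# Manuscripts x cited references (1 = the manuscript cites the reference)
A_toy <- matrix(c(1, 1, 0,
                  1, 1, 1,
                  0, 0, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("doc1", "doc2", "doc3"),
                                c("ref1", "ref2", "ref3")))

# Co-citation matrix: element (i, j) is the number of manuscripts
# that cite both reference i and reference j
cocitation <- crossprod(A_toy)   # same as t(A_toy) %*% A_toy
print(cocitation)
```

Here the diagonal holds the number of times each reference is cited within the collection.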

Bibliographic collaboration

A scientific collaboration network is a network where nodes are authors and links are co-authorships; co-authorship is one of the most well-documented forms of scientific collaboration (Glanzel, 2004).

Using the function biblioNetwork, you can calculate an authors’ collaboration network:

# collaboration network
NetMatrix <- biblioNetwork(M, analysis = "collaboration", network = "authors", sep = ";")

# country collaboration network:
NetMatrix <- biblioNetwork(M, analysis = "collaboration", network = "countries", sep = ";")

Network Graph Characteristics

The function networkStat calculates several summary statistics.

In particular, starting from a bibliographic matrix (or an igraph object), two groups of descriptive measures are computed:

  • The summary statistics of the network;

  • The main indices of centrality and prestige of vertices.

NetMatrix <- biblioNetwork(M, analysis = "co-occurrences", network = "keywords", sep = ";")
netstat <- networkStat(NetMatrix)
netstat

The summary statistics of the network

This group of statistics describes the structural properties of a network:

  • Size: is the number of vertices composing the network;

  • Density: is the proportion of present edges from all possible edges in the network;

  • Transitivity: is the ratio of triangles to connected triples;

  • Diameter: is the longest geodesic distance (length of the shortest path between two nodes) in the network;

  • Degree distribution: is the cumulative distribution of vertex degrees;

  • Degree centralization: is the normalized degree of the overall network;

  • Closeness centralization: is the normalized inverse of the vertex average geodesic distance to others in the network;

  • Eigenvector centralization: is the first eigenvector of the graph matrix;

  • Betweenness centralization: is the normalized number of geodesics that pass through the vertex;

  • Average path length: is the mean of the shortest distance between each pair of vertices in the network.
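
A few of these statistics can be computed by hand from an adjacency matrix. The following base R sketch uses an invented undirected, unweighted network of four vertices; it illustrates the definitions and is not how networkStat is implemented.

```r
# Invented undirected network with edges 1-2, 1-3, 2-3 and 3-4
adj <- matrix(c(0, 1, 1, 0,
                1, 0, 1, 0,
                1, 1, 0, 1,
                0, 0, 1, 0),
              nrow = 4, byrow = TRUE)

# Size: number of vertices
size <- nrow(adj)

# Density: present edges divided by all possible edges
n_edges <- sum(adj[upper.tri(adj)])
density <- n_edges / (size * (size - 1) / 2)

# Vertex degrees
degree <- rowSums(adj)

# Transitivity: 3 * triangles / connected triples
triangles    <- sum(diag(adj %*% adj %*% adj)) / 6
triples      <- sum(choose(degree, 2))
transitivity <- 3 * triangles / triples

print(c(size = size, density = density, transitivity = transitivity))
```

The network contains one triangle (vertices 1, 2 and 3) and five connected triples, which gives a transitivity of 0.6.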

The main indices of centrality and prestige of vertices

These measures help to identify the most important vertices in a network and the propensity of two vertices that are connected to be both connected to a third vertex.

The statistics, at vertex level, returned by networkStat are:

  • Degree centrality

  • Closeness centrality: measures how many steps are required to access every other vertex from a given vertex;

  • Eigenvector centrality: is a measure of being well-connected to the well-connected;

  • Betweenness centrality: measures brokerage or gatekeeping potential. It is (approximately) the number of shortest paths between vertices that pass through a particular vertex;

  • PageRank score: approximates probability that any message will arrive to a particular vertex. This algorithm was developed by Google founders, and originally applied to website links;

  • Hub Score: estimates the value of the links outgoing from the vertex. It was initially applied to the web pages;

  • Authority Score: is another measure of centrality initially applied to the Web. A vertex has high authority when it is linked by many other vertices that are linking many other vertices;

  • Vertex Ranking: is an overall vertex ranking obtained as a linear weighted combination of the centrality and prestige vertex measures. The weights are proportional to the loadings of the first component of the Principal Component Analysis.
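
Two of these vertex-level measures are easy to sketch in base R. The example below computes normalized degree centrality and eigenvector centrality for an invented undirected network; the scaling of the eigenvector scores so that the maximum equals 1 is a choice made for this illustration.

```r
# Invented undirected network with edges 1-2, 1-3, 2-3 and 3-4
adj <- matrix(c(0, 1, 1, 0,
                1, 0, 1, 0,
                1, 1, 0, 1,
                0, 0, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("v", 1:4), paste0("v", 1:4)))

# Degree centrality, normalized by the maximum possible degree (n - 1)
deg_cent <- rowSums(adj) / (nrow(adj) - 1)

# Eigenvector centrality: leading eigenvector of the adjacency matrix,
# made non-negative and scaled so that the largest score is 1
lead <- abs(eigen(adj, symmetric = TRUE)$vectors[, 1])
eig_cent <- setNames(lead / max(lead), rownames(adj))

print(round(deg_cent, 3))
print(round(eig_cent, 3))
```

Vertex v3 scores highest on both measures because it is connected to every other vertex, while v4 scores lowest because its only link is to v3.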

To summarize the main results of the networkStat function, use the generic function summary. It displays the main information about the network and vertex description through several tables.

summary(netstat, k=10)

Three Fields Plot

Visualize the main items of three fields (e.g. authors, keywords, journals), and how they are related through a Sankey diagram.

threeFieldsPlot(M, fields = c("AU", "DE", "SO"), n = c(20, 20, 20), width = 1200, height = 600)

Visualizing Bibliographic Networks

biblioNetwork: calculates, starting from a bibliographic data frame, the most frequently used networks: Coupling, Co-citation, Co-occurrences, and Collaboration.

For more information on how to visualize networks using the function networkPlot and the VOSviewer software by Nees Jan van Eck and Ludo Waltman, see the help page of networkPlot (?networkPlot).

Country Scientific Collaboration

# Create a country collaboration network
M <- metaTagExtraction(M, Field = "AU_CO", sep = ";")
NetMatrix <- biblioNetwork(M, analysis = "collaboration", network = "countries", sep = ";")

# Plot the network
net = networkPlot(NetMatrix, n = dim(NetMatrix)[1], 
                  Title = "Country Collaboration", type = "circle", 
                  size=TRUE, remove.multiple=FALSE,labelsize=0.7,cluster="none")
# Collaboration networks (authors)
NetMatrix <- biblioNetwork(M, analysis = "collaboration", network = "authors", sep = ";")

# Plot the network
net = networkPlot(NetMatrix, n = 30, 
                Title = "Collaboration Network authors", type = "auto", 
                size=T, remove.multiple=FALSE, labelsize=0.7,edgesize = 5)
# Collaboration networks (universities)
NetMatrix <- biblioNetwork(M, analysis = "collaboration", network = "universities", sep = ";")

# Plot the network
net = networkPlot(NetMatrix, n = 30, 
                  Title = "Collaboration Network universities", type = "auto", 
                  size=T, remove.multiple=FALSE, labelsize=0.7,edgesize = 5)

Co-Citation Network

In a reference co-citation network, two references are linked when they are cited together by the same article in the collection. The weight of the link is the number of articles that cite both references.

# Create a co-citation network 
NetMatrix <- biblioNetwork(M, analysis = "co-citation", network = "references", sep = ";")

# Plot the network
net = networkPlot(NetMatrix, n = 30, 
                  Title = "Co-Citation Network", type = "fruchterman", 
                  size=T, remove.multiple=FALSE, labelsize=0.7,edgesize = 5)
# Journal (Source) co-citation analysis
M = metaTagExtraction(M, "CR_SO", sep=";")
NetMatrix <- biblioNetwork(M, analysis = "co-citation", network = "sources", sep = ";")

# Plot the network
net = networkPlot(NetMatrix, n = 50, 
                  Title = "Journals' Co-Citation Network", type = "auto", 
                  size.cex=TRUE, size=15, remove.multiple=FALSE, labelsize=0.7, edgesize = 10, edges.min=5)

Co-occurrences network

# keywords
NetMatrix <- biblioNetwork(M, analysis = "co-occurrences", network = "keywords", sep = ";")

# Plot the network
net = networkPlot(NetMatrix, normalize="association", weighted=T, n = 30, 
                  Title = "Keyword Co-occurrences", type = "fruchterman", 
                  size=T, edgesize = 5, labelsize=0.7)

Co-Word Analysis

The aim of the co-word analysis is to map the conceptual structure of a framework using the word co-occurrences in a bibliographic collection.

The analysis can be performed through dimensionality reduction techniques such as Multidimensional Scaling (MDS), Correspondence Analysis (CA) or Multiple Correspondence Analysis (MCA).

Here, we show an example using the function conceptualStructure that performs a CA or MCA to draw a conceptual structure of the field and K-means clustering to identify clusters of documents which express common concepts. Results are plotted on a two-dimensional map.

conceptualStructure includes natural language processing (NLP) routines (see the function termExtraction) to extract terms from titles and abstracts. In addition, it implements the Porter’s stemming algorithm to reduce inflected (or sometimes derived) words to their word stem, base or root form.

# Conceptual Structure using keywords (method="CA")
CS <- conceptualStructure(M,field="ID", method="CA", minDegree=4, clust=5, stemming=FALSE, labelsize=10, documents=10)

Historical Direct Citation Network

The historiographic map is a graph proposed by E. Garfield to represent a chronological network map of most relevant direct citations resulting from a bibliographic collection (Garfield, 2016).

The function histNetwork generates a chronological direct citation network matrix which can be plotted using histPlot:

# Create a historical citation network
options(width=130)
histResults <- histNetwork(M, min.citations = 10, sep = ";")

# Plot the historical direct citation network
net <- histPlot(histResults, n=15, size = 10, labelsize=5)

# Keyword growth: cumulative yearly occurrences of the top 15 author keywords
library(bibliometrix)
library(reshape2)
library(ggplot2)

kword <- KeywordGrowth(M, Tag = "DE", sep = ";", top = 15, cdf = TRUE)

DF = melt(kword, id='Year')

# Timeline keywords ggplot
ggplot(DF,aes(x=Year,y=value, group=variable, shape=variable, colour=variable))+
  geom_point()+geom_line()+ 
  scale_shape_manual(values = 1:15)+
  labs(color="Author Keywords")+
  scale_x_continuous(breaks = seq(min(DF$Year), max(DF$Year), by = 5))+
  scale_y_continuous(breaks = seq(0, max(DF$value), by=10))+
  guides(color=guide_legend(title = "Author Keywords"), shape="none")+
  labs(y="Count", title = "Author's Keywords Usage Evolution Over Time")+
  theme(text = element_text(size = 10))+
  facet_grid(variable ~ .)

Biblioshiny

Finally, a shiny app has been developed by the creators of bibliometrix to facilitate bibliometric analysis. A tutorial is available.

biblioshiny()

References

Aria, Massimo, and Corrado Cuccurullo. 2017. “Bibliometrix: An R-Tool for Comprehensive Science Mapping Analysis.” Journal of Informetrics 11 (4): 959–75. https://doi.org/10.1016/j.joi.2017.08.007.

Garfield, Eugene. 2016. “Historiographic Mapping of Knowledge Domains Literature.” Journal of Information Science, July. https://doi.org/10.1177/0165551504042802.

Kumar, Sameer, and Jariah Mohd. Jan. 2013. “Mapping Research Collaborations in the Business and Management Field in Malaysia, 1980–2010.” Scientometrics 97 (3): 491–517. https://doi.org/10.1007/s11192-013-0994-8.

Acknowledgments

To cite this course:

Warin, Thierry. 2020. “Covid-19 Simulation: A Data Science Perspective.” doi:10.6084/m9.figshare.12020994.v1.