## Bibliometrix

`Bibliometrix`, developed by Massimo Aria and Corrado Cuccurullo, is an open-source tool for quantitative research in scientometrics and bibliometrics that includes all the main bibliometric methods of analysis.

The `Bibliometrix` package provides routines for importing bibliographic data from SCOPUS, Clarivate Analytics' Web of Science, PubMed and Cochrane databases, performing bibliometric analysis, and building data matrices for co-citation, coupling, scientific collaboration and co-word analysis.

A Shiny app is also available through this R package; see the Biblioshiny section.

This procedure is based on the Bibliometrix vignette and on the article written by the package authors (Aria and Cuccurullo, 2017).

## Import & Convert

Import a BibTeX file and convert it into a data frame.

```
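library("bibliometrix")
library(dplyr)
# Read the exported BibTeX file into a character vector
D <- readFiles("savedrecs.bib")
# Convert the raw records into a bibliographic data frame (Web of Science export)
M <- convert2df(D, dbsource = "isi", format = "bibtex")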
```

`D` is a large character vector. The argument of `readFiles` is the name of one or more files downloaded from the SCOPUS, Clarivate Analytics WoS, or Cochrane CDSR websites.

`readFiles` combines all the text files into a single large character vector and converts the encoding to UTF-8.

`convert2df` creates a bibliographic data frame in which cases (rows) correspond to manuscripts and variables (columns) to the Field Tags of the original export file.

Each manuscript contains several elements, such as authors' names, title, keywords and other information. Together, these elements constitute the bibliographic attributes of a document, also called metadata. Data frame columns are named using the standard Clarivate Analytics WoS Field Tag codes.
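As a rough illustration of this layout, the sketch below builds a tiny data frame by hand in base R. The three records and their values are invented for illustration; only the column names (`AU`, `TI`, `SO`, `PY`, `TC`) follow the field tags used in this tutorial, and the `;` separator is assumed from the `sep = ";"` arguments used throughout.

```r
# Toy bibliographic data frame (hypothetical records, not real data).
# One row per manuscript, one column per field tag.
M_toy <- data.frame(
  AU = c("SMITH J;DOE A", "DOE A", "LEE K;SMITH J;DOE A"),
  TI = c("PAPER ONE", "PAPER TWO", "PAPER THREE"),
  SO = c("JOURNAL A", "JOURNAL B", "JOURNAL A"),
  PY = c(2015, 2017, 2018),
  TC = c(12, 3, 7),
  stringsAsFactors = FALSE
)

# Multi-valued fields such as AU hold several entries separated by ";".
authors_per_paper <- lengths(strsplit(M_toy$AU, ";", fixed = TRUE))
print(authors_per_paper)

# Single-valued fields such as SO can be tabulated directly.
print(table(M_toy$SO))
```

Thinking of `M` as an ordinary data frame like this one makes it easier to inspect or filter records with standard R tools before running the analyses.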

## Field Tag Description

Field Tag | Description |
---|---|
AU | Authors |
TI | Document Title |
SO | Publication Name (or Source) |
JI | ISO Source Abbreviation |
DT | Document Type |
DE | Authors' Keywords |
ID | Keywords associated by SCOPUS or ISI database |
AB | Abstract |
C1 | Author Address |
RP | Reprint Address |
TC | Times Cited |
PY | Year |
SC | Subject Category |
UT | Unique Article Identifier |
DB | Bibliographic Database |
DI | Digital Object Identifier (DOI) |

## Bibliometric Analysis

```
results <- biblioAnalysis(M, sep = ";")
S <- summary(object = results, k = 10, pause = FALSE)
S
```

`biblioAnalysis` calculates the main bibliometric measures and returns an object of class "bibliometrix".

`summary` displays the main information about the bibliographic data frame along with several tables: annual scientific production, top manuscripts by number of citations, most productive authors, most productive countries, total citations per country, most relevant sources (journals) and most relevant keywords. It accepts two additional arguments:

- `k` is a formatting value that sets the number of rows of each table. With `k = 10`, you see the first 10 authors, the first 10 sources, etc.
- `pause` is a logical value (TRUE or FALSE) that allows (or not) a pause in screen scrolling.
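To give a feel for what two of these tables contain, here is a minimal base-R sketch that computes an annual production table and a top-`k` author table from hand-made `AU` and `PY` vectors. The data are hypothetical, and this is not a reproduction of how `summary` computes its output; it only illustrates the role of `k`.

```r
# Hypothetical author and year fields for five manuscripts.
AU <- c("SMITH J;DOE A", "DOE A", "LEE K;SMITH J;DOE A", "LEE K;SMITH J", "DOE A;KIM H")
PY <- c(2015, 2017, 2017, 2018, 2018)

# Annual scientific production: number of manuscripts per year.
annual_production <- table(PY)
print(annual_production)

# Most productive authors: split the AU field, count, sort, keep the top k.
k <- 2
author_counts <- sort(table(unlist(strsplit(AU, ";", fixed = TRUE))), decreasing = TRUE)
top_authors <- head(author_counts, k)
print(top_authors)
```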

```
# plot the results
plot(x = results, k = 10, pause = FALSE)
```

`plot()` plots the results created by the function `biblioAnalysis()`.

### Lotka’s Law coefficient estimation

Lotka’s law describes the frequency of publication by authors in any given field as an inverse square law: the number of authors publishing a certain number of articles is a fixed ratio of the number of authors publishing a single article. This assumption implies that the theoretical beta coefficient of Lotka’s law is equal to 2 (Lotka A.J., 1926).

Using the `lotka` function, it is possible to estimate the beta coefficient of our bibliographic collection and to assess, through a statistical test, the similarity of this empirical distribution to the theoretical one.

```
L <- lotka(results)
# Author Productivity. Empirical Distribution
L$AuthorProd
# Beta coefficient estimate
L$Beta
# Constant
L$C
# Goodness of fit
L$R2
# P-value of K-S two sample test
L$p.value
```

`lotka` estimates Lotka’s law coefficients for scientific productivity.

The table `L$AuthorProd` shows the observed distribution of scientific productivity in our example.

The estimated beta coefficient is 2.77, with a goodness of fit equal to 0.94. The Kolmogorov-Smirnov two-sample test yields a p-value of 0.02, which, at the conventional 5% level, indicates a significant difference between the observed and the theoretical Lotka distributions.
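One common way to estimate such a coefficient is a linear regression on the log-log scale, since Lotka's law, freq = C / n^beta, becomes log10(freq) = log10(C) - beta * log10(n). The base-R sketch below applies this to a hypothetical productivity table constructed to follow an exact inverse-square law; whether `lotka` uses exactly this procedure internally is an assumption not verified here.

```r
# Hypothetical productivity table: number of authors who wrote n articles.
# The counts follow an exact inverse-square law so the fit is easy to check.
n_articles <- 1:5
n_authors  <- 3600 / n_articles^2          # 3600, 900, 400, 225, 144
freq       <- n_authors / sum(n_authors)   # proportion of authors

# Linear fit on the log-log scale: the slope is -beta, the intercept is log10(C).
fit  <- lm(log10(freq) ~ log10(n_articles))
beta <- -unname(coef(fit)[2])
C    <- 10^unname(coef(fit)[1])
cat("beta =", round(beta, 3), " C =", round(C, 3), "\n")
```

On real data the points do not lie exactly on a line, so the estimated beta departs from 2 and the goodness of fit falls below 1.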

Compare the two distributions using the `plot` function:

```
# Observed distribution
Observed <- L$AuthorProd[, 3]
# Theoretical distribution with Beta = 2
Theoretical <- 10^(log10(L$C) - 2 * log10(L$AuthorProd[, 1]))
# Plot the theoretical curve, then overlay the observed one
plot(L$AuthorProd[, 1], Theoretical, type = "l", col = "red", ylim = c(0, 1),
     xlab = "Articles", ylab = "Freq. of Authors", main = "Scientific Productivity")
lines(L$AuthorProd[, 1], Observed, col = "blue")
legend(x = "topright", c("Theoretical (B=2)", "Observed"), col = c("red", "blue"),
       lty = c(1, 1), cex = 0.6, bty = "n")
```

## Bibliographic Network Matrices

A manuscript's attributes are connected to each other through the manuscript itself: author(s) to journal, keywords to publication date, etc.

These connections between different attributes generate bipartite networks that can be represented as rectangular matrices (Manuscripts x Attributes).

Furthermore, scientific publications regularly contain references to other scientific works. This generates a further network, namely a co-citation or coupling network.

These networks are analyzed in order to capture meaningful properties of the underlying research system, and in particular to determine the influence of bibliometric units such as scholars and journals.

### Bipartite networks

`cocMatrix` is a general function that computes a bipartite network from one of the metadata attributes.

For example, to create a Manuscript x Publication Source network, use the field tag "SO":

```
A <- cocMatrix(M, Field = "SO", sep = ";")
sort(Matrix::colSums(A), decreasing = TRUE)[1:5]
```

`A` is a rectangular binary matrix representing a bipartite network whose rows and columns are manuscripts and sources, respectively.

The generic element $a_{ij}$ is 1 if manuscript $i$ has been published in source $j$, and 0 otherwise.

The $j$-th column sum $a_j$ is the number of manuscripts published in source $j$.

Sorting the column sums of `A` in decreasing order, as in the second line of the code above, shows the most relevant publication sources. The same approach applies to other attributes:

```
# Other networks possible
# Citation network
A <- cocMatrix(M, Field = "CR", sep = ". ")
sort(Matrix::colSums(A), decreasing = TRUE)[1:5]
# Author network
A <- cocMatrix(M, Field = "AU", sep = ";")
sort(Matrix::colSums(A), decreasing = TRUE)[1:5]
# Author keyword network
A <- cocMatrix(M, Field = "DE", sep = ";")
sort(Matrix::colSums(A), decreasing = TRUE)[1:5]
# Keyword Plus network
A <- cocMatrix(M, Field = "ID", sep = ";")
sort(Matrix::colSums(A), decreasing = TRUE)[1:5]
```
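To make the structure of such a matrix concrete, the following base-R sketch builds a small Manuscripts x Keywords matrix by hand from a hypothetical `DE` field. It only illustrates the idea of a binary bipartite matrix and its column sums; it is not the implementation of `cocMatrix`, which returns a sparse matrix.

```r
# Hypothetical author-keyword field (DE) for four manuscripts.
DE <- c("NETWORKS;CITATION", "CITATION", "NETWORKS;R;CITATION", "R")

# Split each record on the separator and collect the distinct keywords.
kw_list  <- strsplit(DE, ";", fixed = TRUE)
keywords <- sort(unique(unlist(kw_list)))

# Binary Manuscripts x Keywords matrix: entry (i, j) is 1 if manuscript i uses keyword j.
A_toy <- t(sapply(kw_list, function(kw) as.integer(keywords %in% kw)))
dimnames(A_toy) <- list(paste0("doc", seq_along(DE)), keywords)
print(A_toy)

# Column sums give the number of manuscripts per keyword, most frequent first.
print(sort(colSums(A_toy), decreasing = TRUE))
```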

Authors’ countries are not a standard attribute of the bibliographic data frame. You need to extract this information from the affiliation attribute using the function `metaTagExtraction`.

`metaTagExtraction` extracts the following additional field tags:

- Authors’ countries (`Field = "AU_CO"`);
- First author’s country (`Field = "AU1_CO"`);
- First author of each cited reference (`Field = "CR_AU"`);
- Publication source of each cited reference (`Field = "CR_SO"`);
- Authors’ affiliations (`Field = "AU_UN"`).

```
# Country network
MM <- metaTagExtraction(M, Field = "AU_CO", sep = ";")
A <- cocMatrix(MM, Field = "AU_CO", sep = ";")
sort(Matrix::colSums(A), decreasing = TRUE)[1:5]
```

### Bibliographic coupling

Two articles are said to be bibliographically coupled if at least one cited source appears in the bibliographies or reference lists of both articles (Kessler, 1963).

The function `biblioNetwork` calculates, starting from a bibliographic data frame, the most frequently used coupling networks: Authors, Sources, and Countries.

`biblioNetwork` uses two arguments to define the network to compute:

- the `analysis` argument can be "co-citation", "coupling", "collaboration", or "co-occurrences";
- the `network` argument can be "authors", "references", "sources", "countries", "universities", "keywords", "author_keywords", "titles" and "abstracts".

```
# The following code calculates a classical article coupling network:
NetMatrix <- biblioNetwork(M, analysis = "coupling", network = "references", sep = ". ")
```
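The idea behind a coupling matrix can be sketched in base R. Given a binary Manuscripts x Cited-references matrix A, the product of A with its transpose counts, for every pair of manuscripts, the references they share. The matrix below is hypothetical, and the sketch illustrates the principle rather than the internals of `biblioNetwork`.

```r
# Hypothetical Manuscripts x Cited-references matrix (1 = manuscript cites reference).
A_ref <- matrix(c(1, 1, 0, 0,
                  1, 1, 1, 0,
                  0, 0, 1, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("doc1", "doc2", "doc3"),
                                c("ref1", "ref2", "ref3", "ref4")))

# Coupling matrix B = A %*% t(A): entry (i, j) is the number of references
# shared by manuscripts i and j; the diagonal holds each reference-list length.
B <- tcrossprod(A_ref)
print(B)
```

Here `doc1` and `doc2` share two references, `doc2` and `doc3` share one, and `doc1` and `doc3` share none.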

Articles with only a few references tend to be more weakly bibliographically coupled when coupling strength is measured simply by the number of references that two articles have in common.

This suggests that it might be more practical to switch to a relative measure of bibliographic coupling.

The `normalizeSimilarity` function calculates Association strength, Inclusion, Jaccard or Salton similarity among the vertices of a network. `normalizeSimilarity` can be called directly from the `networkPlot()` function using the argument `normalize`.
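As an illustration of what a relative measure looks like, the base-R sketch below applies the textbook Salton (cosine) and Jaccard formulas to a small hypothetical coupling matrix. These are the standard definitions of the two indices; that `normalizeSimilarity` computes them in exactly this way is an assumption.

```r
# Hypothetical coupling matrix: off-diagonal = shared references,
# diagonal = length of each manuscript's reference list.
B <- matrix(c(2, 2, 0,
              2, 3, 1,
              0, 1, 2),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3"), c("doc1", "doc2", "doc3")))

d <- diag(B)

# Salton (cosine) similarity: b_ij / sqrt(b_ii * b_jj).
salton <- B / sqrt(outer(d, d))

# Jaccard similarity: b_ij / (b_ii + b_jj - b_ij).
jaccard <- B / (outer(d, d, "+") - B)

print(round(salton, 3))
print(round(jaccard, 3))
```

Both indices rescale the raw counts to the interval from 0 to 1, so a manuscript with a short reference list is no longer penalized for having few references to share.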

```
NetMatrix <- biblioNetwork(M, analysis = "coupling", network = "authors", sep = ";")
net=networkPlot(NetMatrix, normalize = "salton", weighted=NULL, n = 100,
Title = "Authors' Coupling", type = "fruchterman",
size=5, size.cex=T, remove.multiple=TRUE, labelsize=0.8, label.n=10, label.cex=F)
```

### Bibliographic co-citation

We talk about co-citation of two articles when both are cited in a third article. Thus, co-citation can be seen as the counterpart of bibliographic coupling.

Using the function `biblioNetwork`, you can calculate a classical reference co-citation network:

`NetMatrix <- biblioNetwork(M, analysis = "co-citation", network = "references", sep = ". ")`
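The "counterpart" relationship can be seen in a small base-R sketch: with a binary Manuscripts x Cited-references matrix A, the product of the transpose of A with A counts how many manuscripts cite each pair of references together. The matrix is hypothetical and the code illustrates the principle only, not the internals of `biblioNetwork`.

```r
# Hypothetical Manuscripts x Cited-references matrix (1 = manuscript cites reference).
A_ref <- matrix(c(1, 1, 0, 0,
                  1, 1, 1, 0,
                  0, 0, 1, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("doc1", "doc2", "doc3"),
                                c("ref1", "ref2", "ref3", "ref4")))

# Co-citation matrix C = t(A) %*% A: entry (i, j) is the number of manuscripts
# citing both reference i and reference j; the diagonal holds citation counts.
C <- crossprod(A_ref)
print(C)
```

The same product applied to a Manuscripts x Authors matrix would count co-authored manuscripts, which is the idea behind a collaboration network.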

### Bibliographic collaboration

A scientific collaboration network is a network in which nodes are authors and links are co-authorships, co-authorship being one of the most well-documented forms of scientific collaboration (Glanzel, 2004).

Using the function `biblioNetwork`, you can calculate an authors’ collaboration network:

```
# collaboration network
NetMatrix <- biblioNetwork(M, analysis = "collaboration", network = "authors", sep = ";")
# country collaboration network:
NetMatrix <- biblioNetwork(M, analysis = "collaboration", network = "countries", sep = ";")
```

## Network Graph Characteristics

The function `networkStat` calculates several summary statistics.

In particular, starting from a bibliographic matrix (or an igraph object), two groups of descriptive measures are computed:

- the summary statistics of the network;
- the main indices of centrality and prestige of vertices.

```
NetMatrix <- biblioNetwork(M, analysis = "co-occurrences", network = "keywords", sep = ";")
netstat <- networkStat(NetMatrix)
netstat
```

### The summary statistics of the network

This group of statistics describes the structural properties of a network:

- `Size`: the number of vertices composing the network;
- `Density`: the proportion of present edges out of all possible edges in the network;
- `Transitivity`: the ratio of triangles to connected triples;
- `Diameter`: the longest geodesic distance (length of the shortest path between two nodes) in the network;
- `Degree distribution`: the cumulative distribution of vertex degrees;
- `Degree centralization`: the normalized degree of the overall network;
- `Closeness centralization`: the normalized inverse of the vertex average geodesic distance to others in the network;
- `Eigenvector centralization`: the first eigenvector of the graph matrix;
- `Betweenness centralization`: the normalized number of geodesics that pass through the vertex;
- `Average path length`: the mean of the shortest distance between each pair of vertices in the network.
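A few of these statistics can be computed by hand on a tiny network, which may help in reading the output of `networkStat`. The base-R sketch below uses a hypothetical undirected, unweighted adjacency matrix and takes transitivity as the usual global clustering coefficient (closed triples over connected triples); the exact conventions used by `networkStat` are not verified here.

```r
# Hypothetical undirected, unweighted network of four vertices (adjacency matrix).
adj <- matrix(c(0, 1, 1, 0,
                1, 0, 1, 0,
                1, 1, 0, 1,
                0, 0, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(letters[1:4], letters[1:4]))

size    <- nrow(adj)                          # number of vertices
n_edges <- sum(adj[upper.tri(adj)])           # each edge counted once
density <- n_edges / (size * (size - 1) / 2)  # present edges / possible edges

# Transitivity: the trace of adj^3 counts every triangle 6 times, and
# sum(d * (d - 1)) counts every connected triple twice, so the ratio
# equals 3 * triangles / connected triples.
degree       <- rowSums(adj)
closed       <- sum(diag(adj %*% adj %*% adj))
triples      <- sum(degree * (degree - 1))
transitivity <- closed / triples

cat("size:", size, " density:", round(density, 3),
    " transitivity:", round(transitivity, 3), "\n")
```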

### The main indices of centrality and prestige of vertices

These measures help to identify the most important vertices in a network and the propensity of two connected vertices to both be connected to a third vertex.

The statistics, at vertex level, returned by `networkStat` are:

- `Degree centrality`: the number of edges incident to the vertex;
- `Closeness centrality`: measures how many steps are required to access every other vertex from a given vertex;
- `Eigenvector centrality`: a measure of being well-connected to the well-connected;
- `Betweenness centrality`: measures brokerage or gatekeeping potential. It is (approximately) the number of shortest paths between vertices that pass through a particular vertex;
- `PageRank score`: approximates the probability that any message will arrive at a particular vertex. This algorithm was developed by Google's founders and originally applied to website links;
- `Hub Score`: estimates the value of the links outgoing from the vertex. It was initially applied to web pages;
- `Authority Score`: another measure of centrality initially applied to the Web. A vertex has high authority when it is linked by many other vertices that are themselves linking to many other vertices;
- `Vertex Ranking`: an overall vertex ranking obtained as a linear weighted combination of the centrality and prestige vertex measures. The weights are proportional to the loadings of the first component of the Principal Component Analysis.
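Two of the vertex-level measures listed above are simple enough to sketch in base R on a hypothetical four-vertex network: degree centrality as a row sum of the adjacency matrix, and eigenvector centrality as its leading eigenvector, rescaled here so that the largest value is 1. The scaling is an illustrative choice, not necessarily the one used by `networkStat`.

```r
# Hypothetical undirected, unweighted network of four vertices (adjacency matrix).
adj <- matrix(c(0, 1, 1, 0,
                1, 0, 1, 0,
                1, 1, 0, 1,
                0, 0, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(letters[1:4], letters[1:4]))

# Degree centrality: number of edges incident to each vertex.
degree_centrality <- rowSums(adj)

# Eigenvector centrality: eigenvector of the largest eigenvalue of the
# adjacency matrix (eigen() returns eigenvalues in decreasing order).
ev   <- eigen(adj, symmetric = TRUE)
lead <- abs(ev$vectors[, 1])
eigen_centrality <- setNames(lead / max(lead), rownames(adj))

print(degree_centrality)
print(round(eigen_centrality, 3))
```

Vertex `c` is connected to every other vertex, so it ranks first on both measures.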

To summarize the main results of the `networkStat` function, use the generic function `summary`. It displays the main information about the network and vertex description through several tables.

`summary(netstat, k=10)`

## Three Fields Plot

Visualize the main items of three fields (e.g. authors, keywords, journals), and how they are related through a Sankey diagram.

`threeFieldsPlot(M, fields = c("AU", "DE", "SO"), n = c(20, 20, 20), width = 1200, height = 600)`

## Visualizing Bibliographic Networks

`biblioNetwork` calculates, starting from a bibliographic data frame, the most frequently used networks: Coupling, Co-citation, Co-occurrences, and Collaboration.

`biblioNetwork` uses two arguments to define the network to compute:

- the `analysis` argument can be "co-citation", "coupling", "collaboration", or "co-occurrences";
- the `network` argument can be "authors", "references", "sources", "countries", "universities", "keywords", "author_keywords", "titles" and "abstracts".

For more information on visualizing networks with the function `networkPlot` and with the VOSviewer software by Nees Jan van Eck and Ludo Waltman, see the package documentation.

### Country Scientific Collaboration

```
# Create a country collaboration network
M <- metaTagExtraction(M, Field = "AU_CO", sep = ";")
NetMatrix <- biblioNetwork(M, analysis = "collaboration", network = "countries", sep = ";")
# Plot the network
net = networkPlot(NetMatrix, n = dim(NetMatrix)[1],
Title = "Country Collaboration", type = "circle",
size=TRUE, remove.multiple=FALSE,labelsize=0.7,cluster="none")
```

```
# Collaboration networks (authors)
NetMatrix <- biblioNetwork(M, analysis = "collaboration", network = "authors", sep = ";")
# Plot the network
net = networkPlot(NetMatrix, n = 30,
Title = "Collaboration Network authors", type = "auto",
size=T, remove.multiple=FALSE, labelsize=0.7,edgesize = 5)
```

```
# Collaboration networks (universities)
NetMatrix <- biblioNetwork(M, analysis = "collaboration", network = "universities", sep = ";")
# Plot the network
net = networkPlot(NetMatrix, n = 30,
Title = "Collaboration Network universities", type = "auto",
size=T, remove.multiple=FALSE, labelsize=0.7,edgesize = 5)
```

### Co-Citation Network

In a co-citation network, two references are linked when both are cited by the same article in the collection. The network therefore highlights the references that the articles in the collection share.

```
# Create a co-citation network
NetMatrix <- biblioNetwork(M, analysis = "co-citation", network = "references", sep = ";")
# Plot the network
net = networkPlot(NetMatrix, n = 30,
Title = "Co-Citation Network", type = "fruchterman",
size=T, remove.multiple=FALSE, labelsize=0.7,edgesize = 5)
```

```
# Journal (Source) co-citation analysis
M = metaTagExtraction(M, "CR_SO", sep=";")
NetMatrix <- biblioNetwork(M, analysis = "co-citation", network = "sources", sep = ";")
# Plot the network
net = networkPlot(NetMatrix, n = 50,
Title = "Journals'Co-Citation Network", type = "auto",
size.cex=TRUE, size=15, remove.multiple=FALSE, labelsize=0.7, edgesize = 10, edges.min=5)
```

### Co-occurrences network

```
# keywords
NetMatrix <- biblioNetwork(M, analysis = "co-occurrences", network = "keywords", sep = ";")
# Plot the network
net = networkPlot(NetMatrix, normalize="association", weighted=T, n = 30,
Title = "Keyword Co-occurrences", type = "fruchterman",
size=T, edgesize = 5, labelsize=0.7)
```

## Co-Word Analysis

The aim of the co-word analysis is to map the conceptual structure of a framework using the word co-occurrences in a bibliographic collection.

The analysis can be performed through dimensionality reduction techniques such as Multidimensional Scaling (MDS), Correspondence Analysis (CA) or Multiple Correspondence Analysis (MCA).

Here, we show an example using the function `conceptualStructure`, which performs a CA or MCA to draw a conceptual structure of the field and K-means clustering to identify clusters of documents that express common concepts. Results are plotted on a two-dimensional map.

`conceptualStructure` includes natural language processing (NLP) routines (see the function `termExtraction`) to extract terms from titles and abstracts. In addition, it implements Porter’s stemming algorithm to reduce inflected (or sometimes derived) words to their word stem, base or root form.
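The general recipe (reduce word co-occurrences to two dimensions, then cluster) can be sketched with base R alone. The example below is a simplified stand-in that uses classical MDS (`cmdscale`) instead of CA or MCA and clusters keywords rather than documents; the Documents x Keywords matrix is hypothetical, and none of this reproduces the internals of `conceptualStructure`.

```r
# Hypothetical Documents x Keywords binary matrix (7 documents, 5 keywords).
X <- matrix(c(1, 1, 1, 0, 0,
              1, 1, 0, 0, 0,
              1, 0, 1, 0, 0,
              0, 0, 0, 1, 1,
              0, 0, 0, 1, 1,
              0, 0, 0, 0, 1,
              1, 0, 0, 1, 0),
            nrow = 7, byrow = TRUE,
            dimnames = list(paste0("doc", 1:7),
                            c("NETWORK", "GRAPH", "CENTRALITY", "CITATION", "IMPACT")))

# Keyword co-occurrence matrix, normalized to a Salton similarity in [0, 1].
cooc <- crossprod(X)
d    <- diag(cooc)
sim  <- cooc / sqrt(outer(d, d))

# Turn similarity into a dissimilarity and project keywords onto two dimensions.
coords <- cmdscale(as.dist(1 - sim), k = 2)

# K-means on the two-dimensional coordinates (seed fixed for reproducibility).
set.seed(1)
clusters <- kmeans(coords, centers = 2, nstart = 10)$cluster
print(split(rownames(coords), clusters))
```

Keywords that tend to appear in the same documents end up close together on the map and are grouped into the same cluster.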

```
# Conceptual Structure using keywords (method="CA")
CS <- conceptualStructure(M,field="ID", method="CA", minDegree=4, clust=5, stemming=FALSE, labelsize=10, documents=10)
```

## Historical Direct Citation Network

The historiographic map is a graph proposed by E. Garfield to represent a chronological network map of the most relevant direct citations resulting from a bibliographic collection (Garfield, 2016).

The function `histNetwork` generates a chronological direct citation network matrix, which can be plotted using `histPlot`:

```
# Create a historical citation network
options(width=130)
histResults <- histNetwork(M, min.citations = 10, sep = ";")
# Plot the historical direct citation network
net <- histPlot(histResults, n=15, size = 10, labelsize=5)
```
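The notion of a chronological direct citation matrix can be illustrated with a hand-made collection in base R. In the sketch below, the four papers, their years and their citation lists are all hypothetical; rows are citing papers, columns are cited papers, and the column sums count citations received from within the collection. It illustrates the concept only and does not reproduce `histNetwork`.

```r
# Hypothetical collection, ordered by publication year.
papers <- data.frame(
  id = c("P1", "P2", "P3", "P4"),
  PY = c(2001, 2005, 2010, 2015),
  stringsAsFactors = FALSE
)
# IDs of the papers in the collection that each paper cites.
cites <- list(P1 = character(0), P2 = "P1", P3 = c("P1", "P2"), P4 = "P3")

# Direct citation matrix: rows = citing paper, columns = cited paper.
H <- sapply(papers$id, function(cited) {
  sapply(papers$id, function(citing) as.integer(cited %in% cites[[citing]]))
})
print(H)

# Citations received from within the collection.
local_citations <- colSums(H)
print(local_citations)
```

Because papers can only cite earlier work, all nonzero entries fall below the diagonal when rows and columns are sorted by year.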

```
library(bibliometrix)
library(reshape2)
library(ggplot2)
# Cumulative yearly occurrences (cdf = TRUE) of the 15 most frequent author keywords
kword <- KeywordGrowth(M, Tag = "DE", sep = ";", top = 15, cdf = TRUE)
# Reshape to long format: one row per (Year, keyword) pair
DF <- melt(kword, id = "Year")
# Timeline of keyword usage, one facet per keyword
ggplot(DF, aes(x = Year, y = value, group = variable, shape = variable, colour = variable)) +
  geom_point() + geom_line() +
  scale_shape_manual(values = 1:15) +
  scale_x_continuous(breaks = seq(min(DF$Year), max(DF$Year), by = 5)) +
  scale_y_continuous(breaks = seq(0, max(DF$value), by = 10)) +
  guides(color = guide_legend(title = "Author Keywords"), shape = "none") +
  labs(y = "Count", title = "Author's Keywords Usage Evolution Over Time") +
  theme(text = element_text(size = 10)) +
  facet_grid(variable ~ .)
```

## Biblioshiny

Finally, a Shiny app has been developed by the creators of Bibliometrix to facilitate bibliometric analysis. A tutorial is available.

`biblioshiny()`

## References

Aria, Massimo, and Corrado Cuccurullo. 2017. “Bibliometrix: An R-Tool for Comprehensive Science Mapping Analysis.” Journal of Informetrics 11 (4): 959–75. https://doi.org/10.1016/j.joi.2017.08.007.

Garfield, Eugene. 2016. “Historiographic Mapping of Knowledge Domains Literature.” Journal of Information Science, July. https://doi.org/10.1177/0165551504042802.

Kumar, Sameer, and Jariah Mohd. Jan. 2013. “Mapping Research Collaborations in the Business and Management Field in Malaysia, 1980–2010.” Scientometrics 97 (3): 491–517. https://doi.org/10.1007/s11192-013-0994-8.

## Acknowledgments

To cite this course:

Warin, Thierry. 2020. “Covid-19 Simulation: A Data Science Perspective.” doi:10.6084/m9.figshare.12020994.v1.