Exploratory data analysis

Content

Plan

In Part 1, Background, you will learn about the context of the exercise. In Part 2, Using Microsoft Excel, you'll use a Microsoft Excel spreadsheet to read a csv file and count articles year by year. In Part 3, Using the R language, you will learn how to use a data package and perform the same count by year, this time in R.

Instructions

The exercise is divided into several parts offering different modes of interaction. The interactions provide you with additional information, test your knowledge or ask you to write your own bits of code. You don't have to solve the exercises in the order given, but as they complement each other, it's advisable to do them in order. Below you'll find the different types of interaction with their corresponding functions:

Information boxes provide additional information.
Code boxes require you to interact with pieces of code and are marked as Tasks. The process of resolving code boxes is fairly intuitive:
startover: Cleans up your code box to keep only the preset code.
solution: Displays the task solution.
run code: Executes the code without checking its correctness.
submit answer: Like run code, executes the chunk, but this time also checks your answer for correctness.

Background

In these uncertain times, you have been hired by the World Health Organization (WHO) to gather information about Covid-19. As an analyst, you decide to look at bibliographic data, since hundreds of articles have been written on this subject since the start of the pandemic.

Using Microsoft Excel

To accomplish your mission, you first try to use Microsoft Excel to analyze the bibliographical data on Covid-19.

To help you in your research, a csv file has been made available containing the bibliographic data required for your analysis.

Task 1: Download the EpiBib.csv file from the following address: https://warin.ca/datalake/epiBib/EpiBib.csv.

Task 2: Open the csv file in Microsoft Excel.

*Try to do this by yourself first, without consulting the "How to open a csv" info box. If, after several attempts, you are unable to do so, please consult the "How to open a csv" info box.

Info: How to open a csv [Click here]

Open Excel, create a new document and access the Data tab.
Click on "Text file".
Choose the file EpiBib.csv.
A window will open, select the "Delimited" type and click Next.
Check the "Comma" option, click Next, and then click Finish.

Task 3: Discover the csv file, browse through the data.

Info: Variable descriptions [Click here]

Column	Description	Column	Description
AU	Authors	ISSN	Source code
TI	Document title	VOL	Volume
AB	Abstract	ISSUE	Issue Number
PY	Year	LT	Language
DT	Document Type	C1	Author's Address
MESH	Medical vocabulary	RP	Reprint address
TC	Number of times cited	ID	PubMed ID
SO	Name of publication (or source)	DE	Author keywords
J9	Source abbreviation	UT	Unique article identifier
JI	ISO Source abbreviation	AU_CO	Author's country of origin
DI	Digital Object Identifier (DOI)	DB	Bibliographic database

Task 4: Using Microsoft Excel, find the number of articles published each year. You have 5 minutes.

Using the R language

After several attempts to accomplish your task in Microsoft Excel, you hear about the R programming language, which enables faster and more efficient data analysis.

You decide to find out more about it. After some research, you discover that the data has been made available via a data package called EpiBibR and published here:

Warin T. “Global Research on Coronaviruses: An R Package.” J Med Internet Res 2020;22(8):e19615. DOI: 10.2196/19615, PMID: 32730218, PMCID: 7423387

For the purposes of this exercise, we'll be using a sample of this data.

Info: EpiBibR [Click here]

EpiBibR, which stands for "epidemiology-based bibliography for R", is a data package in R. To collect the references, the procedure used by the Allen Institute for AI for their CORD-19 project was adopted. A similar query was applied on PubMed to build the bibliographic data: "COVID-19" OR "Coronavirus" OR "Corona virus" OR "2019-nCoV" OR "SARS-CoV" OR "MERS-CoV" OR "Severe Acute Respiratory Syndrome" OR "Middle East Respiratory Syndrome".
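
The query above is a list of search terms joined by OR. As a minimal sketch (this is not the package's actual collection code), the string can be rebuilt in R from a vector of terms. The names terms and query are illustrative only.

```r
# Search terms quoted in the text above.
terms <- c("COVID-19", "Coronavirus", "Corona virus", "2019-nCoV",
           "SARS-CoV", "MERS-CoV", "Severe Acute Respiratory Syndrome",
           "Middle East Respiratory Syndrome")

# Wrap each term in double quotes, then join the terms with " OR ".
query <- paste(sprintf('"%s"', terms), collapse = " OR ")
print(query)
```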

Task 1: Store the sample data, which is available under the name EpiBib_data, in a variable called mydata.

mydata <-

mydata <- EpiBib_data
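
In the exercise, EpiBib_data is already loaded for you. Outside the tutorial, a comparable table could be read from a csv file with base R's read.csv(). The sketch below writes a tiny file to a temporary location so that it runs without a network connection. Its three rows are invented for illustration and are not taken from EpiBib.csv, and the names csv_path and toy_data are hypothetical.

```r
# Write a tiny, invented csv to a temporary file (illustration only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("TI,PY,LT",
             "Paper A,2019,eng",
             "Paper B,2020,eng",
             "Paper C,2020,fre"),
           csv_path)

# Read the file into a data frame and store it in a variable.
toy_data <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the structure: 3 rows and 3 columns (TI, PY, LT).
str(toy_data)
```

Passing the address given in the Excel part (https://warin.ca/datalake/epiBib/EpiBib.csv) to read.csv() instead of csv_path should work the same way, assuming the file is comma-delimited with a header row.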

Task 2: Now that the data has been loaded, you want to display the data table so you can observe its structure. Use the head() function with its second argument (n=) to display the first 6 rows. If you omit n, head() also returns the first 6 rows by default, so the argument only matters when you want a different number of rows.

mydata <-
head(mydata, n=___)

mydata <- EpiBib_data
head(mydata, n=6)

*Note: You can click on the triangle at the top right of the results table to navigate through the table columns.
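
To see how the n argument of head() behaves, here is a small sketch on a stand-in table. toy_data is invented for illustration and only mimics the PY column of the real data.

```r
# Invented stand-in with 10 rows; only the PY column mimics the real data.
toy_data <- data.frame(
  ID = 1:10,
  PY = c(2018, 2019, 2019, 2020, 2020, 2020, 2020, 2019, 2020, 2018)
)

# With n given explicitly, head() returns that many rows.
first_three <- head(toy_data, n = 3)

# Without n, base R's head() falls back to its default of 6 rows.
first_default <- head(toy_data)

print(nrow(first_three))    # 3
print(nrow(first_default))  # 6
```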

Task 3: Now you want to find the number of articles published per year. To do this, count the number of articles for each year (column PY) using the count() function in the dplyr package.

mydata <-
head(mydata, n=___)
dplyr::count(mydata, ___)

mydata <- EpiBib_data
head(mydata, n=6)
dplyr::count(mydata, PY)


The result lists every year available in the data table (column PY) alongside the number of articles published that year (column n).
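
As a cross-check, the same kind of count can be reproduced on a small invented table. The sketch below uses base R's table() so that it runs without extra packages, and it calls dplyr::count() only if dplyr is installed. toy_data and its values are made up for illustration.

```r
# Invented stand-in: one row per article, PY is the publication year.
toy_data <- data.frame(
  TI = paste("Paper", 1:7),
  PY = c(2018, 2019, 2019, 2020, 2020, 2020, 2019)
)

# Base R: table() counts how often each year occurs.
# Note that as.data.frame() returns the years as text here, not numbers.
counts_base <- as.data.frame(table(PY = toy_data$PY),
                             responseName = "n",
                             stringsAsFactors = FALSE)
print(counts_base)

# dplyr equivalent, as used in Task 3 (run only if dplyr is available).
if (requireNamespace("dplyr", quietly = TRUE)) {
  counts_dplyr <- dplyr::count(toy_data, PY)
  print(counts_dplyr)
}
```

In this toy table, 2018 appears once and 2019 and 2020 appear three times each, so both approaches should report one row per year with those counts.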

Acknowledgments

To cite this course:

Warin, Thierry. 2020. “Nüance-R: R Courses.” doi:10.6084/m9.figshare.11744013.v2.