Plan

In Part 1, Background, you will learn about the context of the exercise. In Part 2, Using Microsoft Excel, you will use Microsoft's Excel spreadsheet to read a csv file and perform a year-by-year count. In Part 3, Using the R language, you will learn how to use a data package and perform the same count by year, this time in R.

Instructions

The exercise is divided into several parts offering different modes of interaction. The interactions provide you with additional information, test your knowledge or ask you to write your own bits of code. You don't have to solve the exercises in the order given, but as they complement each other, it's advisable to do them in order. Below you'll find the different types of interaction with their corresponding functions:

  • Information boxes provide additional information.

  • Code boxes require you to interact with pieces of code and are marked as Tasks. Each code box offers four buttons:

      • startover: Resets the code box so that only the preset code remains.
      • solution: Displays the task solution.
      • run code: Executes the code without checking its correctness.
      • submit answer: Executes the code like run code, but also checks your answer for correctness.

Background

In these uncertain times, you have been hired by the World Health Organization (WHO) to gather information about Covid-19. As an analyst, you decide to look at bibliographic data, since hundreds of articles have been written on this subject since the start of the pandemic.

Using Microsoft Excel

To accomplish your mission, you first try to use Microsoft Excel to analyze the bibliographical data on Covid-19.

To help you in your research, a csv file has been made available containing the bibliographic data required for your analysis.

Task 1: Download the EpiBib.csv file from the following address: https://warin.ca/datalake/epiBib/EpiBib.csv.


Task 2: Open the csv file in Microsoft Excel.

*Try this on your own first. If you are still stuck after several attempts, consult the "How to open a csv" info box.

Info: How to open a csv

  1. Open Excel, create a new document and access the Data tab.

  2. Click on "Text file".

  3. Choose the file EpiBib.csv.

  4. In the window that opens, select the "Delimited" type and click Next.

  5. Check the "Comma" option, click Next, and then click Finish.


Task 3: Explore the csv file and browse through the data.

Info: Variable descriptions

Column  Description                      Column  Description
AU      Authors                          ISSN    Source code
TI      Document title                   VOL     Volume
AB      Abstract                         ISSUE   Issue number
PY      Year                             LT      Language
DT      Document type                    C1      Author's address
MESH    Medical vocabulary               RP      Reprint address
TC      Number of times cited            ID      PubMed ID
SO      Name of publication (or source)  DE      Author keywords
J9      Source abbreviation              UT      Unique article identifier
JI      ISO source abbreviation          AU_CO   Author's country of origin
DI      Digital Object Identifier (DOI)  DB      Bibliographic database


Task 4: Using Microsoft Excel, take 5 minutes to find the number of articles published each year.

Using the R language

After several attempts to accomplish your task in Microsoft Excel, you hear about the R programming language, which enables faster and more efficient data analysis.

You decide to find out more about it. After some research, you discover that the data has been made available via a data package called EpiBibR and published here:

For the purposes of this exercise, we'll be using a sample of this data.

Info: EpiBibR

EpiBibR, which stands for "epidemiology-based bibliography for R", is a data package in R. To collect the references, the procedure used by the Allen Institute for AI for their CORD-19 project was adopted. A similar query was applied on PubMed to build the bibliographic data: "COVID-19" OR "Coronavirus" OR "Corona virus" OR "2019-nCoV" OR "SARS-CoV" OR "MERS-CoV" OR "Severe Acute Respiratory Syndrome" OR "Middle East Respiratory Syndrome".
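
The search expression above is a list of quoted terms joined with OR. As a minimal sketch, it can be assembled in base R as shown below. The variable names terms and query are illustrative only and are not part of EpiBibR.

```r
# Search terms listed in the EpiBibR description above
terms <- c("COVID-19", "Coronavirus", "Corona virus", "2019-nCoV",
           "SARS-CoV", "MERS-CoV",
           "Severe Acute Respiratory Syndrome",
           "Middle East Respiratory Syndrome")

# Wrap each term in double quotes, then join all terms with " OR "
# (terms and query are hypothetical names chosen for this sketch)
query <- paste(paste0('"', terms, '"'), collapse = " OR ")

# Print the assembled search expression
cat(query, "\n")
```

Keeping the terms in a vector makes it easy to add or remove a keyword without retyping the whole expression.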


Task 1: Store the sample data, named EpiBib_data, in a variable called mydata.

Preset code:
mydata <-

Solution:
mydata <- EpiBib_data

Task 2: Now that the data has been loaded, display the data table so you can observe its structure. Use the head() function with its second argument (n=) to display the first 6 lines. If you omit n, head() also displays 6 lines by default; set n explicitly whenever you want a different number.

Preset code:
mydata <-
head(mydata, n=___)

Solution:
mydata <- EpiBib_data
head(mydata, n=6)

*Note: You can click on the triangle at the top right of the results table to navigate through the table columns.
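
To see how n changes what head() returns, here is a small sketch on a made-up data frame. The data frame toy and its values are hypothetical stand-ins, since EpiBib_data is only available inside the tutorial; only the column names TI and PY come from the variable descriptions.

```r
# Hypothetical stand-in for EpiBib_data (all values invented for illustration)
toy <- data.frame(
  TI = paste("Article", 1:10),
  PY = c(2019, 2020, 2020, 2020, 2019, 2018, 2020, 2019, 2020, 2020)
)

first_default <- head(toy)         # no n given: head() returns the first 6 rows
first_three   <- head(toy, n = 3)  # explicit n: only the first 3 rows

print(nrow(first_default))
print(nrow(first_three))
print(first_three)
```

Passing n explicitly, as the task does, documents the intent even when the value matches the default.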

Task 3: Now you want to find the number of articles published per year. To do this, count the number of articles for each year (column PY) using the count() function in the dplyr package.

Preset code:
mydata <-
head(mydata, n=___)
dplyr::count(mydata, ___)

Solution:
mydata <- EpiBib_data
head(mydata, n=6)
dplyr::count(mydata, PY)

*Note: You can click on the triangle at the top right of the results table to navigate through the table columns.

You get the number of articles published for all years available in the data table.
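
The same count can be sketched on a small made-up table. The CSV text and the names csv_text, toy and per_year are hypothetical. Base R's table() is used so that the sketch runs without extra packages; the dplyr::count() call from Task 3 runs only if dplyr is installed.

```r
# Hypothetical csv content with the TI and PY columns (values are invented)
csv_text <- "TI,PY
Article 1,2019
Article 2,2020
Article 3,2020
Article 4,2018
Article 5,2020
Article 6,2019"

# Read the comma-delimited text into a data frame
toy <- read.csv(text = csv_text)

# Base R: number of rows (articles) for each value of PY
per_year <- as.data.frame(table(PY = toy$PY), responseName = "n")
print(per_year)

# Same count with dplyr, as in Task 3, when the package is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  print(dplyr::count(toy, PY))
}
```

Both approaches return one row per year with the number of articles in a column named n. Note that table() stores PY as a factor, whereas dplyr::count() keeps the original column type.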

Acknowledgments

To cite this course:

Warin, Thierry. 2020. “Nüance-R: R Courses.” doi:10.6084/m9.figshare.11744013.v2.
