Team 1: Sentiment Analysis of Covid-19 Tweets

Estelle Sitbon, Maxime Gouin, Sarah Dahan
05-25-2020

Pedagogical Material

As part of the Hacking Health Covid-19 hackathon, the SKEMA Global Lab in AI provided SKEMA students with a fully developed data science environment to carry out their project. See [here].

For this specific module, the team drew on the following courses:

Project Presentation

In this frantic period of quarantine, the Hacking Health hackathon comes just in time: we are all flooded with information on Covid-19, and we have all talked and thought about how that information is managed.

Within our team, we felt that natural language processing and sentiment analysis were among the most impactful approaches, so we started from there. After discussing the technical details (covered below), we focused our attention on a sentiment analysis of words appearing under the hashtags #covid19 and #coronavirus on Twitter. After removing stop words and cleaning our data, we took the top words and split them into positive and negative terms.

Technical Process


# Loading packages
library(rtweet)
library(dplyr)
library(tidyr)
library(tidytext)

As we needed an authentication token for the Twitter API, we followed the rtweet course to learn how to obtain one.


# store api keys (these are fake example values; replace with your own keys)
app_name = 'Covid 19 tweet emotion analyze'
consumer_key = 'xxxxxxxx'
consumer_secret = 'xxxxxxxxx'
access_token = 'xxxxxxxxxx'
access_secret = 'xxxxxxxxx'

# authenticate via web browser
create_token(app = app_name,
             consumer_key = consumer_key,
             consumer_secret = consumer_secret,
             access_token = access_token,
             access_secret = access_secret)
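
Note that in the rtweet release current at the time of writing, create_token() also saves the token to your home directory and registers it through the TWITTER_PAT environment variable, so later rtweet calls in the same project authenticate automatically.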

Here we search for tweets containing the hashtags #coronavirus and #covid19, restricted to English and geocoded to the USA.


# the language filter belongs inside the query string
rt_coronavirus <- search_tweets(
  "#coronavirus lang:en", geocode = lookup_coords("usa"), n = 10000
)

rt_covid19 <- search_tweets(
  "#covid19 lang:en", geocode = lookup_coords("usa"), n = 10000
)
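
The standard search endpoint is rate-limited (roughly 18,000 tweets per 15-minute window), so it is worth caching what a search returns. A minimal sketch using rtweet's write_as_csv(); the ./data/ file names are our own convention and match the script in the tl;dr section below:

# cache the search results so the analysis can be re-run offline
write_as_csv(rt_coronavirus, "./data/coronavirus.csv")
write_as_csv(rt_covid19, "./data/covid19.csv")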

We needed to process each set of tweets into tidy text.


# search_tweets() returns the tweet text in a column named `text`;
# keep only that column, renamed to `tweet`
tweets.Coronavirus = rt_coronavirus %>%
  select(tweet = text)

tweets.Covid19 = rt_covid19 %>%
  select(tweet = text)

We then transformed the text to clean up the tweets, removing punctuation, links to web pages, and stop words (the tidytext package ships a list of over 1,000 stop words).


## remove links (http and https)
tweets.Coronavirus$tweet <- gsub("http\\S+", "", tweets.Coronavirus$tweet)
tweets.Covid19$tweet <- gsub("http\\S+", "", tweets.Covid19$tweet)

# Tokenization 
tidy_tweets_coronavirus <- tweets.Coronavirus %>% 
  unnest_tokens(word, tweet) %>%
  anti_join(stop_words, by = "word")

# Tokenization 
tidy_tweets_covid19 <- tweets.Covid19 %>% 
  unnest_tokens(word, tweet) %>%
  anti_join(stop_words, by = "word")

## Check data
head(tidy_tweets_coronavirus)

         word
1   elizabeth
2    warren’s
3     brother
4        dies
5 coronavirus
6         sen

head(tidy_tweets_covid19)

        word
1     vmware
2 supporting
3 techsoup's
4    covid19
5   response
6       fund
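
As a quick sanity check, the stop-word list itself can be inspected; at the time of writing, tidytext combines three lexicons:

nrow(stop_words)            # 1149 stop words in total
count(stop_words, lexicon)  # drawn from the onix, SMART, and snowball lexicons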

We then extracted the top 10 words to see what the population is most concerned about.


library(ggplot2)

## Top 10 words for #coronavirus ##
tidy_tweets_coronavirus %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() + 
  theme_classic() +
  labs(x = "Unique words",
       y = "Count",
       title = "Unique word counts found in #coronavirus tweets")


## Top 10 words for #covid19 ##
tidy_tweets_covid19 %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() + 
  theme_classic() +
  labs(x = "Unique words",
       y = "Count",
       title = "Unique word counts found in #covid19 tweets")

Here we perform a sentiment analysis with the Bing lexicon, via the get_sentiments() function from tidytext.


## Sentiment analysis for #coronavirus ##
bing_coronavirus = tidy_tweets_coronavirus %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

## Sentiment analysis for #covid19 ##
bing_covid19 = tidy_tweets_covid19 %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
  
## Plot it side by side to compare positive vs negative emotions ##
## First for #coronavirus ##
bing_coronavirus %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y" ) +
  labs(title = "Tweets containing #coronavirus", 
       y = "Contribution to sentiment",
       x = NULL) +
  coord_flip() + theme_bw()


## Second for #covid19 ##
bing_covid19 %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y" ) +
  labs(title = "Tweets containing #covid19", 
       y = "Contribution to sentiment",
       x = NULL) +
  coord_flip() + theme_bw()
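
Beyond the word-level plots, the two hashtags can be compared at a glance by collapsing the counts into an overall positive/negative share per hashtag. A minimal sketch reusing the bing_coronavirus and bing_covid19 data frames built above:

## Overall share of positive vs negative words per hashtag ##
bind_rows(
  mutate(bing_coronavirus, hashtag = "#coronavirus"),
  mutate(bing_covid19, hashtag = "#covid19")
) %>%
  count(hashtag, sentiment, wt = n) %>%  # total word occurrences per sentiment
  group_by(hashtag) %>%
  mutate(share = n / sum(n)) %>%         # proportion within each hashtag
  ungroup()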

tl;dr


# Loading packages
library(rtweet)
library(dplyr)
library(tidyr)
library(tidytext)

# store api keys (these are fake example values; replace with your own keys)
app_name = 'Covid 19 tweet emotion analyze'
consumer_key = 'xxxxxxxx'
consumer_secret = 'xxxxxxxxx'
access_token = 'xxxxxxxxxx'
access_secret = 'xxxxxxxxx'

# authenticate via web browser
create_token(app = app_name,
             consumer_key = consumer_key,
             consumer_secret = consumer_secret,
             access_token = access_token,
             access_secret = access_secret)

# the language filter belongs inside the query string
rt_coronavirus <- search_tweets(
  "#coronavirus lang:en", geocode = lookup_coords("usa"), n = 10000
)

rt_covid19 <- search_tweets(
  "#covid19 lang:en", geocode = lookup_coords("usa"), n = 10000
)

# alternatively, load cached copies of the search results
rt_covid19 <- read.csv("./data/covid19.csv")
rt_coronavirus <- read.csv("./data/coronavirus.csv")

# keep only the tweet text, renamed from `text` to `tweet`
tweets.Coronavirus = rt_coronavirus %>%
  select(tweet = text)

tweets.Covid19 = rt_covid19 %>%
  select(tweet = text)

## remove links (http and https)
tweets.Coronavirus$tweet <- gsub("http\\S+", "", tweets.Coronavirus$tweet)
tweets.Covid19$tweet <- gsub("http\\S+", "", tweets.Covid19$tweet)

# Tokenization 
tidy_tweets_coronavirus <- tweets.Coronavirus %>% 
  unnest_tokens(word, tweet) %>%
  anti_join(stop_words, by = "word")

# Tokenization 
tidy_tweets_covid19 <- tweets.Covid19 %>% 
  unnest_tokens(word, tweet) %>%
  anti_join(stop_words, by = "word")

## Check data
head(tidy_tweets_coronavirus)
head(tidy_tweets_covid19)

library(ggplot2)

## Top 10 words for #coronavirus ##
tidy_tweets_coronavirus %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() + 
  theme_classic() +
  labs(x = "Unique words",
       y = "Count",
       title = "Unique word counts found in #coronavirus tweets")

## Top 10 words for #covid19 ##
tidy_tweets_covid19 %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() + 
  theme_classic() +
  labs(x = "Unique words",
       y = "Count",
       title = "Unique word counts found in #covid19 tweets")

## Sentiment analysis for #coronavirus ##
bing_coronavirus = tidy_tweets_coronavirus %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

## Sentiment analysis for #covid19 ##
bing_covid19 = tidy_tweets_covid19 %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
  
## Plot it side by side to compare positive vs negative emotions ##
## First for #coronavirus ##
bing_coronavirus %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y" ) +
  labs(title = "Tweets containing #coronavirus", 
       y = "Contribution to sentiment",
       x = NULL) +
  coord_flip() + theme_bw()


## Second for #covid19 ##
bing_covid19 %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y" ) +
  labs(title = "Tweets containing #covid19", 
       y = "Contribution to sentiment",
       x = NULL) +
  coord_flip() + theme_bw()

To go further with our pedagogical platform

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".