Team 1: Sentiment Analysis of Covid-19 Tweets

Estelle Sitbon, Maxime Gouin, Sarah Dahan
05-25-2020

Pedagogical Material

As part of the Hacking Health Covid-19 hackathon, the SKEMA Global Lab in AI provided SKEMA students with a fully developed data science environment to carry out their project. See [here].

For this specific module, the team drew on the following courses:

Project Presentation

In this frantic period of quarantine, the Hacking Health hackathon comes just in time: we are all flooded with information on Covid-19, and we have all talked and thought about how that information is managed.

Within our team, we felt that natural language processing and sentiment analysis were among the most impactful approaches, so we started from there. After discussing the technical details (covered below), we focused our attention on a sentiment analysis of words appearing under the hashtags #covid19 and #coronavirus on Twitter. After removing stop words and cleaning our data, we took the top words and split them into positive and negative terms.

Technical Process


# Loading packages
library(rtweet)
library(dplyr)
library(tidyr)
library(tidytext)

As we needed an authentication token for the Twitter API, we followed the rtweet course to learn how to obtain one.


# store api keys (these are fake example values; replace with your own keys)
app_name = 'Covid 19 tweet emotion analyze'
consumer_key = 'xxxxxxxx'
consumer_secret = 'xxxxxxxxx'
access_token = 'xxxxxxxxxx'
access_secret = 'xxxxxxxxx'

# authenticate via web browser
create_token(app = app_name,
             consumer_key = consumer_key,
             consumer_secret = consumer_secret,
             access_token = access_token,
             access_secret = access_secret)
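
Note that in the rtweet release current at the time of writing, create_token() also saves the token to your home directory and registers it through the TWITTER_PAT environment variable, so later rtweet calls in the same project authenticate automatically.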

Here we search for tweets containing the hashtags #coronavirus and #covid19, restricted to English and geocoded to the USA.


# the language filter belongs inside the query string
rt_coronavirus <- search_tweets(
  "#coronavirus lang:en", geocode = lookup_coords("usa"), n = 10000
)

rt_covid19 <- search_tweets(
  "#covid19 lang:en", geocode = lookup_coords("usa"), n = 10000
)
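
The standard search endpoint is rate-limited (roughly 18,000 tweets per 15-minute window), so it is worth caching what a search returns. A minimal sketch using rtweet's write_as_csv(); the ./data/ file names are our own convention and match the script in the tl;dr section below:

# cache the search results so the analysis can be re-run offline
write_as_csv(rt_coronavirus, "./data/coronavirus.csv")
write_as_csv(rt_covid19, "./data/covid19.csv")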

We needed to process each set of tweets into tidy text.


# search_tweets() returns the tweet text in a column named `text`;
# keep only that column, renamed to `tweet`
tweets.Coronavirus = rt_coronavirus %>%
  select(tweet = text)

tweets.Covid19 = rt_covid19 %>%
  select(tweet = text)

We then transformed the text to clean up the tweets, removing punctuation, links to web pages, and stop words (the tidytext package ships a list of over 1,000 stop words).


## remove links (http and https)
tweets.Coronavirus$tweet <- gsub("http\\S+", "", tweets.Coronavirus$tweet)
tweets.Covid19$tweet <- gsub("http\\S+", "", tweets.Covid19$tweet)

# Tokenization 
tidy_tweets_coronavirus <- tweets.Coronavirus %>% 
  unnest_tokens(word, tweet) %>%
  anti_join(stop_words, by = "word")

# Tokenization 
tidy_tweets_covid19 <- tweets.Covid19 %>% 
  unnest_tokens(word, tweet) %>%
  anti_join(stop_words, by = "word")

## Check data
head(tidy_tweets_coronavirus)

         word
1   elizabeth
2    warren’s
3     brother
4        dies
5 coronavirus
6         sen

head(tidy_tweets_covid19)

        word
1     vmware
2 supporting
3 techsoup's
4    covid19
5   response
6       fund
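
As a quick sanity check, the stop-word list itself can be inspected; at the time of writing, tidytext combines three lexicons:

nrow(stop_words)            # 1149 stop words in total
count(stop_words, lexicon)  # drawn from the onix, SMART, and snowball lexicons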

We then extracted the top 10 words to see what the population is most concerned about.


library(ggplot2)

## Top 10 words for #coronavirus ##
tidy_tweets_coronavirus %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() + 
  theme_classic() +
  labs(x = "Unique words",
       y = "Count",
       title = "Unique word counts found in #coronavirus tweets")


## Top 10 words for #covid19 ##
tidy_tweets_covid19 %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() + 
  theme_classic() +
  labs(x = "Unique words",
       y = "Count",
       title = "Unique word counts found in #covid19 tweets")

Here we perform a sentiment analysis with the Bing lexicon, via the get_sentiments() function from tidytext.


## Sentiment analysis for #coronavirus ##
bing_coronavirus = tidy_tweets_coronavirus %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

## Sentiment analysis for #covid19 ##
bing_covid19 = tidy_tweets_covid19 %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
  
## Plot it side by side to compare positive vs negative emotions ##
## First for #coronavirus ##
bing_coronavirus %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y" ) +
  labs(title = "Tweets containing #coronavirus", 
       y = "Contribution to sentiment",
       x = NULL) +
  coord_flip() + theme_bw()


## Second for #covid19 ##
bing_covid19 %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y" ) +
  labs(title = "Tweets containing #covid19", 
       y = "Contribution to sentiment",
       x = NULL) +
  coord_flip() + theme_bw()
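
Beyond the word-level plots, the two hashtags can be compared at a glance by collapsing the counts into an overall positive/negative share per hashtag. A minimal sketch reusing the bing_coronavirus and bing_covid19 data frames built above:

## Overall share of positive vs negative words per hashtag ##
bind_rows(
  mutate(bing_coronavirus, hashtag = "#coronavirus"),
  mutate(bing_covid19, hashtag = "#covid19")
) %>%
  count(hashtag, sentiment, wt = n) %>%  # total word occurrences per sentiment
  group_by(hashtag) %>%
  mutate(share = n / sum(n)) %>%         # proportion within each hashtag
  ungroup()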

tl;dr


# Loading packages
library(rtweet)
library(dplyr)
library(tidyr)
library(tidytext)

# store api keys (these are fake example values; replace with your own keys)
app_name = 'Covid 19 tweet emotion analyze'
consumer_key = 'xxxxxxxx'
consumer_secret = 'xxxxxxxxx'
access_token = 'xxxxxxxxxx'
access_secret = 'xxxxxxxxx'

# authenticate via web browser
create_token(app = app_name,
             consumer_key = consumer_key,
             consumer_secret = consumer_secret,
             access_token = access_token,
             access_secret = access_secret)

# the language filter belongs inside the query string
rt_coronavirus <- search_tweets(
  "#coronavirus lang:en", geocode = lookup_coords("usa"), n = 10000
)

rt_covid19 <- search_tweets(
  "#covid19 lang:en", geocode = lookup_coords("usa"), n = 10000
)

# alternatively, load cached copies of the search results
rt_covid19 <- read.csv("./data/covid19.csv")
rt_coronavirus <- read.csv("./data/coronavirus.csv")

# keep only the tweet text, renamed from `text` to `tweet`
tweets.Coronavirus = rt_coronavirus %>%
  select(tweet = text)

tweets.Covid19 = rt_covid19 %>%
  select(tweet = text)

## remove links (http and https)
tweets.Coronavirus$tweet <- gsub("http\\S+", "", tweets.Coronavirus$tweet)
tweets.Covid19$tweet <- gsub("http\\S+", "", tweets.Covid19$tweet)

# Tokenization 
tidy_tweets_coronavirus <- tweets.Coronavirus %>% 
  unnest_tokens(word, tweet) %>%
  anti_join(stop_words, by = "word")

# Tokenization 
tidy_tweets_covid19 <- tweets.Covid19 %>% 
  unnest_tokens(word, tweet) %>%
  anti_join(stop_words, by = "word")

## Check data
head(tidy_tweets_coronavirus)
head(tidy_tweets_covid19)

library(ggplot2)

## Top 10 words for #coronavirus ##
tidy_tweets_coronavirus %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() + 
  theme_classic() +
  labs(x = "Unique words",
       y = "Count",
       title = "Unique word counts found in #coronavirus tweets")

## Top 10 words for #covid19 ##
tidy_tweets_covid19 %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() + 
  theme_classic() +
  labs(x = "Unique words",
       y = "Count",
       title = "Unique word counts found in #covid19 tweets")

## Sentiment analysis for #coronavirus ##
bing_coronavirus = tidy_tweets_coronavirus %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

## Sentiment analysis for #covid19 ##
bing_covid19 = tidy_tweets_covid19 %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
  
## Plot it side by side to compare positive vs negative emotions ##
## First for #coronavirus ##
bing_coronavirus %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y" ) +
  labs(title = "Tweets containing #coronavirus", 
       y = "Contribution to sentiment",
       x = NULL) +
  coord_flip() + theme_bw()


## Second for #covid19 ##
bing_covid19 %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y" ) +
  labs(title = "Tweets containing #covid19", 
       y = "Contribution to sentiment",
       x = NULL) +
  coord_flip() + theme_bw()

To go further with our pedagogical platform

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".