Nathalie de Marcellis-Warin, PhD (Polytechnique Montréal, CIRANO & OBVIA)
Thierry Warin, PhD (HEC Montréal, CIRANO & OBVIA)
Introduction
Data Collection
Methodology
Results and lessons learned
Conclusion
Twitter Analysis of Vaccination in a Post COVID-19 Era
Our purpose is above all methodological: we examine the feasibility and validity of analyzing online conversations in a context of great importance for humanity, the COVID-19 pandemic.
A Data Science-based use case
RQ1: Can we identify phases in conversations on COVID-19 on social media?
RQ2: Do the conversations follow traditional media information?
"communications" / "conversations"
vertical / horizontal communications
Unstructured real-time data: collecting live tweets + New York Times articles
Natural Language Processing for text analysis
We collected more than 2 million tweets and performed pattern detection
We collected articles from the New York Times corresponding to the same patterns (cosine similarity)
Structured and unstructured data
Structural Topic Model (STM)
With STM, users can model the framing of international newspapers (Roberts, Stewart, and Airoldi 2016b), open-ended survey responses in the American National Election Study (Roberts et al. 2014), online classroom forums (Reich, Tingley, Leder-Luis, Roberts, and Stewart 2015), religious statements (Lucas, Nielsen, Roberts, Stewart, Storer, and Tingley 2015), and lobbying reports (Milner and Tingley 2015), among others.
The purpose is to allow researchers to discover topics and estimate their relationship to document metadata.
The results of the model can be used to perform hypothesis testing on these relationships.
This, of course, mirrors the type of analysis that social scientists perform with other types of data, where the goal is to discover relationships between variables and test hypotheses.
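STM itself is distributed as the stm package in R; as a rough, hypothetical Python stand-in for the core idea (discover topics, then relate their proportions to document metadata), the sketch below fits a tiny NMF topic model with numpy and compares topic proportions across a made-up binary covariate. All data, names, and parameters here are illustrative, not the study's.

```python
import numpy as np

# Toy corpus with a binary metadata covariate (e.g. early vs. late period).
docs = [
    "vaccine trial phase results",
    "vaccine trial data results",
    "dose rollout pfizer approval",
    "dose rollout approval million",
]
period = np.array([0, 0, 1, 1])  # hypothetical metadata: 0 = early, 1 = late

# Document-term matrix (bag-of-words).
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for w in vocab] for d in docs], float)

# Tiny NMF (multiplicative updates) as a stand-in topic model: X ~ W @ H.
rng = np.random.default_rng(0)
k = 2
W = rng.random((X.shape[0], k))
H = rng.random((k, X.shape[1]))
for _ in range(200):
    H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
    W *= (X @ H.T) / (W @ H @ H.T + 1e-9)

# Normalize rows of W into per-document topic proportions, then relate
# each topic to the metadata covariate (the STM idea in miniature).
theta = W / W.sum(axis=1, keepdims=True)
for t in range(k):
    top = [vocab[i] for i in np.argsort(H[t])[::-1][:3]]
    print(f"topic {t}: {top}  mean proportion "
          f"early={theta[period == 0, t].mean():.2f} "
          f"late={theta[period == 1, t].mean():.2f}")
```

Unlike this sketch, STM builds the covariate relationship into the estimation itself, so uncertainty in the topic proportions propagates into the hypothesis tests.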
Period: January 01, 2018 to April 21, 2021
Number of tweets: 2,601,702
Keywords: “vaccine” OR “vaccines” OR “vaccinate” OR “vaccination” OR “vaccineswork” OR “antivax” OR “vaccinesdontwork” OR “provax” OR “vaxwithme” OR “antivaxxers” OR “immunization”
Period: January 01, 2020 to April 21, 2021
Number of tweets: 2,409,522
Part 1 - Period: January 01, 2020 to September 30, 2020
Number of tweets: 374,084
Proportion: 15.5% of total number of tweets
Part 2 - Period: October 01, 2020 to April 21, 2021
Number of tweets: 2,035,438
Proportion: 84.5% of total number of tweets
Event study
We used an event study approach.
To perform this type of analysis, financial specialists use the return, volume, or volatility of stocks; the goal is to quantify the economic impact of an event in terms of so-called abnormal returns. For our study, we chose the number of tweets per day as the series in which to detect anomalies.
We performed anomaly detection on all tweets from the beginning of the pandemic (January 01, 2020 to April 21, 2021), i.e., 2,409,522 tweets.
We noticed an acceleration of the Twitter conversation about vaccination starting on October 1, 2020, so we split the tweets into two parts:
For each part, we performed a new anomaly detection.
Part 1 (01-01-2020 to 30-09-2020): anomaly detection performed with the following parameters:
Part 2 (01-10-2020 to 21-04-2021): anomaly detection performed with adjusted parameters:
We grouped the anomalies into sub-periods.
Time interval of each sub-period:
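The detection and grouping steps above can be sketched as follows. This is a minimal illustration using a trailing-window z-score on simulated daily tweet counts; the window, threshold, gap, and all counts are assumptions for illustration, not the study's actual parameters or data.

```python
import numpy as np

# Simulated daily tweet counts (hypothetical numbers, not the study's data):
# a stable baseline with spikes standing in for news-driven surges.
rng = np.random.default_rng(1)
counts = rng.poisson(1000, 120).astype(float)
counts[40] = 3000
counts[41] = 2800
counts[90] = 4000

def detect_anomalies(x, window=14, threshold=3.0):
    """Flag days whose count deviates from the trailing mean by more
    than `threshold` trailing standard deviations."""
    flags = []
    for i in range(window, len(x)):
        past = x[i - window:i]
        mu, sigma = past.mean(), past.std()
        if sigma > 0 and abs(x[i] - mu) / sigma > threshold:
            flags.append(i)
    return flags

def group_into_subperiods(days, max_gap=3):
    """Merge flagged days separated by at most `max_gap` days into
    (start, end) sub-periods."""
    groups = []
    for d in days:
        if groups and d - groups[-1][-1] <= max_gap:
            groups[-1].append(d)
        else:
            groups.append([d])
    return [(g[0], g[-1]) for g in groups]

anomalous_days = detect_anomalies(counts)
print(group_into_subperiods(anomalous_days))
```

In practice the two parts of the corpus would each get their own parameters, as in the slides, since the baseline volume jumps by an order of magnitude after October 2020.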
Sub-periods of anomalies for the 1st part of the pandemic period
Sub-periods of anomalies for the 2nd part of tweets
Structural Topic Models (STM)
The goal of the STM is to allow researchers to discover topics and estimate their relationship to document metadata.
The following figure presents a heuristic overview of a typical workflow.
First, users ingest the data and prepare it for analysis. Next, a structural topic model is estimated. The ability to estimate the model quickly allows for the evaluation, understanding, and visualization of results [@roberts_stm_2019].
We estimated an STM model to determine the top 10 topics for:
part 1 of the tweets (January 1, 2020 to September 30, 2020),
part 2 of the tweets (October 1, 2020 to April 21, 2021),
each of the 11 sub-periods.
Part 1 Period 1 (March 13-19, 2020)
Part 1 Period 1 (March 13-19, 2020). Topic 8 & 9 respectively.
Part 2 Period 6 (April 10-16, 2021)
Part 2 Period 6 (April 10-16, 2021). Topic 7 & 6 respectively.
Top 3 topics for each of the 11 sub-periods
Date | Topic |
---|---|
13-03-2020 to 19-03-2020 | covid, develop, amp; peopl, get, can; trial, coronavirus, test |
12-05-2020 to 21-05-2020 | get, peopl, need; covid, coronavirus, research; trump, coronavirus, presid |
13-07-2020 to 13-07-2020 | get, peopl, like; amp, need, work; covid, trial, phase |
08-08-2020 to 14-08-2020 | get, peopl, like; russia, first, coronavirus; covid, coronavirus, dose |
05-09-2020 to 19-09-2020 | trump, say, coronavirus; coronavirus, covid, first; covid, develop, work |
Date | Topic |
---|---|
06-11-2020 to 12-11-2020 | get, peopl, flu; pfizer, covid, effect; news, stock, hope |
23-11-2020 to 29-11-2020 | covid, effect, astrazeneca; take, time, now; covid, develop, india |
29-11-2020 to 17-12-2020 | peopl, get, need; first, covid, dose; pfizer, approv, covid |
21-12-2020 to 04-01-2021 | covid, dose, first; covid, health, state; get, take, like |
26-02-2021 to 05-03-2021 | effect, like, just; covid, johnson, dose; covid, get, can |
10-04-2021 to 16-04-2021 | peopl, just, make; johnson, amp, paus; dose, receiv, million |
The classical approach in computational linguistics is to measure similarity based on the content overlap between documents. Documents are represented as bags-of-words, so each document becomes a sparse vector, and the measure of overlap is defined as the angle between the two vectors:
similarity(doc1, doc2) = cos(θ) = (doc1 · doc2) / (|doc1| |doc2|)
Cosine distance (dissimilarity) is then defined as:
distance(doc1, doc2) = 1 − similarity(doc1, doc2)
It is important to note, however, that this is not a proper distance metric in the mathematical sense: it does not satisfy the triangle inequality and it violates the coincidence axiom.
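The cosine measure can be computed directly from bag-of-words counts; a minimal sketch on two illustrative texts (the tweet and article here are made up):

```python
import math
from collections import Counter

def cosine_similarity(doc1, doc2):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

tweet = "vaccine doses arrive as rollout accelerates"
article = "the vaccine rollout accelerates as new doses arrive"
sim = cosine_similarity(tweet, article)
dist = 1 - sim  # cosine dissimilarity, as defined above
print(round(sim, 3), round(dist, 3))  # → 0.866 0.134
```

In the study the vectors would come from the full sub-period corpora rather than single texts, but the computation is the same.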
We calculated the cosine similarity between the tweets of each of the 11 sub-periods and the New York Times (NYT) articles for the same periods.
First, we collected all the NYT articles published during the dates of each sub-period (see the following table).
Second, we filtered the collected NYT articles per sub-period to keep only those that contained at least one of the following words:
The following table displays the number of articles remaining after filtering.
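The keep-if-any-keyword filter can be sketched as below; the keyword set and article titles are hypothetical, since the slide's actual filter list is not shown here.

```python
# Hypothetical keyword list for illustration; the actual filter list
# used in the study is not reproduced on this slide.
keywords = {"vaccine", "vaccination", "immunization"}

articles = [
    "New vaccine trial results announced",
    "Markets rally on earnings news",
    "Immunization campaign expands to rural areas",
]

def keep_article(text, words):
    """Keep an article if it contains at least one of the filter words."""
    tokens = set(text.lower().split())
    return bool(tokens & words)

filtered = [a for a in articles if keep_article(a, keywords)]
print(len(filtered))  # → 2
```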
During the first 5 periods, the conversations were chronologically about:
During the last 6 periods, the conversations reflected the growing level of epidemiological expertise:
Lessons for research:
Lessons for public policy in particular:
RQ1: Can we identify phases in conversations on COVID-19 on social media?
RQ2: Do the conversations follow traditional media information?
No evidence of high similarity between traditional media and social media in these conversations
Some room for optimism:
Unstructured data from two sources: Twitter and New York Times
NLP: natural language analysis protocol
Nowcasting for public policy communications