
Twitter Analysis of Vaccination in a Post COVID-19 Era


Nathalie de Marcellis-Warin, PhD (Polytechnique Montréal, CIRANO & OBVIA)
Thierry Warin, PhD (HEC Montréal, CIRANO & OBVIA)

1 / 24

Agenda

  1. Introduction

  2. Data Collection

  3. Methodology

  4. Results and lessons learned

  5. Conclusion

2 / 24

1. Introduction

3 / 24

1. Introduction

Twitter Analysis of Vaccination in a Post COVID-19 Era

  • Our purpose is primarily methodological: to assess the feasibility and validity of analyzing conversations in a context of great importance for humanity, that of COVID-19.

  • A Data Science-based use case

4 / 24

1. Introduction

RQ1: Can we identify phases in conversations on COVID-19 on social media?

RQ2: Do the conversations follow traditional media information?

  • "communications" / "conversations"

  • vertical / horizontal communications

  • Unstructured real-time data: live tweets plus New York Times articles

  • Natural Language Processing for text analysis

5 / 24

1. Introduction

  • We collected more than 2 million tweets and performed pattern detection

  • We collected articles from the New York Times for the same periods as the detected patterns and compared them with the tweets (cosine similarity)

6 / 24

1. Introduction

  • Structured and unstructured data

    • Textual data are ubiquitous in social science research: traditional media, social media, survey data, and many other sources contribute to the enormous amount of text in the modern information age.
  • Structural Topic Model (STM)

    • The increasing availability of, and interest in, textual data have led to the development of a variety of statistical approaches to analyze these data.
7 / 24

1. Introduction

  • With STM, users can model the framing of international newspapers (Roberts, Stewart, and Airoldi 2016b), open-ended survey responses in the American National Election Study (Roberts et al. 2014), online classroom forums (Reich, Tingley, Leder-Luis, Roberts, and Stewart 2015), religious statements (Lucas, Nielsen, Roberts, Stewart, Storer, and Tingley 2015), lobbying reports (Milner and Tingley 2015), and more.

  • The purpose is to allow researchers to discover topics and estimate their relationship to document metadata.

  • The results of the model can be used to perform hypothesis testing on these relationships.

  • This, of course, mirrors the type of analysis that social scientists perform with other types of data, where the goal is to discover relationships between variables and test hypotheses.

8 / 24

2. Data collection

9 / 24

Temporal Distribution of Tweets

  • Period: January 01, 2018 to April 21, 2021

  • Number of tweets: 2,601,702

  • Keywords: “vaccine” OR “vaccines” OR “vaccinate” OR “vaccination” OR “vaccineswork” OR “antivax” OR “vaccinesdontwork” OR “provax” OR “vaxwithme” OR “antivaxxers” OR “immunization”

  • Period: January 01, 2020 to April 21, 2021

  • Number of tweets: 2,409,522

  • Period: January 01, 2020 to September 30, 2020

  • Number of tweets: 374,084

  • Proportion: 15.5% of the 2,409,522 tweets posted since January 01, 2020

  • Period: October 01, 2020 to April 21, 2021

  • Number of tweets: 2,035,438

  • Proportion: 84.5% of the 2,409,522 tweets posted since January 01, 2020
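
As a quick arithmetic check, the two proportions can be reproduced from the counts on this slide. This is a minimal sketch; the constant and function names are illustrative and do not come from the original analysis.

```python
# Tweet counts reported on this slide.
TOTAL_PANDEMIC = 2_409_522  # January 1, 2020 to April 21, 2021
PART_1 = 374_084            # January 1, 2020 to September 30, 2020
PART_2 = 2_035_438          # October 1, 2020 to April 21, 2021


def share(part: int, total: int) -> float:
    """Return `part` as a percentage of `total`, rounded to one decimal."""
    return round(100 * part / total, 1)


# The two parts partition the pandemic-period tweets exactly.
assert PART_1 + PART_2 == TOTAL_PANDEMIC

print(f"Part 1: {share(PART_1, TOTAL_PANDEMIC)}%")
print(f"Part 2: {share(PART_2, TOTAL_PANDEMIC)}%")
```

Both percentages are taken relative to the tweets collected from January 1, 2020 onward, not relative to the full 2018 to 2021 collection.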

10 / 24

3. Methodology

11 / 24

3.1 Overall Time Series

Event study

  • We used an event study approach.

  • In finance, an event study uses the return, volume, or volatility of stocks to quantify the economic impact of an event as so-called abnormal returns. For our study, we instead used the number of tweets per day to detect anomalies.

  • We performed an anomaly detection on all tweets from the beginning of the pandemic (January 01, 2020 to April 21, 2021), which corresponds to 2,409,522 tweets.

  • We noticed an acceleration of the conversation on Twitter about vaccination starting on October 1, 2020. So we decided to split the tweets into two parts:

    • Part 1: January 01, 2020 to September 30, 2020
    • Number of tweets: 374,084
    • Part 2: October 01, 2020 to April 21, 2021
    • Number of tweets: 2,035,438

For each part we performed a new anomaly detection.

  • Part 1 (01-01-2020 to 30-09-2020): anomaly detection performed with the following parameters:

    • alpha = 0.05 (5% outlier threshold)
    • max_anoms = 0.2 (at most 20% of observations may be flagged as anomalies)
  • Part 2 (01-10-2020 to 21-04-2021): anomaly detection performed with adjusted parameters:

    • alpha = 0.3
    • max_anoms = 0.05

  • We grouped the anomalies into sub-periods.

    • Part 1: we identified 5 sub-periods
    • Part 2: we identified 6 sub-periods
  • Time interval of each sub-period:

    • Start of the interval: date of the first anomaly minus 3 days (period_start)
    • End of the interval: date of the last anomaly plus 3 days (period_end)
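
The interval rule above (first anomaly minus 3 days, last anomaly plus 3 days) can be sketched as follows. This is only an illustration: the slides do not say how anomalies were assigned to a sub-period, so the `max_gap_days` rule and the function name `group_anomalies` are assumptions, and the anomaly dates in the usage example are made up.

```python
from datetime import date, timedelta


def group_anomalies(anomaly_dates, max_gap_days=7, pad_days=3):
    """Group anomaly dates into padded (period_start, period_end) windows.

    max_gap_days is a hypothetical rule, not taken from the slides: two
    consecutive anomalies further apart than this start a new sub-period.
    pad_days is the 3-day margin added on each side of a sub-period.
    """
    clusters = []
    current = []
    for day in sorted(anomaly_dates):
        # Close the current cluster when the gap to the previous anomaly is too large.
        if current and (day - current[-1]).days > max_gap_days:
            clusters.append(current)
            current = []
        current.append(day)
    if current:
        clusters.append(current)

    pad = timedelta(days=pad_days)
    return [(cluster[0] - pad, cluster[-1] + pad) for cluster in clusters]


# Made-up anomaly dates, for illustration only.
anomalies = [date(2020, 3, 16), date(2020, 5, 15), date(2020, 5, 18)]
for period_start, period_end in group_anomalies(anomalies):
    print(period_start, "to", period_end)
```

A single isolated anomaly yields a 7-day window, while anomalies that fall close together are merged into one longer sub-period.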

Sub-periods of anomalies for the 1st part of the pandemic period

Sub-periods of anomalies for the 2nd part of tweets

12 / 24

3.2 Thematic Model

Structural Topic Models (STM)

  • The goal of the STM is to allow researchers to discover topics and estimate their relationship to document metadata.

  • The following figure presents a heuristic overview of a typical workflow.

  • First, users ingest the data and prepare it for analysis. Next, a structural topic model is estimated. The ability to estimate the structural topic model quickly allows for the evaluation, understanding, and visualization of results [@roberts_stm_2019].

[Figure: heuristic overview of the STM workflow. INGEST (reading and processing text data) → PREPARE (associating text with metadata) → ESTIMATE (estimating the structural topic model) → EVALUATE (model selection and search), UNDERSTAND (interpreting the STM by plotting and inspecting results), and VISUALIZE (presenting STM results). EXTENSIONS: additional tools for interpretation and visualization.]

We estimated an STM to determine the top 10 topics for:

  • part 1 of the tweets (January 1, 2020 to September 30, 2020),

  • part 2 of the tweets (October 1, 2020 to April 21, 2021),

  • each of the 11 sub-periods.

13 / 24

3.2 Thematic Model

Period: 2018-01-01 to 2019-11-30

Period: 2019-12-01 to 2021-04-21

14 / 24

3.2 Thematic Model

Part 1 Period 1 (March 13-19, 2020)

Part 1 Period 1 (March 13-19, 2020). Topic 8 & 9 respectively.

Part 2 Period 6 (April 10-16, 2021)

Part 2 Period 6 (April 10-16, 2021). Topic 7 & 6 respectively.

15 / 24

3.2 Thematic Model

Top 3 topics for each of the 11 sub-periods

Date | Topic 1 | Topic 2 | Topic 3
13-03-2020 to 19-03-2020 | covid, develop, amp | peopl, get, can | trial, coronavirus, test
12-05-2020 to 21-05-2020 | get, peopl, need | covid, coronavirus, research | trump, coronavirus, presid
13-07-2020 to 13-07-2020 | get, peopl, like | amp, need, work | covid, trial, phase
08-08-2020 to 14-08-2020 | get, peopl, like | russia, first, coronavirus | covid, coronavirus, dose
05-09-2020 to 19-09-2020 | trump, say, coronavirus | coronavirus, covid, first | covid, develop, work

Date | Topic 1 | Topic 2 | Topic 3
06-11-2020 to 12-11-2020 | get, peopl, flu | pfizer, covid, effect | news, stock, hope
23-11-2020 to 29-11-2020 | covid, effect, astrazeneca | take, time, now | covid, develop, india
29-11-2020 to 17-12-2020 | peopl, get, need | first, covid, dose | pfizer, approv, covid
21-12-2020 to 04-01-2021 | covid, dose, first | covid, health, state | get, take, like
26-02-2021 to 05-03-2021 | effect, like, just | covid, johnson, dose | covid, get, can
10-04-2021 to 16-04-2021 | peopl, just, make | johnson, amp, paus | dose, receiv, million
16 / 24

3.3 Cosine similarity

  • Document similarity (or distance between documents) is one of the central themes in information retrieval. How do humans usually define how similar documents are? Documents are usually treated as similar if they are semantically close and describe similar concepts [@selivanov_dselivanovtext2vec_2021].

The classical approach from computational linguistics is to measure similarity based on the content overlap between documents. For this, documents are represented as bags of words, so each document is a sparse vector, and the measure of overlap is defined through the angle between the vectors:

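$$\text{similarity}(\text{doc1}, \text{doc2}) = \cos(\theta) = \frac{\text{doc1} \cdot \text{doc2}}{\lvert \text{doc1} \rvert \, \lvert \text{doc2} \rvert}$$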

Cosine distance (dissimilarity) is defined as follows:

$$\text{distance}(\text{doc1}, \text{doc2}) = 1 - \text{similarity}(\text{doc1}, \text{doc2})$$

It is important to note, however, that this is not a proper distance metric in a mathematical sense as it does not have the triangle inequality property and it violates the coincidence axiom.
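
The definitions above can be illustrated with a short, dependency-free sketch. It is not the implementation used in the study (the cited reference is the text2vec package for R); the whitespace tokenizer, the function names, and the example strings are all assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(doc1, doc2):
    """Cosine of the angle between two sparse bag-of-words vectors."""
    # Counter returns 0 for tokens that are missing from doc2.
    dot = sum(count * doc2[token] for token, count in doc1.items())
    norm1 = math.sqrt(sum(c * c for c in doc1.values()))
    norm2 = math.sqrt(sum(c * c for c in doc2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm1 * norm2)


def cosine_distance(doc1, doc2):
    """One minus the cosine similarity."""
    return 1.0 - cosine_similarity(doc1, doc2)


# Invented example texts.
tweet = bag_of_words("vaccine trial results")
article = bag_of_words("vaccine trial results vaccine trial results")
unrelated = bag_of_words("stock market news")

print(cosine_similarity(tweet, article))
print(cosine_similarity(tweet, unrelated))
print(cosine_distance(tweet, article))
```

The first two documents differ (one repeats the other), yet their vectors point in the same direction, so their distance is zero up to floating-point rounding. This is a concrete instance of the coincidence axiom failing.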

We calculated the cosine similarity between the tweets of each of the 11 sub-periods and the New York Times (NYT) articles for the same periods.

First, we collected all the NYT articles using the dates of each sub period (see the following table).

Second, we filtered the collected NYT articles per sub periods to only keep those that contained at least one of the following words:

  • “vaccine” OR “vaccines” OR “vaccinate” OR “vaccination” OR “vaccineswork” OR “antivax” OR “vaccinesdontwork” OR “provax” OR “vaxwithme” OR “antivaxxers” OR “immunization”.
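
The keyword filter can be sketched as below. This is a minimal illustration under stated assumptions: the slides do not say whether matching was case-sensitive or whether substrings counted, so whole-word, case-insensitive matching is a choice made here, and the sample article texts are invented.

```python
import re

KEYWORDS = [
    "vaccine", "vaccines", "vaccinate", "vaccination", "vaccineswork",
    "antivax", "vaccinesdontwork", "provax", "vaxwithme", "antivaxxers",
    "immunization",
]

# Whole-word, case-insensitive match on any keyword (an assumption, see above).
KEYWORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def mentions_vaccination(text):
    """Return True if the text contains at least one of the keywords."""
    return KEYWORD_PATTERN.search(text) is not None


# Invented article snippets, for illustration only.
articles = [
    "A new vaccine candidate enters a late-stage trial.",
    "Stocks rally as investors await earnings.",
    "States expand their immunization programs.",
]
kept = [article for article in articles if mentions_vaccination(article)]
print(len(kept), "of", len(articles), "articles kept")
```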

The following table displays the number of articles remaining after the filtering.

17 / 24

4. Results and Lessons

18 / 24

Results and Lessons

  • During the first 5 periods, the conversations were chronologically about:

    • the fears: are we going to have a vaccine?
    • then on the debate on vaccines
    • then on the economy, the relationship between vaccines and the return to work
    • then on the invention of the vaccine and its political dimension
    • finally on the economy again as a reason to be vaccinated
  • During the last 6 periods, the conversations reflected a developing level of epidemiological expertise:

    • the hope dimension: Pfizer
    • then AstraZeneca and a developing country dimension
    • then Pfizer to be prioritized over AstraZeneca
    • then on the importance of getting vaccinated anyway
    • then a conversation about vaccines again when the vaccination campaigns started
    • the question of which vaccine is better.
19 / 24

Results and Lessons

  1. Lessons for research:

    • proof of concept for the quantitative measurement of conversation dynamics on a topic like SARS-CoV-2
    • reflections on how to use these conversation dynamics on social networks as evidence
    • multidisciplinary: linguistics, communications, economics, public policy
  2. Lessons for public policy in particular:

    • reflections on using these methods for transforming social network data to make public policy communications more relevant
    • national and temporal measures: the possibility of building a national barometer to track the dynamics of conversations on a particular public policy topic.
20 / 24

5. Conclusion

21 / 24

Conclusion

RQ1: Can we identify phases in conversations on COVID-19 on social media?

RQ2: Do the conversations follow traditional media information?

  • No evidence of high similarity between traditional media and social media on these conversations

  • Some room for optimism:

    • the horizontal conversations have become more educated about the pandemic, though political debates could not be avoided at some point => COVID-19 fatigue?
22 / 24

Conclusion

  • Unstructured data from two sources: Twitter and New York Times

  • NLP: a protocol for natural language analysis

  • Nowcasting for public policy communications

23 / 24

Thanks!

24 / 24
