Text Mining with R: The case of Tweets

Based on the methodology of the book “Text Mining with R”, you will learn to do text mining on tweets.

Thierry Warin https://warin.ca/aboutme.html (HEC Montréal and CIRANO (Canada))https://www.hec.ca/en/profs/thierry.warin.html

Course objectives

Silge and Robinson (2016) developed the tidytext R package to use tidy data principles to make many text mining tasks easier, more effective, and consistent with tools already in wide use. They wrote a book that serves as an introduction of text mining using the tidytext package and other tidy tools in R. To understand the methodology developed by Silge and Robinson (2016), this course will teach you how to use it to analyse a sample of tweets on AI.

Course plan

1 The tidy text format

Familiarize yourself with the tidy text format. Convert your tweets in tibble, break it into individual tokens with the process of tokenization and then find the most common words of the tweets.

2 Sentiment analysis with tidy data

Evaluate the opinion or emotion of the tweets by using sentiment lexicons. Find the most common positive and negative words and visualize this output in a wordcloud.

3 Analyzing word and document frequency: tf-idf

Quantify what the tweets are about by: (1) measuring the frequency of a word i.e. term frequency (tf), (2) evaluating the relationship between the frequency of a word and its rank i.e. Zipf’s law and (3) decreasing the weight for commonly used words and increasing the weight for words that are not used very much in tweets i.e. the idea of tf-idf.

4 Relationships between words: n-grams and correlations

Tokenize into consecutive sequences of words, called n-grams, to see how often word X is followed by word Y. Count and correlate pairs of words to find words that tend to co-occur within the tweets.

5 Converting to and from non-tidy formats

Tidy a document-term matrix and convert tweets into a matrix.

6 Topic modeling

Use the function LDA() to find the mixture of words that is associated with each topic and determine the mixture of topics that describes the tweets.

7 Case study: comparing Twitter archives

Calculate the word frequencies, compare word usage and changes in word use. Explore which words are more likely to be retweeted and favorited (liked).



For attribution, please cite this work as

Warin (2019, Dec. 6). Thierry Warin: Text Mining with R: The case of Tweets. Retrieved from https://warin.ca/posts/textminingwithr/

BibTeX citation

  author = {Warin, Thierry},
  title = {Thierry Warin: Text Mining with R: The case of Tweets},
  url = {https://warin.ca/posts/textminingwithr/},
  year = {2019}