## Preface

The methodology used in this course is based on the book Text Mining with R by Silge and Robinson (2019). This book serves as an introduction of text mining using the tidytext package and other tidy tools in R.

To provide an example of the use of Silge and Robinson (2019)'s methodology, data were downloaded via Twitter. Three keywords were used to produced three datasets: one containing the tweets with the keyword 'artificial intelligence', another containing the tweets with the keyword 'artificialintelligence' and the last one containing the tweets with the keyword 'AI'.

These are the three datasets of Twitter used for this course. The one called tweetsArtificial_Intelligence.csv includes data with the keyword 'artificial intelligence'. The second one called tweetsArtificialIntelligence.csv includes data with the keyword 'artificialintelligence'. The third one called tweetsAI.csv includes data with the keyword 'AI'.

tweetsArtificial_Intelligence <-read.csv("tweetsArtificial_Intelligence.csv")
tweetsAI <-read.csv("tweetsAI.csv")

To use the Silge and Robinson (2019)'s methodology, a cleaning is needed on the datasets. First, let's bind the three data frames into one using bind_rows(). Then, four columns have to be removed using select() to keep unique tweets because the same tweet can be in the three data frames. As the counts of likes or retweets can change between two downloads, the columns containing this information is deleted.

library(dplyr)

tweets <- bind_rows(tweetsArtificialIntelligence, tweetsArtificial_Intelligence)
tweets <- bind_rows(tweets, tweetsAI)

tweets <- select(tweets, -replies_count, -retweets_count, -likes_count, -place)

tweets <- unique(tweets)

The second part of the cleaning is to remove retweet entities, @people, the punctuation, numbers, html links and pictwitter elements.

# remove retweet entities
tweets$tweet <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets$tweet)

# remove at people
tweets$tweet <- gsub("@\\w+", "", tweets$tweet)

# remove punctuation
tweets$tweet <- gsub("[[:punct:]]", "", tweets$tweet)

# remove numbers
tweets$tweet <- gsub("[[:digit:]]", "", tweets$tweet)

# remove html links
tweets$tweet <- gsub("http\\w+", "", tweets$tweet)

# remove all pictwitter
tweets$tweet <- gsub("pictwitter\\w+ *", "", tweets$tweet)

## 1 The tidy text format

### The unnest_tokens function

Let's only keep the column tweet containing the text and put it in a vector called text as in the book.

text <- tweets$tweet head(text) To analyse the character vector, we first need to turn it into a tidy text dataset. To do so, let's put it into a tibble data frame. As described in the book, a tibble is a modern class of data frame within R, available in the dplyr and tibble packages, that has a convenient print method, will not convert strings to factors, and does not use row names. Tibbles are great for use with tidy tools. text_df <- tibble(line = 1:length(text), text = text) text_df To break the text into individual tokens (one-token-per-row format) and transform it to a tidy data structure, we use the tidytext’s unnest_tokens() function. library(tidytext) tidy_tweets <- text_df %>% unnest_tokens(word, text) tidy_tweets  ### Tidying the tweets of AI Now that the data is in one-word-per-row format, we can manipulate it with tidy tools like dplyr. To remove stop words (common words such as “the”, “of”, “to”) let's use the anti_join() function. data(stop_words) tidy_tweets <- tidy_tweets %>% anti_join(stop_words) ### Word frequencies A common task in text mining is to look at word frequencies. We can use dplyr’s count() to find the most common words in all tweets. tidy_tweets %>% count(word, sort = TRUE)  Because of the use of tidy tools, the word counts are stored in a tidy data frame. This allows us to pipe this directly to the ggplot2 package, for example to create a visualization of the most common words. library(ggplot2) tidy_tweets %>% count(word, sort = TRUE) %>% filter(n > 600) %>% mutate(word = reorder(word, n)) %>% ggplot(aes(word, n)) + geom_col() + xlab(NULL) + coord_flip() + theme_light() ## 2 Sentiment analysis with tidy data The topic of opinion mining or sentiment analysis is addressed in this section. We can use the tools of text mining to approach the emotional content of text programmatically. ### Sentiment analysis with inner join Now that the text is in a tidy format with one word per row, we are ready to do the sentiment analysis. First, let’s use the NRC lexicon and filter() for the joy words. Next, let’s use inner_join() to perform the sentiment analysis. Finally, let’s use count() to count the most common joy words. nrc_joy <- get_sentiments("nrc") %>% filter(sentiment == "joy") tidy_tweets %>% inner_join(nrc_joy) %>% count(word, sort = TRUE) Next, we count up how many positive and negative words there are in the tweets. We define an index here to keep track of where we are in the narrative; this index (using integer division) counts up sections of 100 tweets. As the keywords for downloading the tweets were 'artificialintelligence', 'artificial intelligence' and 'AI', we need to extract them from the dataset to produce the following plot. library(tidyr) tweets_sentiment <- tidy_tweets %>% filter(!word %in% c("artificialintelligence", "intelligence", "artificial", "ai")) %>% inner_join(get_sentiments("bing")) %>% count(index = line %/% 100, sentiment) %>% spread(sentiment, n, fill = 0) %>% mutate(sentiment = positive - negative) library(ggplot2) ggplot(tweets_sentiment, aes(index, sentiment)) + geom_col(show.legend = FALSE) + theme_light() ### Comparing the three sentiment dictionaries With several options for sentiment lexicons, you might want some more information on which one is appropriate for your purposes. Let’s use all three sentiment lexicons and examine how the sentiment changes across the tweets. tidy_tweets Let’s again use integer division (%/%) to define larger sections of text that span multiple lines, and we can use the same pattern with count(), spread(), and mutate() to find the net sentiment in each of these sections of text. afinn <- tidy_tweets %>% filter(!word %in% c("artificialintelligence", "intelligence", "artificial", 'ai')) %>% inner_join(get_sentiments("afinn")) %>% group_by(index = line %/% 100) %>% summarise(sentiment = sum(value)) %>% mutate(method = "AFINN") bing_and_nrc <- bind_rows(tidy_tweets %>% filter(!word %in% c("artificialintelligence", "intelligence", "artificial", "ai")) %>% inner_join(get_sentiments("bing")) %>% mutate(method = "Bing et al."), tidy_tweets %>% filter(!word %in% c("artificialintelligence", "intelligence", "artificial", "ai")) %>% inner_join(get_sentiments("nrc") %>% filter(sentiment %in% c("positive", "negative"))) %>% mutate(method = "NRC")) %>% count(method, index = line %/% 100, sentiment) %>% spread(sentiment, n, fill = 0) %>% mutate(sentiment = positive - negative) We now have an estimate of the net sentiment (positive - negative) in each chunk of the novel text for each sentiment lexicon. Let’s bind them together and visualize them. bind_rows(afinn, bing_and_nrc) %>% ggplot(aes(index, sentiment, fill = method)) + geom_col(show.legend = FALSE) + facet_wrap(~method, ncol = 1, scales = "free_y") + theme_light() ### Most common positive and negative words One advantage of having the data frame with both sentiment and word is that we can analyze word counts that contribute to each sentiment. By implementing count() here with arguments of both word and sentiment, we find out how much each word contributed to each sentiment. bing_word_counts <- tidy_tweets %>% inner_join(get_sentiments("bing")) %>% count(word, sentiment, sort = TRUE) %>% ungroup() bing_word_counts This can be shown visually, and we can pipe straight into ggplot2, if we like, because of the way we are consistently using tools built for handling tidy data frames. bing_word_counts %>% group_by(sentiment) %>% top_n(10) %>% ungroup() %>% mutate(word = reorder(word, n)) %>% ggplot(aes(word, n, fill = sentiment)) + geom_col(show.legend = FALSE) + facet_wrap(~sentiment, scales = "free_y") + labs(y = "Contribution to sentiment", x = NULL) + coord_flip() + theme_light() The figure above lets us spot an anomaly in the sentiment analysis; the word "cloud" is coded as negative but it is used as a computing term to describe data centers available over the Internet. We can also observe that the world "intelligence" is coded as a positive word, which is correct, but it was one of the keyword used to download the data, so it has to be removed as it can biaise the analysis. For that purpose, we could easily add "cloud" and "intelligence" to a custom stop-words list using bind_rows(). We could implement that with a strategy such as this. custom_stop_words_sentiment <- bind_rows(tibble(word = c("cloud", "intelligence", "artificial", "artificialintelligence", "ai"), lexicon = c("custom")), stop_words) custom_stop_words_sentiment bing_word_counts_custom <- bing_word_counts %>% anti_join(custom_stop_words_sentiment) bing_word_counts_custom bing_word_counts_custom %>% group_by(sentiment) %>% top_n(10) %>% ungroup() %>% mutate(word = reorder(word, n)) %>% ggplot(aes(word, n, fill = sentiment)) + geom_col(show.legend = FALSE) + facet_wrap(~sentiment, scales = "free_y") + labs(y = "Contribution to sentiment", x = NULL) + coord_flip() + theme_light() ### Wordclouds We’ve seen that this tidy text mining approach works well with ggplot2, but having our data in a tidy format is useful for other plots as well. For example, consider the wordcloud package, which uses base R graphics. Let’s look at the most common words in the tweets as a whole again, but this time as a wordcloud. We need to produce a new custom stop word without extracting the word "cloud" because it is an important word but not for sentiment analysis as it is counted as a negative word. custom_stop_words <- bind_rows(tibble(word = c("intelligence", "artificial", "artificialintelligence", "ai"), lexicon = c("custom")), stop_words) custom_stop_words library(wordcloud) tidy_tweets %>% anti_join(custom_stop_words) %>% count(word) %>% with(wordcloud(word, n, max.words = 150)) In other functions, such as comparison.cloud(), you may need to turn the data frame into a matrix with reshape2’s acast(). Let’s do the sentiment analysis to tag positive and negative words using an inner join, then find the most common positive and negative words. Until the step where we need to send the data to comparison.cloud(), this can all be done with joins, piping, and dplyr because our data is in tidy format. library(reshape2) tidy_tweets %>% anti_join(custom_stop_words_sentiment) %>% inner_join(get_sentiments("bing")) %>% count(word, sentiment, sort = TRUE) %>% acast(word ~ sentiment, value.var = "n", fill = 0) %>% comparison.cloud(colors = c("gray20", "gray80"), max.words = 150) The size of a word’s text is in proportion to its frequency within its sentiment. We can use this visualization to see the most important positive and negative words, but the sizes of the words are not comparable across sentiments. ### Looking at units beyond just words We can use tidy text analysis to found out the most negative tweets of the datasets. First, let’s get the list of negative words from the Bing lexicon. Second, let’s make a data frame of how many words are in each tweet so we can normalize for the length of tweets. Then, let’s find the number of negative words in each tweet and divide by the total words in each tweets. bingnegative <- get_sentiments("bing") %>% filter(sentiment == "negative") wordcounts <- tidy_tweets %>% group_by(line) %>% summarize(words = n()) tidy_tweets %>% semi_join(bingnegative) %>% group_by(line) %>% summarize(negativewords = n()) %>% left_join(wordcounts, by = "line") %>% mutate(ratio = negativewords/words) %>% ungroup() ## 3 Analyzing word and document frequency In this section of the book, the code is used on Jane Austen's novels. Indeed, each calculation is based on the different books. We do not have books, but we do have tweets. To continue this course, we merge the three datasets as before, but we add a column called keyword containing the keyword of each one. Instead of calculating the frequency per book, we are going to calculate the frequency per keyword. tweets <- bind_rows(tweetsArtificial_Intelligence %>% mutate(keyword = "Artificial_Intelligence"), tweetsArtificialIntelligence %>% mutate(keyword = "ArtificialIntelligence"), tweetsAI %>% mutate(keyword = "AI")) tweets # remove retweet entities tweets$tweet <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets$tweet) # remove at people tweets$tweet <- gsub("@\\w+", "", tweets$tweet) # remove punctuation tweets$tweet <- gsub("[[:punct:]]", "", tweets$tweet) # remove numbers tweets$tweet <- gsub("[[:digit:]]", "", tweets$tweet) # remove html links tweets$tweet <- gsub("http\\w+", "", tweets$tweet) # remove all pictwitter tweets$tweet <- gsub("pictwitter\\w+ *", "", tweets$tweet) text <- select(tweets, tweet, keyword) text_df <- as_tibble(text) text_df ### Term frequency in AI tweets Let’s start by examining first term frequency, then tf-idf. We can start just by using dplyr verbs such as group_by() and join(). Let’s calculate the most commonly used words and also the total words in each tweet, for later use library(dplyr) library(tidytext) tweets_words <- text_df %>% unnest_tokens(word, tweet) %>% count(keyword, word, sort = TRUE) total_words <- tweets_words %>% group_by(keyword) %>% summarize(total = sum(n)) tweets_words <- left_join(tweets_words, total_words) tweets_words There is one row in this tweetsAI_words data frame for each word-tweet combination; n is the number of times that word is used in that tweet and total is the total words in that tweet. The usual suspects are here with the highest n, “the”, “and”, “to”, and so forth. In the following figure, let’s look at the distribution of n/total for each tweet, the number of times a word appears in a tweet divided by the total number of terms (words) in that tweet. This is exactly what term frequency is. library(ggplot2) ggplot(tweets_words, aes(n/total, fill = keyword)) + geom_histogram(show.legend = FALSE) + xlim(NA, 0.002) + facet_wrap(~keyword, ncol = 1, scales = "free_y") + theme_light() ### Zipf’s law Since we have the data frame we used to plot term frequency, we can examine Zipf’s law for the AI tweets with just a few lines of dplyr functions. freq_by_rank <- tweets_words %>% group_by(keyword) %>% mutate(rank = row_number(), term frequency = n/total) freq_by_rank The rank column here tells us the rank of each word within the frequency table; the table was already ordered by n so we could use row_number() to find the rank. Then, we can calculate the term frequency in the same way we did before. Let's visualise the Zipf’s law by plotting rank on the x-axis and term frequency on the y-axis, on logarithmic scales. freq_by_rank %>% ggplot(aes(rank, term frequency, color = keyword)) + geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) + scale_x_log10() + scale_y_log10() + theme_light() Notice that this figure is in log-log coordinates. We see that all three keywords are quite similar to each other, and that the relationship between rank and frequency does have negative slope. Let’s see what the exponent of the power law is for the top of the rank range. rank_subset <- freq_by_rank %>% filter(rank < 500, rank > 1) lm(log10(term frequency) ~ log10(rank), data = rank_subset) The slope is close to -1 here. freq_by_rank %>% ggplot(aes(rank, term frequency, color = keyword)) + geom_abline(intercept = -0.62, slope = -1.1, color = "gray50", linetype = 2) + geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) + scale_x_log10() + scale_y_log10() + theme_light() The deviations we see here at high rank are not uncommon for many kinds of language; a corpus of language often contains fewer rare words than predicted by a single power law. ### The bind_tf_idf function Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common. Let’s do that now. The bind_tf_idf function in the tidytext package takes a tidy text dataset as input with one row per token (term), per document. One column (word here) contains the terms/tokens, one column contains the documents (keyword in this case), and the last necessary column contains the counts, how many times each document contains each term (n in this example). We calculated a total for each keyword for our explorations in previous sections, but it is not necessary for the bind_tf_idf function; the table only needs to contain all the words in each document. tweets_words <- tweets_words %>% bind_tf_idf(word, keyword, n) tweets_words Notice that idf and thus tf-idf are zero for these extremely common words. Let’s look at terms with high tf-idf in Jane Austen’s works. tweets_words %>% select(-total) %>% arrange(desc(tf_idf)) tweets_words %>% arrange(desc(tf_idf)) %>% mutate(word = factor(word, levels = rev(unique(word)))) %>% group_by(keyword) %>% top_n(5) %>% ungroup() %>% ggplot(aes(word, tf_idf, fill = keyword)) + geom_col(show.legend = FALSE) + labs(x = NULL, y = "tf-idf") + facet_wrap(~keyword, ncol = 1, scales = "free") + coord_flip() + theme_light() ## 4 Relationships between words ### Tokenizing by n-gram We’ve been using the unnest_tokens function to tokenize by word, or sometimes by sentence, which is useful for the kinds of sentiment and frequency analyses we’ve been doing so far. But we can also use the function to tokenize into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we can then build a model of the relationships between them. We do this by adding the token = "ngrams" option to unnest_tokens(), and setting n to the number of words we wish to capture in each n-gram. When we set n to 2, we are examining pairs of two consecutive words, often called "bigrams": library(dplyr) library(tidytext) tweets_bigrams <- text_df %>% unnest_tokens(bigram, tweet, token = "ngrams", n = 2) tweets_bigrams #### Counting and filtering n-grams Our usual tidy tools apply equally well to n-gram analysis. We can examine the most common bigrams using dplyr’s count(): tweets_bigrams %>% count(bigram, sort = TRUE) As one might expect, a lot of the most common bigrams are pairs of our keywords, such as of artificial and intelligence: what we call our custom "stop-words". This is a useful time to use tidyr’s separate(), which splits a column into multiple based on a delimiter. This lets us separate it into two columns, "word1" and "word2", at which point we can remove cases where either is a custom stop-word. library(tidyr) bigrams_separated <- tweets_bigrams %>% separate(bigram, c("word1", "word2"), sep = " ") bigrams_filtered <- bigrams_separated %>% filter(!word1 %in% custom_stop_words$word) %>%
filter(!word2 %in% custom_stop_words$word) # new bigram counts: bigram_counts <- bigrams_filtered %>% count(word1, word2, sort = TRUE) bigram_counts In other analyses, we may want to work with the recombined words. tidyr’s unite() function is the inverse of separate(), and lets us recombine the columns into one. Thus, "separate/filter/count/unite" let us find the most common bigrams not containing our custom stop-words. bigrams_united <- bigrams_filtered %>% unite(bigram, word1, word2, sep = " ") bigrams_united In other analyses you may be interested in the most common trigrams, which are consecutive sequences of 3 words. We can find this by setting n = 3: text_df %>% unnest_tokens(trigram, tweet, token = "ngrams", n = 3) %>% separate(trigram, c("word1", "word2", "word3"), sep = " ") %>% filter(!word1 %in% custom_stop_words$word,
!word2 %in% custom_stop_words$word, !word3 %in% custom_stop_words$word) %>%
count(word1, word2, word3, sort = TRUE)

#### Analyzing bigrams

This one-bigram-per-row format is helpful for exploratory analyses of the text. As a simple example, we might be interested in the most common keyword "datascience" mentioned in each keyword:

bigrams_filtered %>%
filter(word2 == "datascience") %>%
count(keyword, word1, sort = TRUE)

A bigram can also be treated as a term in a document in the same way that we treated individual words. For example, we can look at the tf-idf of bigrams across the three keywords. These tf-idf values can be visualized within each keyword, just as we did for words.

bigram_tf_idf <- bigrams_united %>%
count(keyword, bigram) %>%
bind_tf_idf(bigram, keyword, n) %>%
arrange(desc(tf_idf))

bigram_tf_idf
bigram_tf_idf %>%
arrange(desc(tf_idf)) %>%
mutate(bigram = factor(bigram, levels = rev(unique(bigram)))) %>%
group_by(keyword) %>%
top_n(5) %>%
ungroup() %>%
ggplot(aes(bigram, tf_idf, fill = keyword)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf of bigram to keyword") +
facet_wrap(~keyword, ncol = 2, scales = "free") +
coord_flip() +
theme_light()

#### Using bigrams to provide context in sentiment analysis

Now that we have the data organized into bigrams, it’s easy to tell how often words are preceded by a word like "not":

bigrams_separated %>%
filter(word1 == "not") %>%
count(word1, word2, sort = TRUE)

By performing sentiment analysis on the bigram data, we can examine how often sentiment-associated words are preceded by "not" or other negating words. We could use this to ignore or even reverse their contribution to the sentiment score.

Let’s use the AFINN lexicon for sentiment analysis, which you may recall gives a numeric sentiment value for each word, with positive or negative numbers indicating the direction of the sentiment.

AFINN <- get_sentiments("afinn")

AFINN

We can then examine the most frequent words that were preceded by "not" and were associated with a sentiment.

not_words <- bigrams_separated %>%
filter(word1 == "not") %>%
inner_join(AFINN, by = c(word2 = "word")) %>%
count(word2, value, sort = TRUE)

not_words

For example, the most common sentiment-associated word to follow "not" was "bad", which have a negative score of 3.

It’s worth asking which words contributed the most in the "wrong" direction. To compute that, we can multiply their value by the number of times they appear (so that a word with a value of +3 occurring 10 times has as much impact as a word with a sentiment value of +1 occurring 30 times). We visualize the result with a bar plot.

library(ggplot2)

not_words %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(word2, n * value, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
xlab("Words preceded by \"not\"") +
ylab("Sentiment value * number of occurrences") +
coord_flip() +
theme_light()

The bigrams "not bad", "not hate" and "not kill" were overwhelmingly the largest causes of misidentification, making the text seem much more negative than it is. But we can see phrases like “not win” and “not popular” sometimes suggest text is more positive than it is.

"Not" isn’t the only term that provides some context for the following word. We could pick four common words (or more) that negate the subsequent term, and use the same joining and counting approach to examine all of them at once.

negation_words <- c("not", "no", "without")

negated_words <- bigrams_separated %>%
filter(word1 %in% negation_words) %>%
inner_join(AFINN, by = c(word2 = "word")) %>%
count(word1, word2, value, sort = TRUE)

We could then visualize what the most common words to follow each particular negation are.

negated_words %>%
mutate(contribution = n * value) %>%
arrange(desc(contribution)) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(word2, contribution, fill = contribution > 0)) +
geom_col(show.legend = FALSE) +
xlab("Words preceded by \"not\"") +
ylab("Sentiment value * number of occurrences") +
coord_flip() +
facet_wrap(~word1, ncol = 2, scales = "free") +
theme_light()

While "not bad", "not hate" and "not kill" are still the two most common examples, we can also see pairings such as "no doubt" and "without bias".

#### Visualizing a network of bigrams with ggraph

We may be interested in visualizing all of the relationships among words simultaneously, rather than just the top few at a time. As one common visualization, we can arrange the words into a network, or "graph".

The igraph package has many powerful functions for manipulating and analyzing networks. One way to create an igraph object from tidy data is the graph_from_data_frame() function, which takes a data frame of edges with columns for "from", "to", and edge attributes (in this case n):

library(igraph)

# original counts
bigram_counts

# filter for only relatively common combinations
bigram_graph <- bigram_counts %>%
filter(n > 20) %>%
graph_from_data_frame()

bigram_graph

We can convert an igraph object into a ggraph with the ggraph function, after which we add layers to it, much like layers are added in ggplot2. For example, for a basic graph we need to add three layers: nodes, edges, and text.

library(ggraph)
set.seed(2017)

ggraph(bigram_graph, layout = "fr") +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)

We can add a few polishing options to make a better looking graph.

set.seed(2016)

a <- grid::arrow(type = "closed", length = unit(.05, "inches"))

ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.05, 'inches')) +
geom_node_point(color = "lightblue", size = 2) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()

### Counting and correlating pairs of words with the widyr package

We’ll examine some of the ways tidy text can be turned into a wide matrix, but in this case it isn’t necessary.

#### Counting and correlating among sections

Consider the keyword "Artificial_Intelligence" divided into 10-line sections, as we did (with larger sections) for sentiment analysis in Chapter 2. We may be interested in what words tend to appear within the same section.

tweets_keyword_words <- text_df %>%
filter(keyword == "ArtificialIntelligence") %>%
mutate(section = row_number() %/% 10) %>%
filter(section > 0) %>%
unnest_tokens(word, tweet) %>%
!word %in% str_remove_all(custom_stop_words$word, "'"), str_detect(word, "[a-z]"))  Now we can calculate word frequencies for each keyword. First, we group by keyword and count how many times each word is used in each keyword. Then we use left_join() to add a column of the total number of words used in each keyword. (This is higher for the keyword 'ArtificialIntelligence' since its the most used keyword.) Finally, we calculate a frequency for each keyword and word. frequency <- tidy_tweets %>% group_by(keyword) %>% count(word, sort = TRUE) %>% left_join(tidy_tweets %>% group_by(keyword) %>% summarise(total = n())) %>% mutate(freq = n/total) frequency This is a nice and tidy data frame but we would actually like to plot those frequencies on the x- and y-axes of a plot, so we will need to use spread() from tidyr make a differently shaped data frame. library(tidyr) frequency <- frequency %>% select(keyword, word, freq) %>% spread(keyword, freq) %>% arrange(Artificial_Intelligence, ArtificialIntelligence, AI) frequency Now this is ready for us to plot. Let’s use geom_jitter() so that we don’t see the discreteness at the low end of frequency as much, and check_overlap = TRUE so the text labels don’t all print out on top of each other (only some will print). library(scales) ggplot(frequency, aes(Artificial_Intelligence, ArtificialIntelligence)) + geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) + geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) + scale_x_log10(labels = percent_format()) + scale_y_log10(labels = percent_format()) + geom_abline(color = "red") ### Comparing word usage We just made a plot comparing raw word frequencies over our tweets on AI; now let’s find which words are more or less likely to come from each keyword using the log odds ratio. First, let’s restrict the analysis moving forward to tweets from only 24 hours. tidy_tweets <- tidy_tweets %>% filter(date == "2019-11-19") tidy_tweets$date <- as.Date(tidy_tweets$date, format = "%Y-%m-%d") Next, let’s use str_detect() to remove Twitter usernames from the word column. After removing these, we count how many times each person uses each word and keep only the words used more than 10 times. After a spread() operation, we can calculate the log odds ratio for each word. word_ratios <- tidy_tweets %>% filter(!str_detect(word, "^@")) %>% count(word, keyword) %>% group_by(word) %>% filter(sum(n) >= 10) %>% ungroup() %>% spread(keyword, n, fill = 0) %>% mutate_if(is.numeric, list(~(. + 1) / (sum(.) + 1))) %>% mutate(logratio = log(Artificial_Intelligence / ArtificialIntelligence)) %>% arrange(desc(logratio)) What are some words that have been about equally likely to come from each keywords during the period selected? word_ratios %>% arrange(abs(logratio)) word_ratios %>% group_by(logratio < 0) %>% top_n(10, abs(logratio)) %>% ungroup() %>% mutate(word = reorder(word, logratio)) %>% ggplot(aes(word, logratio, fill = logratio < 0)) + geom_col(show.legend = FALSE) + coord_flip() + ylab("log odds ratio (Artificial_Intelligence/ArtificialIntelligence)") + scale_fill_discrete(name = "", labels = c("Artificial_Intelligence", "ArtificialIntelligence")) + theme_light() So the tweets with the keyword 'artificial intelligence' is used when people talk about text analytics, datavisualization, analyses, while the tweets with the keyword 'artificialintelligence' are used when people talk about iot, artificial intelligence applications (such as machine learning, deep learning, neural networks) and jobs. ### Changes in word use Which words have been tweeted at a higher or lower rate over time? To do this, we will define a new time variable in the data frame that defines which unit of time each tweet was posted in. We can use floor_date() from lubridate to do this, with a unit of our choosing; using day seems to work well. After we have the time bins defined, we count how many times each word in each time bin was used. After that, we add columns to the data frame for the total number of words used in each time bin and the total number of times each word was used. We can then filter() to only keep words used at least some minimum number of times (30, in this case). library(lubridate) words_by_time <- tidy_tweets %>% unite("time", hour, minute, second, sep = ":") %>% unite("timestamp", date, time, sep = " ") words_by_time$timestamp <- ymd_hms(words_by_time\$timestamp)

words_by_time <- words_by_time %>%
filter(!str_detect(word, "^@")) %>%
mutate(time_floor = floor_date(timestamp, unit = "day")) %>%
count(time_floor, keyword, word) %>%
group_by(keyword, time_floor) %>%
mutate(time_total = sum(n)) %>%
group_by(keyword, word) %>%
mutate(word_total = sum(n)) %>%
ungroup() %>%
rename(count = n) %>%
filter(word_total > 30)

words_by_time

Each row in this data frame corresponds to one keyword containing one word in a given time bin. The count column tells us how many times that keyword contains that word in that time bin, the time_total column tells us how many words that keyword contains during that time bin, and the word_total column tells us how many times that keyword contains that word over the whole day. This is the data set we can use for modeling.

We can use nest() from tidyr to make a data frame with a list column that contains little miniature data frames for each word. Let’s do that now and take a look at the resulting structure.

nested_data <- words_by_time %>%
nest(-word, -keyword)

nested_data

This data frame has one row for each keyword-word combination; the data column is a list column that contains data frames, one for each combination of keyword and word. Let’s use map() from purrr (Henry and Wickham, 2018) to apply our modeling procedure to each of those little data frames inside our big data frame. This is count data so let’s use glm() with family = "binomial" for modeling.

library(purrr)

nested_models <- nested_data %>%
mutate(models = map(data, ~ glm(cbind(count, time_total) ~ time_floor, .,
family = "binomial")))

nested_models

Now notice that we have a new column for the modeling results; it is another list column and contains glm objects. The next step is to use map() and tidy() from the broom package to pull out the slopes for each of these models and find the important ones. We are comparing many slopes here and some of them are not statistically significant, so let’s apply an adjustment to the p-values for multiple comparisons.

library(broom)

slopes <- nested_models %>%
mutate(models = map(models, tidy)) %>%
unnest(cols = c(models)) %>%
mutate(adjusted.p.value = p.adjust(p.value))

Now let’s find the most important slopes. Which words have changed in frequency at a moderately significant level in our tweets?

top_slopes <- slopes %>%

top_slopes

To visualize our results, we can plot these words’ use for the keyword 'ArtificialIntelligence'.

words_by_time %>%
inner_join(top_slopes, by = c("word", "keyword")) %>%
filter(keyword == "ArtificialIntelligence") %>%
mutate(count_time = count/time_total) %>%
top_n(10, count_time) %>%
ggplot(aes(time_floor, count_time, color = word)) +
geom_point(size = 3) +
labs(x = NULL, y = "Word frequency") +
theme_light()

### Favorites and retweets

To start with, let’s look at the number of times each of our tweets was retweeted. Let’s find the total number of retweets for each keyword.

totals <- tidy_tweets %>%
group_by(keyword, id) %>%
summarise(rts = first(retweets_count)) %>%
group_by(keyword) %>%
summarise(total_rts = sum(rts))

totals

Now let’s find the median number of retweets for each word and keyword. We probably want to count each tweet/word combination only once, so we will use group_by() and summarise() twice, one right after the other. The first summarise() statement counts how many times each word was retweeted, for each tweet and keyword. In the second summarise() statement, we can find the median retweets for each keyword and word, also count the number of times each word was used for each keyword and keep that in uses. Next, we can join this to the data frame of retweet totals. Let’s filter() to only keep words mentioned at least 5 times.

word_by_rts <- tidy_tweets %>%
group_by(id, word, keyword) %>%
summarise(rts = first(retweets_count)) %>%
group_by(keyword, word) %>%
summarise(retweets = median(rts), uses = n()) %>%
left_join(totals) %>%
filter(retweets != 0) %>%
ungroup()

word_by_rts %>%
filter(uses >= 5) %>%
arrange(desc(retweets))

At the top of this sorted data frame, we see that the biggest retweet concerns the word secretive for the keyword 'artificial intelligence'. Let’s plot the words that have the highest median retweets for each of keyword.

word_by_rts %>%
filter(uses >= 5) %>%
group_by(keyword) %>%
top_n(5, retweets) %>%
arrange(retweets) %>%
ungroup() %>%
mutate(word = factor(word, unique(word))) %>%
ungroup() %>%
ggplot(aes(word, retweets, fill = keyword)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ keyword, scales = "free", ncol = 2) +
coord_flip() +
labs(x = NULL,
y = "Median # of retweets for tweets containing each word") +
theme_light()

We can follow a similar procedure to see which words led to more favorites. Are they different than the words that lead to more retweets?

totals <- tidy_tweets %>%
group_by(keyword, id) %>%
summarise(favs = first(likes_count)) %>%
group_by(keyword) %>%
summarise(total_favs = sum(favs))

word_by_favs <- tidy_tweets %>%
group_by(id, word, keyword) %>%
summarise(favs = first(likes_count)) %>%
group_by(keyword, word) %>%
summarise(favorites = median(favs), uses = n()) %>%
left_join(totals) %>%
filter(favorites != 0) %>%
ungroup()

We have built the data frames we need. Now let’s make our visualization in the following figure.

word_by_favs %>%
filter(uses >= 5) %>%
group_by(keyword) %>%
top_n(5, favorites) %>%
arrange(favorites) %>%
ungroup() %>%
mutate(word = factor(word, unique(word))) %>%
ungroup() %>%
ggplot(aes(word, favorites, fill = keyword)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ keyword, scales = "free", ncol = 2) +
coord_flip() +
labs(x = NULL,
y = "Median # of likes for tweets containing each word") +
theme_light()

## References

Silge, Julia, and David Robinson. 2019. Text Mining with R. https://www.tidytextmining.com/

## Acknowledgments

To cite this course:

Warin, Thierry. 2020. “SKEMA Quantum Studio: R Courses.” doi:10.6084/m9.figshare.11744013.v1.