9  Structuring Unstructured Text Data in R

Unstructured text data refers to textual information that lacks a predefined format or data schema, making it challenging to process with traditional structured-data tools (Doan et al., 2009). Examples of unstructured data include free-form text like emails, social media posts, news articles, and academic papers, in contrast to tabular data in databases. Such unstructured text constitutes the majority of information available – estimates suggest it may account for as much as 80–95% of all data (Gandomi & Haider, 2015). However, because it does not adhere to rigid schemas, unstructured text is difficult to store, process, and integrate with conventional data processing systems. Extracting meaningful insights from raw text requires converting it into a structured form (e.g. numerical features or categorical labels) through the techniques of natural language processing (NLP) and text mining. This chapter provides a detailed overview of structuring unstructured text data in R, covering theoretical foundations of text preprocessing and practical R-based workflows for analysis. We will discuss fundamental preprocessing steps (tokenization, normalization, stemming, lemmatization, stopword removal), methods to represent text as structured features (document-term matrices and TF–IDF weighting), the tidytext framework for tidy NLP, and techniques for higher-level analysis such as named entity recognition, topic modeling, sentiment analysis, and text classification. In addition, we explore integrating advanced language models like BERT into R analyses using packages that interface with state-of-the-art transformers. Real-world examples and datasets are highlighted throughout, illustrating how these methods are applied in practice. The goal is to equip readers with both the conceptual understanding and practical tools to transform raw text into structured data amenable to statistical analysis and machine learning, all within the R ecosystem.

9.1 Unstructured Text Data and Its Challenges

Textual data is inherently unstructured – it consists of sequences of characters (words, sentences, paragraphs) without explicit numeric indices or categorical keys. Unlike a structured table where each column has a fixed meaning, unstructured text requires interpretation to extract entities, categories, or other features of interest. Unstructured text data is considered textual information that does not have a predefined data format or schema (Doan et al., 2009). For example, a collection of customer reviews is just a set of sentences or documents with varying lengths and vocabulary, lacking the labeled fields one would find in a structured survey dataset. The lack of inherent structure in text means that raw text must be processed and structured before it can be analyzed quantitatively. This presents several key challenges:

  • Volume and Variety: Text data volumes are often large and growing rapidly (e.g. social media streams, digitized libraries), and the content varies widely in vocabulary and style. Traditional SQL databases struggle with storing and querying such free-form data. Efficient techniques are required to handle large corpora and high-dimensional representations.
  • Noise and Irregularities: Real-world text contains noise such as typos, misspellings, inconsistent abbreviations, and extraneous content (HTML tags, markup, etc.). Data quality issues like these can slow down analyses or lead to incorrect conclusions. Preparing text for analysis requires careful cleaning and standardization.
  • Complex Structure of Language: Human language is complex – the meaning of text depends on context, syntax, and semantics. Simply splitting text on whitespace is not sufficient to capture meaningful units, especially for languages without word boundaries (like Chinese) or text with ambiguous tokenization (e.g. “New York” as one entity or two words). Moreover, many linguistic features (names, dates, negations, etc.) need special handling.
  • High Dimensionality: When text is converted into features (for example, via a bag-of-words model), the dimensionality can be extremely high (thousands of unique terms). A corpus of even a few hundred documents may produce a sparse matrix with tens of thousands of columns (one for each word type). Such high-dimensional, sparse data can be computationally intensive to store and process, requiring efficient sparse matrix operations and possibly feature selection or dimensionality reduction.
  • Lack of Direct Interpretability: Unlike structured data where each feature might have obvious meaning, text-derived features (such as the principal components of a document-term matrix, or topic proportions from a topic model) can be abstract. Interpreting and validating these requires care and often domain knowledge.

Despite these challenges, unstructured text contains rich information. Techniques from natural language processing allow us to structure this text data, extracting features and patterns that can be analyzed statistically or with machine learning. For example, we might turn a collection of documents into a matrix of term frequencies, perform clustering or topic modeling to discover themes, or classify documents by sentiment or topic using supervised learning. The remainder of this chapter details how to carry out these steps in R, a statistical programming environment well-suited for data manipulation and analysis. We start with text preprocessing, the crucial first stage in structuring text.

9.2 Preprocessing Unstructured Text Data

Preprocessing is the process of transforming raw text into a cleaner, more structured form before feature extraction. In practice, this involves multiple sub-tasks: tokenization (splitting text into units such as words), normalization (standardizing text, e.g. lowercasing, removing punctuation), stopword removal (eliminating common words with little semantic content), and optionally stemming or lemmatization (reducing words to their base forms). The goal is to reduce noise and variability while preserving the information content of the text. By applying these steps, we make downstream analysis more effective – for instance, matching words in different cases, or treating “connect”, “connected”, “connecting” as instances of the same root concept via stemming.

Preprocessing can significantly impact the quality of text mining results. According to Forman and Kirshenbaum (2008), poor text cleaning can lead to low-quality data and degraded accuracy in mining results. Thus, a careful, iterative approach to text cleaning is recommended. In R, many of these tasks can be accomplished with existing packages and functions, which we will highlight in the following subsections.

Tokenization: Splitting Text into Meaningful Units

Tokenization is the process of splitting raw text into smaller units called tokens. Typically, tokens are words, but they could also be characters, sentences, or other units depending on the analysis goal. Tokenization is a fundamental first step in almost any NLP task, as it converts a stream of characters into discrete elements that algorithms can work with. Formally, “in tokenization, we take an input string and a token type (a meaningful unit of text, such as a word) and split the input into pieces (tokens) that correspond to that type” (Manning, Raghavan, & Schütze, 2008). For example, tokenizing the sentence “R is great for data analysis.” into word tokens would yield the sequence [“R”, “is”, “great”, “for”, “data”, “analysis”]. Each token can then be treated as a feature or looked up in a vocabulary.

Tokenization may seem straightforward (often using whitespace and punctuation as delimiters), but it has many pitfalls and language-specific nuances. Consider English contractions (“don’t” → “do”, “n’t” or “do”, “not”), hyphenated words, or entities like dates and URLs – a naive tokenizer could improperly split or retain unwanted characters. Furthermore, languages like Chinese or Japanese do not use spaces between words, requiring more advanced methods (e.g. dictionary-based or machine learning segmentation) to tokenize. Even in English, ambiguous cases arise: for instance, “New York-based” might be considered one token (a single entity) or three tokens (“New”, “York”, “based”) depending on context. Robust tokenization tools account for such issues.

In R, several packages provide tokenization functionality:

  • Base R / Simple approaches: One can use strsplit() with a regex to split on non-alphanumeric characters as a quick approach. For example, using strsplit(text, "\\W+") will break a string into tokens on any non-word character (punctuation, spaces). This simple method works in many cases but may incorrectly split certain tokens or remove information (it will discard punctuation entirely, which could be an issue if punctuation carries meaning, e.g., “!” indicating emotion).

  • tokenizers package: The tokenizers package (Mullen et al., 2018) provides a suite of tokenization functions in R for words, sentences, paragraphs, n-grams, etc. It handles details like preserving URLs or hashtags if needed. For example, tokenizers::tokenize_words("Don't tokenize me wrong!") will yield [“don’t”, “tokenize”, “me”, “wrong”] by default, handling the contraction and punctuation. Using a dedicated tokenizer is preferable to manual regex, as it incorporates language-specific rules and has been tested on various edge cases. A short comparison of this tokenizer with a plain regex split appears after this list.

  • tidytext approach: The tidytext package (Silge & Robinson, 2016) offers unnest_tokens(), which can tokenize text within a dataframe column as part of a tidy data workflow. This function can tokenize by words, sentences, or even by regex patterns, and it automatically converts text to lowercase by default (this can be turned off). For example, given a dataframe of lines of text, unnest_tokens(word, text_column) will produce a new table with one token per row. The tidytext tokenizer also has options to handle input formats like HTML or XML by removing markup (using the hunspell tokenizer internally).

  • quanteda package: The quanteda package (Benoit et al., 2018) contains a fast C++ tokenizer accessible via quanteda::tokens(). It supports advanced features like removing or keeping punctuation, lowercasing, stemming, and n-gram generation in one step as part of the tokenization process. For instance, tokens(c("Example sentence.", "Another one?"), remove_punct = TRUE) will tokenize both sentences and drop punctuation.

  • SpaCy via spacyr: For more linguistically sophisticated tokenization (including multilingual support), R can interface with the spaCy library (Honnibal et al., 2020) through the spacyr package. SpaCy’s tokenizer is built on extensive rules and machine learning to handle a variety of languages and tricky cases. Using spacy_parse() from spacyr will tokenize text and also return lemmas, part-of-speech, and entities (if enabled) in a single pass. For example:

    library(spacyr)
    spacy_initialize(model = "en_core_web_sm")  # load English model
    txt <- "Mr. Smith spent two years in North Carolina."
    parsed <- spacy_parse(txt, lemma = TRUE, pos = TRUE, entity = TRUE)
    parsed
    ##    doc_id sentence_id token_id    token lemma   pos   entity
    ## 1      1           1        1      Mr.    mr.  PROPN        
    ## 2      1           1        2    Smith  smith  PROPN PERSON_B
    ## 3      1           1        3    spent  spend  VERB        
    ## 4      1           1        4      two    two   NUM   DATE_B
    ## 5      1           1        5    years   year  NOUN   DATE_I
    ## 6      1           1        6       in     in   ADP        
    ## 7      1           1        7    North  North PROPN   GPE_B
    ## 8      1           1        8 Carolina Carolina PROPN   GPE_I
    ## 9      1           1        9        .      . PUNCT        

    Here, spaCy correctly kept “Mr.” as one token (not splitting at the period) and combined “North Carolina” into a single named entity for location (GPE = Geo-Political Entity). This illustrates how advanced tokenizers can handle edge cases and simultaneously perform other NLP tasks.
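
As a quick contrast between the first two approaches listed above, the following sketch applies a plain regex split and tokenizers::tokenize_words() to the same example sentence (the outputs shown as comments are indicative and may vary slightly across package versions):

txt <- "Don't tokenize me wrong!"
strsplit(txt, "\\W+")[[1]]
# "Don" "t" "tokenize" "me" "wrong"    <- the contraction is broken apart
library(tokenizers)
tokenize_words(txt)[[1]]
# "don't" "tokenize" "me" "wrong"      <- contraction kept intact, lowercased by default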

Tokenization lays the foundation for structuring text. By segmenting text into consistent units, we enable further analysis like counting word frequencies, computing document-term matrices, or identifying more complex structures (phrases, entities). It is important to choose a tokenization strategy appropriate for the data and task. For instance, when analyzing tweets, one may want to treat “#DataScience” or “😊” as tokens (hashtags and emoji conveying meaning), whereas a simple whitespace tokenizer would not. R’s text processing ecosystem provides the flexibility to customize tokenization as needed.

Normalization: Cleaning and Standardizing Text

Once text is tokenized, a series of normalization steps are typically applied to standardize the tokens and remove irrelevant material. These steps address issues like inconsistent casing, extraneous punctuation, or content that is not useful for analysis (e.g. URLs). Common normalization operations include:

  • Lowercasing: Converting all text to lower case ensures that “Data” and “data” are treated the same. This is almost always done, except in cases where capitalization carries meaning (e.g., proper nouns in Named Entity Recognition). Lowercasing is easily done in R with tolower() on tokens or via tokenizer options (the tokenizers and tidytext functions lowercase by default, whereas quanteda::tokens() preserves case; apply tokens_tolower() afterwards, or rely on dfm(), which lowercases by default).
  • Removing Punctuation: Punctuation characters (commas, periods, etc.) are usually removed from tokens, as they typically do not carry meaning for bag-of-words analyses. For example, after tokenization, a token like “analysis.” can be stripped of the period to become “analysis”. In R, one can use gsub("[[:punct:]]+", "", token) or utilize built-in functions: tm_map(..., removePunctuation) in the tm package, or tokens(..., remove_punct = TRUE) in quanteda, or unnest_tokens(..., strip_punct = TRUE) in tidytext. Caution is needed for cases like emoticons or contractions, but in many corpora punctuation can be safely dropped.
  • Removing Numbers: If numbers (digits) are not relevant to analysis, they can be removed similarly (removeNumbers in tm or via regex). For example, if analyzing topics of articles, the presence of years or product model numbers may add noise. However, in some contexts numbers do matter (e.g., financial texts), so this step is task-dependent.
  • Stripping Whitespace: Extra whitespace (multiple spaces, tabs, newlines) can be normalized to single spaces or removed. The tm package provides stripWhitespace, while base R’s gsub("\\s+", " ", text) can replace multiple whitespace with a single space. Leading or trailing whitespace should also be trimmed.
  • Removing URLs, HTML tags, or Special Characters: In web-scraped text or social media data, tokens may include fragments of HTML (<div>, &nbsp;) or URLs (http://...). These can be removed with custom regex substitution. For instance, the snippet gsub('http\\S+\\s*', '', text) will remove URLs, and similar patterns can remove HTML tags or non-printable characters. The hunspell tokenizer or rvest::html_text() can also help by extracting text content from HTML.
  • Handling Accents or Unicode: Depending on encoding, it may be useful to normalize accented characters to their ASCII equivalents (é → e) or ensure consistent encoding (UTF-8). The stringi package has functions like stri_trans_general(text, "Latin-ASCII") to remove diacritics.
  • Stopword Removal: Stopwords are common words that are often filtered out before analysis because they carry little unique information about document content. These include articles, prepositions, and very frequent verbs (e.g., “the”, “and”, “of”, “to”, “is”). Removing stopwords can significantly reduce the dimensionality of the data and improve focus on meaningful words. R provides stopword lists, such as stopwords("en") in the stopwords or tm packages (based on multiple sources like Snowball or SMART). For example, after tokenizing, one can do tokens_remove(tokens_obj, stopwords("english")) in quanteda, or use tm_map(corpus, removeWords, stopwords("english")) with tm, or an anti_join with tidytext (joining tokens against a stopword list and keeping those not in the list). It is also possible to define a custom stopword list to include domain-specific frequent terms or exclude certain words that standard lists remove (for example, “not” is in some stopword lists but one might keep it when sentiment analysis is of interest). By removing stopwords, we drop words that “do not provide much valuable information” – for instance, nearly every English document will contain “the”, so this word has little value in distinguishing documents.
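
For the tidy-workflow variant of stopword removal mentioned above, the filtering is a simple anti-join. The sketch below is illustrative: it assumes a one-token-per-row data frame named tokens_df with a column word (both names are placeholders), and uses tidytext’s built-in stop_words table.

library(dplyr)
library(tidytext)
data("stop_words")                      # tidytext's combined stopword lexicons (word, lexicon)
tokens_nostop <- tokens_df %>% anti_join(stop_words, by = "word")

# To keep "not" for sentiment-oriented analyses, edit the stopword list first
my_stops <- stop_words %>% filter(word != "not")
tokens_keep_not <- tokens_df %>% anti_join(my_stops, by = "word")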

An illustration of these steps in R using the tm package (one of the older text mining libraries) might look like:

library(tm)
txt <- c("This  is   an Example!! There's a URL: https://example.com/page ",
         "<b>HTML text</b> with numbers 1234 and symbols $#*&.")
corp <- VCorpus(VectorSource(txt))
# Apply a series of transformations
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, content_transformer(function(x) gsub('http\\S+\\s*', '', x)))  # remove URLs
corp <- tm_map(corp, content_transformer(function(x) gsub('<.*?>', '', x)))         # remove HTML tags
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, removeWords, stopwords("english"))
inspect(corp[[1]])
# Resulting text: " example theres url "
# ("example" and "url" remain; "there's" became "theres" after punctuation removal;
#  "this", "is", "an", and "a" were removed as stopwords. In practice it is safer to
#  strip URLs before removing punctuation so that the URL pattern still matches.)

This pipeline aggressively removed punctuation, extra spaces, numbers, a URL, HTML tags, and stopwords. In practice, modern workflows often use tidytext or quanteda which have more efficient and flexible ways to do these operations (and do not require the baggage of the tm Corpus object). For example, using quanteda:

library(quanteda)
toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE, remove_url = TRUE) %>%
        tokens_tolower() %>%
        tokens_remove(stopwords("en"))
toks[[1]]
# tokens for the first text, approximately: c("example", "there's", "url")

Quanteda’s tokens() handled the punctuation, numbers, and URL in a single call; we then lowercased the tokens and removed stopwords. Note that, unlike tm’s removePunctuation, quanteda keeps a contraction such as “there’s” together as a single token rather than mangling it into “theres”; whether that token is then dropped depends on the stopword list (the Snowball list includes “there's”) and on apostrophe encoding, so one might add a custom step to normalize apostrophes or handle contraction remnants, depending on analysis needs.

It is often advisable to inspect the effect of preprocessing at each step, as overzealous cleaning can remove meaningful information. For example, removing punctuation without careful thought can merge or split tokens (as seen above, where “there’s” became “theres” in the tm pipeline). Similarly, removing stopwords should only be done after considering the analysis goals – in some text classification tasks, stopwords actually carry class-specific signal (imagine authorship attribution, where function-word frequencies matter). Researchers sometimes create domain-specific stopword lists, or skip stopword removal altogether for tasks like topic modeling if removing them leaves topics dominated by the remnants of fixed phrases.

To summarize, normalization transforms tokens to a standard, reduced-noise form: lowercase, minimal punctuation, no irrelevant symbols, no common stopwords. This results in a cleaner set of tokens that better represent the unique content of the text. The outcome of these steps is often a list of normalized tokens per document, which can then be used to construct structured representations like the document-term matrix.

Stemming and Lemmatization: Reducing Words to Base Forms

Human language exhibits inflection and derivation – words appear in different forms (play, plays, playing, played; analysis, analyses; beautiful, beautifully) that are related semantically. Stemming and lemmatization are techniques to reduce related words to a common base or root, which can be useful for reducing dimensionality and grouping similar terms. The motivation is that, for many text analysis purposes, the specific morphological form of a word is less important than its base meaning. For example, if we are doing topic analysis or search, treating “connect”, “connected”, “connecting” as the same term “connect” can improve the results by aggregating frequencies. However, these techniques come with trade-offs and must be applied carefully.

  • Stemming: This is a rule-based process that chops off word endings in an attempt to reach the root form (the “stem”). Stemming algorithms apply heuristics to remove suffixes (and sometimes prefixes) from words. The most widely used English stemmer is the Porter stemmer (Porter, 1980); its successor, the Snowball stemmer (also by Porter), is available for many languages. Stemming is a crude heuristic: it does not guarantee a valid word as output, just a stem that groups similar words. For example, Porter stemming reduces both “university” and “universe” to “univers”, a stem that is not a real word and that conflates two distinct concepts. This is known as over-stemming: two different words are erroneously reduced to the same stem. Conversely, under-stemming occurs when related words are not reduced to the same root due to algorithm limitations (e.g., “data” and “datum” are left as distinct stems, failing to unify them). Despite these issues, stemming is fast, reduces words to a common root by stripping suffixes, and can significantly shrink the vocabulary. In R, stemming can be performed using the SnowballC package, which provides wordStem(). Packages like tm and quanteda integrate with SnowballC; for example, tm_map(corpus, stemDocument, language = "english") will stem each word in a corpus, and tokens_wordstem(tokens_object, language = "english") in quanteda stems tokens (a short stemming sketch follows this list).

  • Lemmatization: This is a more sophisticated process that involves determining the lemma of a word, i.e., its dictionary form, using lexical knowledge. Lemmatization attempts to produce valid root words (lemmas) that appear in dictionaries, by considering the word’s part of speech and morphological rules. For instance, the lemma of “went” is “go”, of “better” is “good” (as an adjective), and “cars” -> “car”. Lemmatization requires either a lookup dictionary or a model of language morphology. It usually yields more interpretable results than stemming (since the outputs are actual words), but it is more computationally intensive and language-specific. As an example, where a stemmer might turn “running” -> “run” (by chopping off the suffix), a lemmatizer would likely also return “run”, but it would additionally know that “better” -> “good” (a stemmer might leave “better” as is, or incorrectly chop it). Lemmatization “identifies the correct base forms of words using lexical knowledge bases”, thus overcoming some of the errors of stemming and making the results more interpretable.
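
As referenced in the stemming bullet above, here is a brief sketch of stemming with the SnowballC package; the outputs shown as comments illustrate both the intended conflation of word variants and the over-stemming example discussed earlier.

library(SnowballC)
wordStem(c("connect", "connected", "connecting"), language = "english")
# "connect" "connect" "connect"    <- inflected variants collapse onto one stem
wordStem(c("university", "universe"), language = "english")
# "univers" "univers"              <- over-stemming: two unrelated words share a stem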

In R, lemmatization can be achieved through a few avenues:

  • The textstem package provides functions like lemmatize_words() and lemmatize_strings(). It uses a lemma lookup dictionary (drawn by default from the lexicon package) to map inflected forms to their lemmas; lemmatize_strings() applies the same mapping to whole strings (documents) rather than vectors of words, and is what the Tilburg Science Hub tutorial, for example, uses to lemmatize each document in a corpus. For instance, lemmatize_words(c("running", "mice", "children")) would return c("run", "mouse", "child"). It is straightforward and works for many common words.
  • The spaCy engine via spacyr can also produce lemmas, as seen earlier – by default spacy_parse(..., lemma = TRUE) includes a lemma column for each token. For example, spaCy knows the lemma of “spent” is “spend” and of “years” is “year”, as shown in the parsed output above.
  • The UDPipe models (Universal Dependencies) have R wrappers and can provide lemmas and parts of speech for many languages.
  • The koRpus package and WordNet via wordnet package are other options for lemmatization or morphological analysis in R, though less commonly used in recent workflows.

When to use stemming vs lemmatization depends on the task and resources:

  • Stemming is easier to implement (no external linguistic resources needed) and faster. If slight inaccuracies in word forms are acceptable and you mostly need to reduce dimensionality, stemming is a pragmatic choice. For example, in a simple search application or when creating a DTM, stemming can lump variants together (“connection”, “connections”, “connected” -> “connect”).
  • Lemmatization is preferred when you need linguistically precise normalization and actual words. For tasks like coreference resolution, lexical semantics, or when presenting results to humans (e.g., topic labels), lemmatized forms are more readable. Lemmatization is also language-specific; one needs a lemmatizer for each language of interest.

One could also choose to do neither stemming nor lemmatization. In some text mining studies, especially on short texts or when nuances matter, leaving words in their original form is beneficial. Stemming/lemmatization can collapse distinctions that are meaningful. For instance, the words “organization” and “organism” share a prefix but are unrelated; an aggressive stemmer might erroneously chop both to “organ”. Domain knowledge should guide whether normalization to base forms is appropriate.

In R, applying stemming or lemmatization is typically done after tokenization and basic cleaning, but before constructing the document-term matrix. Using tidytext, one could do:

library(tidytext)
library(textstem)
library(dplyr)
tokens_df <- tibble(text = c("Cats running happily", "A child went home")) %>%
  unnest_tokens(word, text)
tokens_df$lemma <- lemmatize_words(tokens_df$word)
tokens_df
# word      lemma
# "cats"    "cat"
# "running" "run"
# "happily" "happily"
# "a"       "a"
# "child"   "child"
# "went"    "go"
# "home"    "home"

Here “cats” -> “cat”, “went” -> “go”. The word “happily” remained “happily” because the lemmatizer likely treats it as already a base adverb or has no rule for it (a more advanced lemmatizer might reduce “happily” to “happy” if considering derivational morphology, but many only handle inflectional morphology).

Overall, stemming and lemmatization are techniques to reduce vocabulary size and normalize word forms. Stemming is a quick heuristic that may sacrifice some accuracy for speed, whereas lemmatization uses linguistic knowledge for more accurate base forms. Many analyses (e.g., topic modeling, general text classification) benefit from one of these techniques as they conflate word variants and reduce sparsity in the data. However, it’s important to monitor if meaning is unintentionally lost – one should evaluate results both with and without stemming/lemmatization if possible.

After completing the preprocessing steps – tokenizing the text, cleaning and normalizing tokens, and optionally stemming or lemmatizing – our unstructured text is now converted into a structured form of normalized tokens. This is often stored as a list or table of tokens per document, or a corpus object in quanteda. The next step is to turn this into a numerical representation suitable for statistical analysis: the document-term matrix.

9.3 From Text to Features: Document-Term Matrix and TF–IDF

A core objective in structuring text data is to create a numerical feature representation that algorithms can work with. The most classic and widely used representation is the Document-Term Matrix (DTM), also known as a Document-Feature Matrix (DFM) in some literature (e.g., quanteda); its transpose is the Term-Document Matrix (TDM). A DTM is a matrix that quantifies the occurrence of terms in each document: rows correspond to documents, columns correspond to terms (tokens), and entries are the frequency (or weight) of a term in a document. This matrix provides a structured, algebra-friendly representation of the corpus, enabling further analyses like statistical modeling, clustering, or visualization.

Constructing a Document-Term Matrix

To build a DTM, one starts with a set of processed tokens for each document (after the preprocessing described above). Each unique token (typically after normalization) becomes a column in the matrix, and each document becomes a row. The cell value is often the term frequency (TF) – the count of how many times that token occurred in that document. Alternatively, it could be a binary indicator (1 if the term appears at least once, 0 if not), or a weighted value (like TF–IDF, discussed shortly).

For example, imagine a tiny corpus of three documents:

  1. Document 1: “data science is fun”
  2. Document 2: “data analysis and science”
  3. Document 3: “science of data”

After preprocessing (lowercasing, removing stopwords like “of”, perhaps no stemming needed here), suppose our vocabulary is: {analysis, data, fun, science}. The DTM (with simple term frequencies) would look like:

Document    analysis   data   fun   science
Doc1               0      1     1         1
Doc2               1      1     0         1
Doc3               0      1     0         1

Here, Doc1 contains “data”, “science”, “fun” once each (ignoring “is” if it was removed); Doc2 contains “data”, “analysis”, “science”; Doc3 contains “science”, “data”. This structured representation can now be used to compute similarities between documents, train classifiers (with documents as feature vectors), etc.

In R, creating a DTM is straightforward with packages:

  • quanteda: The function dfm() (document-feature matrix) takes a tokens object (or, in older versions, a corpus) and produces a sparse matrix. In current quanteda (version 3 and later) the recommended pattern is to tokenize first and then build the matrix, e.g. my_dfm <- dfm(tokens(corp, remove_punct = TRUE)) followed by dfm_remove(my_dfm, stopwords("en")) and, if desired, dfm_wordstem(); earlier versions allowed passing tolower, stem, and remove arguments directly to dfm(). Quanteda’s DFM is an efficient sparse matrix (using the Matrix package under the hood) with additional functionality for weighting and trimming; a compact end-to-end sketch follows this list.

  • tidytext: With tidytext, one would typically unnest tokens, possibly count them, and then use the cast_dtm() or cast_dfm() function (which can cast a tidy table of counts into a Matrix or tm’s DocumentTermMatrix). For instance:

    library(tidytext)
    dtm <- tokens_df %>% count(document_id, token) %>% 
            cast_dtm(document_id, token, n)

    where tokens_df is a data frame of one-token-per-row along with a document identifier. This uses the tm package’s DocumentTermMatrix under the hood. Alternatively, one could use DocumentTermMatrix directly by first creating a Corpus in tm, etc., but quanteda and tidytext provide higher-level, faster mechanisms.

  • tm: The classic approach is TermDocumentMatrix or DocumentTermMatrix on a Corpus with a control list for preprocessing. For example:

    DTM <- DocumentTermMatrix(corpus, control = list(tolower=TRUE, removePunctuation=TRUE,
                                                     stopwords=TRUE, stemming=FALSE))

    This will produce a sparse matrix (from the tm package). However, tm is less efficient with large corpora compared to quanteda.
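
Putting these pieces together, the sketch below reproduces the toy DTM from earlier with quanteda (assuming quanteda version 3 or later; the printed output is abbreviated and shown as comments):

library(quanteda)
docs <- c(Doc1 = "data science is fun",
          Doc2 = "data analysis and science",
          Doc3 = "science of data")
toy_dfm <- tokens(docs, remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm()
toy_dfm
# Document-feature matrix of: 3 documents, 4 features (33.33% sparse) and 0 docvars.
#        features
# docs    data science fun analysis
#   Doc1     1       1   1        0
#   Doc2     1       1   0        1
#   Doc3     1       1   0        0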

One important characteristic of DTMs is their sparseness. In text, each document contains only a small subset of the entire vocabulary. As a result, DTMs are highly sparse (most entries are 0). It is not uncommon to have sparsity > 99% for large corpora. For example, a corpus of 2,246 Associated Press news articles with a vocabulary of 10,473 terms has 99% of its DTM cells as zeros. Working with sparse matrices is essential – R’s Matrix package supports this, and quanteda’s dfm is built on it, allowing efficient storage and operations. Functions exist to trim the DTM by removing rare terms (columns that appear in very few documents) or extremely common terms (which appear in almost all documents), since those may be less useful. For instance, quanteda has dfm_trim(dtm, min_docfreq = ..., max_docfreq = ...) or dfm_remove(dtm, pattern = ...) to drop terms by frequency.
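
Continuing the toy example above, a minimal trim might keep only terms that appear in at least two documents (here, “data” and “science”), which is how very rare features are typically dropped in practice:

toy_dfm_trimmed <- dfm_trim(toy_dfm, min_docfreq = 2)
toy_dfm_trimmed   # retains "data" and "science"; "fun" and "analysis" occur in only one document each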

Once a DTM is built, a basic analysis could involve finding the most frequent terms in the corpus, or the terms with high co-occurrence, etc. But raw term frequencies have limitations: they don’t account for the fact that some terms might appear in many documents (e.g., the word “data” might be in all documents of a data science corpus and thus not distinguishing between them). This is where term weighting like TF–IDF comes in.

Term Frequency–Inverse Document Frequency (TF–IDF) Weighting

TF–IDF (Term Frequency–Inverse Document Frequency) is a weighting scheme that adjusts raw term frequency by a measure of how informative or rare a term is across the whole corpus. The intuition is: a term is important for a document if it occurs frequently in that document (high TF), but it is less useful as a discriminator if it occurs in many documents (high DF, thus low IDF). TF–IDF aims to highlight words that are characteristic of a document relative to the corpus.

  • Term Frequency (TF): This is simply the count of term t in document d, often denoted \(\text{tf}_{t,d}\). Sometimes a normalized frequency is used (e.g., frequency divided by document length, or log-scaled frequency), but the simplest is raw count.
  • Document Frequency (DF): The number of documents in the corpus that contain term t. If DF is high, the term is very common across documents.
  • Inverse Document Frequency (IDF): Defined as \(\text{idf}_t = \log(\frac{N}{\text{df}_t})\), where N is the total number of documents; smoothed variants such as \(\log(\frac{N}{1 + \text{df}_t})\) or \(1 + \log(\frac{N}{\text{df}_t})\) are sometimes used to avoid division by zero or zero weights. The effect is that rarer terms (low df) get a higher IDF, whereas common terms (high df) get a lower IDF, approaching 0 for terms that occur in every document.

The TF–IDF score for term t in document d is \(\text{tfidf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_{t}\). So if a word appears frequently in a document but rarely elsewhere, its TF–IDF will be high, indicating it is important for that document. Conversely, a common word (high df) will get a low weight even if it has high term frequency.

In summary, “tf–idf is a measure of the importance of a word to a document in a collection, adjusted for the fact that some words appear more frequently in general”. It refines the simple bag-of-words model by scaling down the contribution of ubiquitous terms and scaling up rare ones. In practice, this often yields better features for tasks like information retrieval and text mining. In fact, historically, TF–IDF was a core component in document ranking for search engines and is still widely used in recommender systems and library science.

To illustrate, consider the corpus of Document1, Document2, Document3 from earlier. Suppose “data” appears in all 3 docs, “science” in all 3, “analysis” only in Doc2, “fun” only in Doc1. If we compute IDF (using log base 10 for example):

  • N = 3.
  • df(“data”) = 3 -> idf(“data”) = log(3/3) = log(1) = 0.
  • df(“science”) = 3 -> idf(“science”) = 0.
  • df(“analysis”) = 1 -> idf ~ log(3/1) = log(3) ~ 0.477.
  • df(“fun”) = 1 -> idf ~ log(3/1) = 0.477.

Then TF–IDF weights:

  • For “data” in any document: tf * 0 = 0. So “data” is basically considered not informative (it’s everywhere).
  • For “science”: also will get weight 0 for all.
  • “analysis” in Doc2: tf=1 * 0.477 = 0.477.
  • “fun” in Doc1: tf=1 * 0.477 = 0.477.

Thus, TF–IDF would highlight “fun” as the distinctive term of Doc1 and “analysis” as that of Doc2, whereas “data” and “science” (although frequent) are not distinguishing. This aligns with intuition: if every document is about data and science, those words don’t tell us which document is which, but “fun” vs. “analysis” do.

In R, computing TF–IDF is very easy once you have a DTM:

  • tidytext: The package provides the bind_tf_idf() function. If you have a tidy table of document, term, count, you can do: tfidf_table <- counts_table %>% bind_tf_idf(term, document, count). This adds columns for tf (the term’s share of its document), idf (computed with the natural logarithm, without smoothing), and tf_idf. Chapter 3 of Silge and Robinson’s Text Mining with R demonstrates using bind_tf_idf() on tidy word counts to find the words most important to each document; a worked sketch on the toy corpus above follows this list.

  • quanteda: One can apply dfm_tfidf(dtm, scheme_tf = "count", scheme_df = "inverse") on a dfm, which will return a new dfm with tf–idf values in place of the raw counts (these schemes are also the defaults). By default, quanteda uses log base 10 for the IDF. For example:

    my_dfm <- dfm(tokens(corp, remove_punct = TRUE)) %>%
              dfm_remove(stopwords("en"))
    tfidf_dtm <- dfm_tfidf(my_dfm)

    Now tfidf_dtm contains weights. We could retrieve, say, the top weighted terms for each document to see which words are most characteristic.

  • tm: If using tm’s DocumentTermMatrix, there is a weightTfIdf() function. One would do DTM_tfidf <- weightTfIdf(DTM). Under the hood, it normalizes term frequencies by document length (when normalize = TRUE, the default) and uses a base-2 logarithm for the IDF. The result is still a sparse DocumentTermMatrix, which can be converted with as.matrix() or examined with inspect() to view the weights.
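
As a worked check against the hand calculation above, the following sketch computes TF–IDF for the toy corpus with tidytext. Note that bind_tf_idf() uses the natural logarithm, so the idf values differ from the base-10 figures above, but the ranking of terms is the same (get_stopwords() requires the stopwords package to be installed):

library(dplyr)
library(tidytext)
toy <- tibble(document = c("Doc1", "Doc2", "Doc3"),
              text = c("data science is fun",
                       "data analysis and science",
                       "science of data"))
toy %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(), by = "word") %>%   # drops "is", "and", "of"
  count(document, word) %>%
  bind_tf_idf(word, document, n) %>%
  arrange(desc(tf_idf))
# "fun" (Doc1) and "analysis" (Doc2) receive the highest tf-idf weights;
# "data" and "science" get idf = 0 because they occur in every document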

It’s worth noting that TF–IDF is a heuristic. While powerful, it’s not always ideal for every situation. For instance, in very small corpora, IDF might overemphasize rare terms that could be typos. Also, if the corpus is very large, terms that appear in, say, 30% of documents might still be informative even though IDF penalizes them. Nonetheless, TF–IDF remains a popular baseline for feature extraction in text classification and information retrieval due to its simplicity and effectiveness.

In many real-world scenarios, one might use TF–IDF features as input to a machine learning model. For example, one could compute a TF–IDF matrix for a set of documents and then use those features in a classifier to predict topics or sentiment. TF–IDF often improves accuracy over raw term frequencies for these tasks, because it balances local importance (term frequency in document) with global uniqueness (inverse document frequency).

To recap this section: the document-term matrix (DTM) is the fundamental structured representation of text, converting text into a numeric matrix. It can be used directly with counts or with weighting schemes like TF–IDF that highlight important terms. Using R’s quanteda or tidytext makes it straightforward to go from a corpus to a DTM and to compute TF–IDF weights. At this point, the unstructured text has been transformed into a structured dataset (matrix of features) suitable for analysis. The next sections will explore various analyses and modeling techniques that can be performed on this structured text data, including tidy workflows, named entity recognition, topic modeling, sentiment analysis, and classification.

9.4 Tidy Text Framework and Workflow in R

In the process of structuring text data, it is often advantageous to use a tidy data approach. Tidy data, as defined by Wickham (2014), means organizing data such that each variable is a column, each observation is a row, and each type of observational unit is a table. Applying this principle to text mining gives rise to the tidy text format, where the observational unit is typically the token (word) instance. In other words, a corpus in tidy text format is represented as a table with one token per row, along with associated variables like document ID, sentence ID, or other metadata for that token.

For example, a tidy representation of three short documents might look like:

document_id token
1 data
1 science
1 fun
2 data
2 analysis
2 science
3 science
3 data

This tabular format allows leveraging all the powerful tools of the R tidyverse (dplyr for manipulation, ggplot2 for visualization, etc.) directly on text data. The tidytext package (Silge & Robinson, 2016) is built around this idea, providing functions to convert between raw text, statistical text mining structures, and tidy tibbles of tokens. The authors explain that tidytext “provides functions and supporting data sets to allow conversion of text to and from tidy formats, and to seamlessly switch between tidy tools and existing text mining packages.” This means you can do part of your analysis using tidyverse operations, then cast your data into a DocumentTermMatrix for algorithms that need it, then bring the results back into a tidy form for visualization.

Key components of tidytext workflow:

  • Tokenization with unnest_tokens(): This function is central to tidytext. Starting from a dataframe that has a text column (where each row might be a document or a line), unnest_tokens(output_col, input_text_col, token = "words") will produce a new dataframe with one row per token, dropping the original text column. It handles lowercasing and removing punctuation by default (with options to disable or to choose different tokenization units like "sentences", "ngrams", etc.). As a simple example:

    library(tidytext)
    library(dplyr)
    text_df <- tibble(doc = c(1,2), text = c("R is great for data analysis!", "Data Science is fun."))
    tokens <- text_df %>% unnest_tokens(word, text)
    print(tokens)
    # A tibble: 10 x 2
    #    doc word
    #   <dbl> <chr>
    # 1     1 r
    # 2     1 is
    # 3     1 great
    # 4     1 for
    # 5     1 data
    # 6     1 analysis
    # 7     2 data
    # 8     2 science
    # 9     2 is
    # 10    2 fun

    We see document 1 tokenized into 6 words (punctuation removed, case lowered), document 2 into 4 words. If stopword removal is desired, one could anti_join(tokens, stop_words, by="word") where stop_words is a tidytext-provided dataset of stopwords from multiple lexicons.

  • Tidy operations: Once tokenized, one can use dplyr verbs like count() to get term frequencies, group_by() to aggregate per document or per token, filter() to remove certain tokens (like stopwords or tokens by length), mutate() to add new info (like categorizing tokens, tagging parts of speech if you have that info). The tidy approach makes it intuitive to perform operations like: “find the top 10 most frequent words in each category of documents”, or “compute the pairwise correlation between word occurrences,” etc., using familiar data manipulation grammar.

  • Joining with lexicons (sentiment, etc.): Tidytext includes built-in datasets for sentiment analysis (e.g., NRC, Bing, AFINN lexicons). These are in tidy format (a column for word, a column for sentiment category or score). You can join your tokens with these lexicons to attach sentiment values or categories to each token. For example:

    sentiments <- get_sentiments("bing")  # two columns: word, sentiment (positive/negative)
    tokens_sentiment <- tokens %>% inner_join(sentiments, by="word")

    This would keep only tokens that are associated with a sentiment, labeling them as positive/negative.

  • Casting to matrix or other structures: As mentioned, the result of tidy operations can be cast to other formats. cast_dtm() and cast_dfm() allow conversion to sparse matrices (for use with tm or quanteda or machine learning algorithms). Conversely, tidytext provides tidy() methods for objects from packages like topicmodels (LDA results) and tm to convert them back into tibbles for analysis or visualization. For instance, after performing LDA (topic modeling) with the topicmodels package, one can do tidy(lda_model, matrix="beta") to get a tidy table of word-topic probabilities for each term in each topic, which can be then manipulated or plotted with ggplot2.
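
For instance, assuming the tokens table from the unnest_tokens() example above, the tidy counts can be cast to a DocumentTermMatrix and tidied back again (cast_dtm() requires the tm package to be installed):

library(dplyr)
library(tidytext)
counts <- tokens %>% count(doc, word)          # one row per document-word pair
dtm <- counts %>% cast_dtm(doc, word, n)       # a sparse DocumentTermMatrix (2 documents)
tidy(dtm)                                      # back to a tidy table: document, term, count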

The advantages of the tidy approach are clear in making analysis code more readable and integrating naturally with general data analysis workflows. Instead of dealing with custom data structures at every step (Corpus, DTM, etc.), analysts work with data frames (tibbles) and can use consistent tools. Silge and Robinson (2016) emphasize that tidytext enables text mining with the same toolkit used for other data, making text analysis “easier, more effective, and consistent with tools already being used widely” for data science. It lowers the barrier to entry for analysts not specialized in NLP, and it allows combining text data with other data sources in a unified manner (e.g., a dataset of customer reviews can be tokenized and analyzed alongside structured customer metadata in the same pipeline).

To give a practical sense, here is a sample workflow using tidytext on a public dataset: the Jane Austen novels (available in the janeaustenr package). Suppose we want to find the most important words (by TF–IDF) in each novel:

library(janeaustenr)
library(tidytext)
library(dplyr)
library(stringr)

# Get tidy format of Austen novels: one row per line per novel
books <- austen_books() %>%
  group_by(book) %>%
  mutate(line = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter", ignore_case=TRUE)))) %>%
  ungroup()

# Unnest tokens
tidy_books <- books %>% unnest_tokens(word, text)

# Remove stopwords
data("stop_words")
tidy_books <- tidy_books %>% anti_join(stop_words, by="word")

# Calculate tf-idf for words in each novel
book_word_counts <- tidy_books %>% count(book, word, sort=TRUE)
book_tfidf <- book_word_counts %>%
  bind_tf_idf(word, book, n) %>%
  arrange(desc(tf_idf))

book_tfidf %>% select(book, word, tf_idf) %>% slice_max(tf_idf, n=10, by="book")

This would output the top 10 TF–IDF words for each Jane Austen novel, which typically are character names or places unique to each novel (since common words across all novels get low IDF). Indeed, we might see for Pride and Prejudice top words like “Elizabeth”, “Darcy”, etc., and for Mansfield Park words like “Fanny” or “Bertram” – those are distinctive to each novel and TF–IDF surfaces them.

The tidytext approach is not exclusive; it works in tandem with other methods. One could use tidytext to do initial exploration and cleaning, then feed the results into quanteda for faster DTM construction or into a TensorFlow model via keras. The ability to “switch between tidy tools and existing text mining packages” is by design. For example, one might use tidytext to unnest and filter tokens, then cast_dfm() to get a quanteda dfm and use quanteda’s textstat_* functions for keyword analysis, then convert those results back to a tibble for plotting.

In summary, the tidytext framework brings the advantages of tidy data and the tidyverse to NLP. It provides an intuitive, consistent interface for structuring text (one-token-per-row), which complements and connects to the more specialized text mining techniques. By structuring text data in a tidy format, one can more easily integrate text with other data sources and utilize the rich ecosystem of R packages for data manipulation and visualization on text-derived data. This enhances reproducibility and clarity of text analysis workflows.

Having covered preprocessing and representation of text, including a tidy approach, we now move on to specific analytical techniques applied to structured text data: recognizing named entities, discovering topics via topic modeling, analyzing sentiment, and performing text classification. Each of these adds a layer of semantic understanding or predictive power on top of the structured forms we have created.

9.5 Named Entity Recognition (NER) in R

Named Entity Recognition (NER) is the task of identifying and classifying named entities in text into predefined categories such as persons, organizations, locations, dates, etc. It is a fundamental NLP task that adds semantic structure to unstructured text by highlighting the real-world entities mentioned. As spaCy’s documentation succinctly puts it, “a named entity is a ‘real-world object’ that’s assigned a name – for example, a person, a country, a product, or a book title”. NER systems seek out those spans of text (often proper nouns or numerical expressions) and tag them with labels (PERSON, ORG, GPE (geo-political entity), DATE, MONEY, etc.). For instance, in the sentence “Barack Obama, the former President of the United States, was born in Hawaii on August 4, 1961.”, an NER system might recognize Barack Obama as a PERSON, United States as a GPE (country), Hawaii as a location (GPE), and August 4, 1961 as a DATE.

Performing NER typically involves more than just simple pattern matching; it often requires statistical or rule-based models that consider context (to distinguish, say, “May” as a month vs. part of a name like “Theresa May”). In R, NER can be tackled through a few approaches:

  • spaCy via spacyr: The spacyr package is an R interface to spaCy, a modern NLP library in Python with state-of-the-art NER for several languages. spaCy’s English models, for example, recognize more than a dozen entity types (PERSON, ORG, GPE, LOC, PRODUCT, EVENT, etc.). Using spacyr, after parsing text (spacy_parse() as shown earlier), one can either inspect the entity column or use helper functions. The output of spacy_parse(..., entity = TRUE) includes an entity tag for tokens that are part of an entity, using IOB notation (B-begin, I-inside, O-outside) together with the type (e.g., “PERSON_B” for the beginning of a person name). There is also a convenience function entity_extract(parsed_data) which will pull out full entity spans with their type. For example:

    parsed <- spacy_parse(c("Mr. Smith spent two years in North Carolina.",
                            "Apple is looking at buying U.K. startup for $1 billion"),
                          entity = TRUE)
    entities <- entity_extract(parsed)
    print(entities)
    #    doc_id sentence_id         entity entity_type
    # 1       1           1          Smith      PERSON
    # 2       1           1 North Carolina         GPE
    # 3       2           1          Apple         ORG
    # 4       2           1           U.K.         GPE
    # 5       2           1       $1 billion       MONEY

    Here we see Smith identified as a PERSON, North Carolina as a geopolitical entity (location), Apple as an ORG (organization), U.K. as a GPE, and $1 billion as MONEY. spaCy did this using its statistical model. We can trust these annotations to a large extent, although no NER is perfect (and models can confuse entity boundaries or types, especially with ambiguous names or out-of-context mentions).

    The spacyr approach is quite powerful; it essentially brings a high-quality NLP model into R with minimal fuss. One can customize the model (e.g., use a larger en_core_web_trf transformer-based spaCy model for better accuracy, albeit slower).

  • OpenNLP: The openNLP package provides an interface to the Apache OpenNLP library (a Java-based NLP toolkit). It includes pre-trained models for English NER (person, organization, location, etc.). The usage involves initializing annotators and then annotating text. For example:

    library(NLP)       # provides as.String() and annotate()
    library(openNLP)   # the entity annotators also require the openNLPmodels.en package
    text <- as.String("Barack Obama was born in Hawaii.")
    sent_token_annotator <- Maxent_Sentence_Annotator()
    word_token_annotator <- Maxent_Word_Token_Annotator()
    person_annotator <- Maxent_Entity_Annotator(kind = "person")
    loc_annotator <- Maxent_Entity_Annotator(kind = "location")
    pipeline <- list(sent_token_annotator, word_token_annotator, person_annotator, loc_annotator)
    annotations <- annotate(text, pipeline)
    # Keep only the entity annotations and pull out the matching substrings;
    # each entity annotation carries its kind (person, location, ...) in its features
    entity_anns <- annotations[annotations$type == "entity"]
    entities <- text[entity_anns]

    OpenNLP’s models are somewhat older (maximum-entropy models) and may not match the accuracy of spaCy’s newer neural models, but they can still identify the major entity types. The toolkit runs locally through rJava (no Python dependency, though a Java installation is required) and is an option if one prefers not to rely on Python.

  • CleanNLP and UDPipe: The cleanNLP package provides a unified interface to various backends (including spaCy, Stanford CoreNLP, and UDPipe). UDPipe is particularly interesting as it provides pre-trained models for many languages (tokenization, POS, lemma, and dependency parsing, including NER for some languages). With cleanNLP, one can initialize a backend and get an annotated tbl_df in R. For instance:

    library(cleanNLP)
    cnlp_init_udpipe(model_name = "english-ewt")
    result <- cnlp_annotate("Barack Obama was born in Hawaii in 1961.")
    result$entity
    # result$entity is populated only by backends that perform NER (e.g., spaCy);
    # the UDPipe backend returns token-level annotations (POS, lemma, dependencies) instead

    Note, however, that UDPipe models do not include a named-entity recognizer – they cover tokenization, POS tagging, lemmatization, and dependency parsing – so for entity extraction cleanNLP should instead be initialized with the spaCy backend (cnlp_init_spacy()).

  • Custom rules and regular expressions: For very specific tasks with well-defined entity patterns (phone numbers, product codes, etc.), simple pattern matching is sometimes enough. One could use grepl() or the stringr package to find patterns that match an entity of interest. But for general NER (names of people, organizations, and so on), statistical models are far superior.

NER is useful for structuring text because it pulls out a level of information that can be treated as features or metadata. For example, after extracting entities, one could ask: how many persons are mentioned in this document? Is a particular organization mentioned across many documents? One can also use NER to enrich datasets – for instance, tagging locations in text and then linking them to latitude/longitude for geospatial analysis, or identifying product names in customer feedback and tallying sentiment for each product.

In R, after extracting entities (say using spacyr), we might integrate that with our tidy workflow. The entity_extract() from spacyr gives a data frame with columns for doc_id, sentence_id, entity text, and entity_type. We could join that back to our original data or count frequencies. For example, count how many times each PERSON is mentioned in a corpus, or filter sentences that contain a DATE entity.

A quick example: Suppose we have a vector of news headlines and we want to find all the companies (ORG) mentioned:

library(spacyr)
library(dplyr)
spacy_initialize(model="en_core_web_sm")
headlines <- c("Google acquires Fitbit for $2.1 billion",
               "Tesla releases new software update",
               "Apple and Microsoft battle in cloud computing",
               "UN General Assembly convenes amid pandemic",
               "Facebook faces antitrust investigation by FTC")
parsed <- spacy_parse(headlines, entity=TRUE)
entities <- entity_extract(parsed)
orgs <- entities %>% filter(entity_type=="ORG")
print(orgs)
# likely output:
# doc_id sentence_id    entity       entity_type
# 1      1           1    Google       ORG
# 1      1           1    Fitbit       ORG
# 2      2           1    Tesla        ORG
# 3      3           1    Apple        ORG
# 3      3           1    Microsoft    ORG
# 5      5           1    Facebook     ORG
# 5      5           1    FTC          ORG

We see the ORG entities extracted from each headline, effectively structuring the text by pulling out the company names. We could then tabulate these or use them as features (maybe a one-hot encoding indicating presence of a particular ORG in a document, etc., depending on task).

In evaluating NER results, one must remember that models can have errors – e.g., spaCy might label a common noun phrase as ORG by mistake or miss an entity with an uncommon name. But overall, NER adds a valuable structured layer: unstructured text → structured list of entities with types.

To conclude, NER in R is achievable either by leveraging powerful external libraries (through wrappers like spacyr or cleanNLP) or using built-in models (openNLP). It structures text by identifying proper nouns and other key named concepts and labeling them. This structured information can then feed into further analysis. For instance, one might use NER before topic modeling to replace entities with a placeholder (to avoid topics clustering by specific names), or conversely, use NER output to specifically analyze networks of people and organizations mentioned together in a corpus (as is done in news analytics). By converting raw text into a list of entities, we reduce complexity and can focus on relationships between those entities or their frequencies.

9.6 Topic Modeling in R

While preprocessing and the DTM representation structure text at the lexical level, topic modeling provides structure at a semantic or thematic level. Topic modeling is an unsupervised machine learning technique that discovers latent themes (topics) in a collection of documents. Each topic is a distribution over words, and each document is modeled as a distribution over topics. The most popular method for topic modeling is Latent Dirichlet Allocation (LDA), introduced by Blei, Ng, and Jordan (2003). LDA assumes that every document is a mixture of a small number of topics, and each topic is characterized by a set of words with certain probabilities. In essence, it tries to cluster words that frequently co-occur into topics, and assign documents to those topics in varying proportions.

To clarify this with Silge & Robinson’s summary: “Latent Dirichlet allocation (LDA) treats each document as a mixture of topics, and each topic as a mixture of words”. This means a topic model might find (for example) one topic about “politics” that has high probability for words like “government”, “election”, “president”, and another topic about “sports” with words like “team”, “coach”, “season”. A document about the President attending a basketball game might then be, say, 60% the politics topic and 40% the sports topic. Unlike clustering where a document belongs to one cluster, LDA allows documents to overlap topics, which is often more realistic for text.

Theoretical foundations: LDA is a generative probabilistic model. It posits that for each document, one chooses a distribution over K topics (randomly from a Dirichlet prior). Then for each word in the document, a topic is chosen according to that distribution, and then a word is generated from that topic’s word distribution. The inference process (fitting LDA) takes the observed words in documents and tries to infer the likely topics and their word probabilities that could have generated the corpus. The result of fitting LDA is typically:

  • φ (phi): a matrix of size K topics x V words, where φ_{k,w} = P(word w | topic k). Each row is essentially a probability distribution over the vocabulary (sums to 1).
  • θ (theta): a matrix of size D documents x K topics, where θ_{d,k} = P(topic k | document d). Each row is a distribution over topics for a document (sums to 1). Additionally, the model might learn hyperparameters for those Dirichlet distributions, but typically one focuses on φ and θ as the output.

Practical usage in R:

  • topicmodels package: The primary CRAN package for LDA is topicmodels, which provides an interface to variational and Gibbs sampling implementations of LDA (using code from the original authors/Blei’s group). One can create a DocumentTermMatrix (or use a quanteda dfm via conversion) and then call LDA(). For example:

    library(topicmodels)
    data("AssociatedPress", package = "topicmodels")  # a DTM of AP news
    lda_model <- LDA(AssociatedPress, k = 5, control = list(seed=1234))

    This would fit a 5-topic LDA model on the AP dataset. By default it uses a VEM (variational EM) algorithm. We can then examine the results:

    terms(lda_model, 10)  # top 10 terms per topic
    topics(lda_model, 5)  # the 5 most likely topics for each document

    But these base functions are limited. A better approach is to use tidytext’s tidy() on the model. As noted earlier, tidy(lda_model, matrix="beta") will give a tidy data frame of word probabilities per topic. tidy(lda_model, matrix="gamma") would give the document-topic probabilities (θ). This tidy output can be used to analyze and visualize the topics. For instance, we could filter beta for the top terms in each topic and plot them with ggplot2, or examine how documents distribute across topics.

  • quanteda: quanteda does not include its own LDA implementation; instead, a quanteda dfm can be converted to the format expected by topicmodels with convert(my_dfm, to = "topicmodels") and then passed to LDA(). (The companion seededlda package also offers textmodel_lda() for fitting LDA directly on a dfm.)

  • stm package: Another notable package is stm (Structural Topic Models), which extends LDA to include document-level covariates (allowing topics to correlate with external variables or to vary by groups). If one’s aim is purely topics, stm can be used similarly and also has tidy methods and visualization tools.

  • lda package: A somewhat lower-level package implementing collapsed Gibbs sampling for LDA. It’s less user-friendly and typically topicmodels is preferred now.
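
If you prefer the raw matrices over tidy output, the φ and θ matrices described earlier can be pulled directly from a fitted model with posterior(); a brief sketch, reusing the lda_model fitted above:

library(topicmodels)
post <- posterior(lda_model)           # lda_model fitted above
dim(post$terms)    # K x V matrix: P(word | topic), i.e. the phi matrix
dim(post$topics)   # D x K matrix: P(topic | document), i.e. the theta matrix
sort(post$terms[1, ], decreasing = TRUE)[1:10]   # ten most probable words in topic 1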

Interpreting topics: Once the model is fit, interpreting the topics is the crucial step. Typically:

  • Look at the top n words for each topic (words with highest φ_{k,w}). These give a sense of what the topic might be about. For example, one topic might show: “game, team, season, coach, play, win…” which clearly suggests sports. Another might show: “stock, market, shares, dollar, fund…” suggesting finance.
  • Sometimes, reading some example documents that have a high proportion of a given topic helps in labeling the topic.
  • One can also examine topic correlations (in models like CTM or via post-hoc correlation of θ columns).

One of the advantages of using tidytext with topicmodels is that it integrates topic modeling results into the tidy workflow for analysis. In Text Mining with R, an example is shown where chapters of different books are distinguished by their topics. They were able to see, for example, a topic that clearly corresponded to Shakespeare's King Henry plays (with words like "king", "crown", "thou", etc.) when applying LDA to a set of literary texts.

Choosing number of topics (k): This often requires experimentation or domain knowledge. One might use metrics like perplexity or coherence to pick k, or simply try several values and see which yields coherent, meaningful topics. There is no single correct number of topics; it depends on how granular you want the themes to be and on the diversity of the corpus.
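
One hedged sketch of the perplexity route is to fit several models and compare their fit on held-out documents (the split sizes below are arbitrary, and fitting several models can take a while):

library(topicmodels)
data("AssociatedPress", package = "topicmodels")
train_dtm   <- AssociatedPress[1:1800, ]
heldout_dtm <- AssociatedPress[1801:2246, ]

ks <- c(2, 4, 8, 16)
perp <- sapply(ks, function(k) {
  m <- LDA(train_dtm, k = k, control = list(seed = 42))
  perplexity(m, newdata = heldout_dtm)   # lower held-out perplexity = better fit
})
data.frame(k = ks, perplexity = perp)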

Example: Let’s illustrate with a small example in R. Suppose we take the built-in Associated Press dataset (which is news articles) and fit an LDA with k=4:

library(topicmodels)
library(dplyr)           # for group_by(), slice_max(), arrange()
data("AssociatedPress")  # DTM with 2,246 docs and 10,473 terms
ap_lda <- LDA(AssociatedPress, k = 4, control=list(seed=42))
library(tidytext)
ap_topics <- tidy(ap_lda, matrix = "beta")
top_terms <- ap_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 8) %>%
  ungroup() %>%
  arrange(topic, -beta)
print(top_terms)

The output might show something like:

# A tibble: 32 x 3
   topic term      beta
   <int> <chr>     <dbl>
 1     1 church    0.015
 2     1 catholic  0.010
 3     1 pope      0.009
 4     1 god       0.008
 5     1 percent   0.008
 6     1 life      0.007
 7     1 people    0.007
 8     1 church's  0.007
 9     2 stock     0.012
10     2 market    0.010
11     2 shares    0.009
12     2 prices    0.008
13     2 money     0.006
14     2 sales     0.006
15     2 company   0.006
16     2 billion   0.005
...

Topic 1 seems to be about religion/church, Topic 2 about stocks/markets, and so on. Each topic is characterized by its top words; by examining them, one can label topic 1 as "Religion", topic 2 as "Finance", etc. We also see some generic words like "percent" and "people" in topic 1, which may indicate a mix of themes or simply very common vocabulary; sometimes removing extremely common words beyond standard stopwords (like "percent") can sharpen the topics.
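
To visualize the top terms per topic, the tidy output plugs straight into ggplot2; a short sketch using tidytext's reorder_within() and scale_y_reordered() helpers to order bars within each facet:

library(ggplot2)
top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  scale_y_reordered() +
  labs(x = "beta = P(word | topic)", y = NULL)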

One can also examine document-topic assignments:

ap_documents <- tidy(ap_lda, matrix = "gamma")
head(ap_documents)

This might show, for each document (AP articles, which carry no labels in this dataset), the probability of each topic. If we had metadata (say each AP article had a date or a known category), we could correlate it with the topics (e.g., perhaps Topic 2, finance, is more prevalent in the business section).

Applications of topic modeling: It is used for discovering themes in large document collections where manual labeling is not feasible – for example:

  • Exploring research article abstracts to see what topics of research emerge.
  • Clustering customer feedback into themes.
  • Analyzing social media posts to identify prevalent discussion topics.
  • As a preprocessing step: using topic distributions (θ) as features for classification tasks, or using topic modeling to summarize and navigate a corpus.

Extensions: Topic modeling has many variations beyond vanilla LDA: Correlated Topic Models (allow topics to correlate), Dynamic Topic Models (topics over time), Hierarchical LDA, etc. In R, the stm package can model topics with covariates (like time or group), allowing one to see how topics’ prevalence changes with a covariate or how word usage differs by a document attribute, adding more structure to the analysis.

In conclusion, topic modeling provides a structured, higher-level view of text data by organizing the vocabulary and documents into topics. In R, with packages like topicmodels and tidytext, it is relatively straightforward to implement LDA and interpret the results using tidy data principles. Topic models do not require prior labeling of documents, which makes them an excellent exploratory tool for unknown corpora. They essentially add a layer of abstraction to the structured text: instead of dealing with thousands of word features, one can summarize documents by a handful of topics (the θ vectors). This thematic structuring complements other analyses like NER and sentiment analysis, which we will discuss next.

9.7 Sentiment Analysis Techniques

Sentiment analysis, also known as opinion mining, is the process of determining the sentiment or emotional tone behind a block of text. Typically, this involves classifying text along a polarity spectrum (positive, negative, neutral) or extracting more nuanced emotion categories (joy, anger, sadness, etc.). In practical terms, sentiment analysis turns unstructured text (like a product review, tweet, or news headline) into structured data by assigning sentiment labels or scores. For example, the movie review “I absolutely loved this film; it was fantastic!” would be labeled as positive sentiment, whereas “The film was a total waste of time and money.” would be negative.

Academically, Pang and Lee (2008) define sentiment analysis as classifying a given text into sentiment categories such as positive or negative. It can be done at different granularities: document-level (overall sentiment of a review), sentence-level, or aspect-level (sentiment towards specific aspects of an entity). The most common binary classification is positive vs. negative, sometimes with neutral as a third category for objectively worded text.

In R, there are several approaches and tools for sentiment analysis:

1. Lexicon-Based Sentiment Analysis

This approach relies on a predefined list of words (a lexicon) that are annotated with sentiment scores or categories. The sentiment of a text is derived from the sentiment values of the words it contains. This is a straightforward and interpretable method, though it has limitations in handling context, sarcasm, negation, etc.

Common sentiment lexicons:

  • Bing Liu’s lexicon: Categorizations of ~6800 words as positive or negative (no intensity). In tidytext, accessed via get_sentiments("bing").
  • AFINN: ~2500 words rated with an integer sentiment score from -5 (very negative) to +5 (very positive). Accessed by get_sentiments("afinn"). This allows a weighted sentiment.
  • NRC: Contains ~14,000 unigrams with binary flags for ten different sentiments (eight basic emotions like joy, anger, fear, and two sentiments: positive, negative). Accessed via get_sentiments("nrc").

Using tidytext, one can perform lexicon-based sentiment analysis by inner joining tokens with one of these lexicons. For example:

library(tidytext)
library(dplyr)
text <- c("I loved the new design, it is wonderful!", 
          "The update is bad and disappointing.")
tokens <- tibble(line=1:2, text=text) %>% unnest_tokens(word, text)
bing <- get_sentiments("bing")
tokens_sentiment <- tokens %>% inner_join(bing, by="word") 
tokens_sentiment
# line word        sentiment
# 1   loved       positive
# 1   wonderful   positive
# 2   bad         negative
# 2   disappointing negative

From this joined result, we can compute a simple score: document 1 had 2 positive words, 0 negative → overall positive; document 2 had 0 positive, 2 negative → overall negative. We might assign a sentiment label accordingly. With AFINN, one could sum the scores (e.g., “bad” = -3, “disappointing” = -2, total -5 indicates negative).
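
Continuing with the tokens data frame above, an AFINN-weighted score is one join and one sum (in tidytext's AFINN table the score column is named value; the first use may prompt a download via the textdata package):

afinn <- get_sentiments("afinn")
tokens %>%
  inner_join(afinn, by = "word") %>%
  group_by(line) %>%
  summarise(afinn_score = sum(value))
# positive totals suggest overall positive sentiment, negative totals the opposite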

The tidytext book demonstrates this on larger scales, for instance computing the sentiment score of each chapter of a novel by summing AFINN scores. Lexicon methods are easy to implement and fast. They work well when text is fairly straightforward, and domain-appropriate lexicons are used (e.g., social media may require including slang or emoticons in the lexicon).

Limitations:

  • Negation handling: The phrase "not good" would pick up the positive score of "good"; basic lexicon methods would label it positive unless negation is handled explicitly (a common strategy is to flip the sentiment of words that appear within a short window after a negator such as "not"; see the bigram sketch below).
  • Sarcasm/irony: “Great, another delay.” has “great” (positive word) but clearly is negative; lexicon might misfire.
  • Context/domain: Lexicons are generic; domain-specific usage (e.g., “cold” in medical context might be neutral noun, but lexicon might see it as negative adj).
  • Composition: “good” vs “very good” vs “not very good” all have different intensities not fully captured by just adding word scores.

Despite these, lexicon approaches often serve as a baseline and are useful for quick analyses.
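
One simple mitigation for the negation problem, following the bigram approach in Text Mining with R, is to tokenize into bigrams and inspect sentiment-bearing words preceded by a negator; a minimal sketch (the example sentence is hypothetical):

library(tidyr)
negation_demo <- tibble(id = 1,
                        text = "The plot was not good but the acting was not bad at all.")
negation_demo %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word1 == "not") %>%
  inner_join(get_sentiments("afinn"), by = c("word2" = "word"))
# each row pairs a negator with a scored word ("not good", "not bad");
# a common fix is to flip the sign of 'value' for these occurrences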

There are also R packages focusing on lexicon methods:

  • syuzhet package (by Jockers) which includes multiple lexicons (NRC, Bing, AFINN, etc.) and some convenient functions to get sentiment by sentence. It’s known for the “syuzhet” method of mapping sentiment across narrative time (for literature).

  • sentimentr package (by Rinker) which improves upon lexicon by accounting for valence shifters (negations, amplifiers, de-amplifiers) in a consistent way. It computes sentiment at the sentence level with some smarter rules. For example, sentimentr would catch that “not good” flips the valence of “good”. Using sentimentr is relatively straightforward:

    library(sentimentr)
    sentiment(c("I am not happy", "This is extremely good"))
    # it will output a data frame with sentiment scores (around 0 to 1 positive, 0 to -1 negative typically).

    sentimentr uses a lexicon with valence adjustments and yields a continuous score.

2. Machine Learning (Supervised) Approaches

Another approach is to treat sentiment analysis as a text classification problem. This requires a labeled dataset (e.g., a collection of movie reviews each labeled positive or negative). One then uses a machine learning model to learn which words, phrases, or other features correlate with positive or negative sentiment. Classic algorithms for this include:

  • Naive Bayes classifier: often used for its simplicity and surprisingly good performance on text. For sentiment, one would calculate the probability of each word given positive vs negative classes and use Bayes’ theorem to classify new documents. (Quanteda has textmodel_nb() that can do this given a dfm and labels, or one can do it manually).
  • Support Vector Machines (SVM): with appropriate kernels, SVMs have historically performed well for text classification, including sentiment. Research suggests that SVMs often outperform Naive Bayes in text classification by better handling high-dimensional sparse data, though NB remains competitive considering its simplicity.
  • Logistic Regression (often with L1 or L2 regularization, a.k.a. Maximum Entropy model in NLP terminology): This is also widely used for sentiment. R’s glmnet package (for regularized regression) can handle large sparse matrices and could be used to train a logistic model on TF–IDF features for sentiment.
  • Tree-based models and Ensembles: Random forests, gradient boosting (xgboost) can be applied as well. But linear models (NB, SVM, logistic) typically suffice and are more interpretable for text.

For instance, using quanteda, one could do:

library(quanteda)
library(quanteda.textmodels)
# Suppose we have a data frame 'reviews' with columns 'text' and 'sentiment' (pos/neg)
dfm_reviews <- reviews$text %>%
                tokens(remove_punct=TRUE, remove_numbers=TRUE) %>%
                tokens_remove(stopwords("en")) %>%
                dfm()
# split into training and test sets
set.seed(123)
train_id <- sample(seq_len(ndoc(dfm_reviews)), size = floor(0.8 * ndoc(dfm_reviews)))
dfm_train <- dfm_reviews[train_id, ]
dfm_test  <- dfm_reviews[-train_id, ]
true_train_labels <- reviews$sentiment[train_id]
true_test_labels  <- reviews$sentiment[-train_id]

nb_model <- textmodel_nb(dfm_train, y = true_train_labels)
pred <- predict(nb_model, newdata = dfm_test)
confusion_matrix <- table(Predicted=pred, Actual=true_test_labels)

This yields a Naive Bayes classification model. We could also use textmodel_svm() (a linear SVM via the LiblineaR backend) or the caret package to try different algorithms.
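
From the confusion matrix computed above, overall accuracy is a single line of base R (caret's confusionMatrix() would additionally report precision, recall, and related metrics):

accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy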

The advantage of supervised models is that they can learn domain-specific sentiment indicators (particular jargon or multi-word expressions). They can also handle negation naturally if such patterns are frequent and distinguishable (e.g., "not good" may appear often in negative reviews, and the model learns it as a feature if bigrams or a suitable representation are used). However, supervised learning requires labeled data, which might not be available or might be expensive to obtain for every domain.

3. Advanced (Neural) Approaches

In recent years, state-of-the-art sentiment analysis often uses neural network models and embeddings:

  • Word embeddings (Word2Vec, GloVe) fed into an LSTM or CNN capture sequence information and context, and such models often outperform traditional bag-of-words approaches.
  • Fine-tuning pretrained language models like BERT (Devlin et al., 2019) on a sentiment classification task has become a dominant approach. BERT already encodes a lot of linguistic knowledge, and when fine-tuned on a sentiment dataset (like SST or IMDB), it achieves very high accuracy.

In R, one can access these through:

  • reticulate to use Python libraries (like Hugging Face transformers or TensorFlow). For example, using reticulate to import a transformers pipeline for sentiment (as was demonstrated in the Posit AI Blog: BERT from R).
  • The text package and the transformers R package (as mentioned before). The text package lets you use Hugging Face transformer models to obtain embeddings and even perform classification, and tutorials describe a transformers R package that likewise provides access to Hugging Face models, so sentiment analysis can be implemented in R with minimal direct Python coding. For instance, the text package offers textClassify(), or one can use textEmbed() to get embeddings and then train a model with textTrain().
  • keras and torch in R: R interface to Keras/TensorFlow can build and train neural nets. One could, say, tokenize text and feed into an LSTM using keras in R. There are examples of doing an IMDB review classifier in R with Keras (the keras package has an example built-in for text classification).

Using such advanced models often significantly boosts accuracy. For example, a BERT fine-tuned on a large movie review corpus can achieve > 95% accuracy in distinguishing positive vs negative reviews, which is far above a simple lexicon approach. The cost is complexity and requiring more computational resources. The text package was specifically created to make transformers accessible in R for such use cases.

Example:

To illustrate a simple sentiment classification, consider using a lexicon to analyze sentiment of some sentences:

library(tidytext)
library(dplyr)
sentences <- tibble(id = 1:3, text = c("I love this product, it's amazing!",
                                      "This is the worst purchase I've ever made.",
                                      "It's okay, not great but not terrible."))
sentiment_words <- sentences %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by="word") %>%
  count(id, sentiment) %>%
  tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)
sentiment_words <- sentiment_words %>% 
  mutate(sentiment = case_when(positive > negative ~ "positive",
                               negative > positive ~ "negative",
                               TRUE ~ "neutral"))
print(sentiment_words)
# id positive negative sentiment
#  1        2        0 positive
#  2        0        2 negative
#  3        1        1 neutral   (likely: "great" scores positive, "terrible" negative; both "not"s are ignored)

We see sentence 1 classified as positive (correct) and sentence 2 as negative (correct). Sentence 3 likely ends in a tie between one positive word ("great") and one negative word ("terrible"), so our rule labels it neutral; that happens to be a reasonable verdict, but for the wrong reason, since the method simply ignores both occurrences of "not" (exact counts depend on lexicon coverage). Had the sentence been only "not great", it would have been scored positive. More advanced methods, or lexicons and rules that operate on bigrams, could recognize that "not great" is negative and "not terrible" is a mitigating positive.

In practice:

  • If just a quick analysis needed on social media or reviews, lexicon may suffice to get an aggregate sense (like average sentiment over time or fraction of negative vs positive).
  • For applications like customer feedback classification, training a model using historical labeled data (if available) yields better precision. One might use caret or tidymodels (the tidymodels suite in R can handle text features via recipes: e.g., step_tokenize, step_tf or step_tfidf to incorporate text into a modeling pipeline).

To sum up, sentiment analysis turns unstructured text into structured sentiment data – either a categorical label or a numeric score – which can then be used to draw insights (e.g., “80% of tweets about our brand this week were positive”). R offers multiple techniques: lexicon-based (quick and interpretable, but needing careful handling of context) and supervised machine learning (requiring data but potentially more accurate). With the integration of advanced models (like BERT via reticulate or the text package), R users can also leverage cutting-edge NLP for sentiment tasks. Indeed, using Hugging Face transformers in R has been demonstrated for sentiment analysis, bringing modern deep learning capabilities into our R workflow.

Having extracted sentiments, one could combine this with other structured data (e.g., see if negative reviews correlate with certain product categories, or map sentiment geographically if texts are tied to locations). The final piece we discuss is text classification in a broader sense, of which sentiment is a special case.

9.8 Text Classification and Categorization

Text classification is a broad term for assigning categories or labels to text based on its content. Sentiment analysis (positive/negative) is one example of text classification. Other examples include topic labeling (classifying news articles into topics like sports, politics, tech), spam detection (spam vs ham emails), authorship attribution, language identification, or any prediction where input is text and output is a category.

Structuring unstructured text via classification means training a model that can automatically tag new text documents with the appropriate label, effectively adding a new structured attribute (the predicted class). In an academic context, this involves feature extraction from text (often the DTM/TF–IDF features we discussed) and applying supervised learning algorithms.

Common approaches & algorithms: As mentioned, Naive Bayes and SVM are widely used for text classification due to their effectiveness on high-dimensional sparse data. Naive Bayes is particularly popular as a baseline because of its speed and reasonable accuracy despite assuming word independence (a simplification that nonetheless works decently in practice). SVMs tend to perform very well, often outperforming NB, especially with proper regularization and kernel choice. For instance, an SVM with a linear kernel does something similar to logistic regression but with a max-margin criterion; it handles many correlated features well via regularization.

Logistic regression (a.k.a. MaxEnt classifier in NLP) is also commonly used. In fact, a well-tuned logistic regression with L2 regularization on TF–IDF features is often a strong baseline. Many Kaggle competitions or studies have shown that these “shallow” models can nearly match more complex neural networks on moderate-sized datasets, especially if one uses n-gram features and maybe additional engineering (like capturing word position or simple negation handling).

Feature engineering: Besides raw word counts or TF–IDF, one can enrich features for classification:

  • N-grams: Including bigrams or trigrams can capture short phrases or context that single words miss (like “United States” vs “United” separately, or “not good” as a bigram feature).
  • Parts of speech tags or syntax patterns: Sometimes useful in tasks like authorship or formality classification.
  • Domain-specific cues: e.g., number of exclamation points as a feature for enthusiasm in sentiment, presence of certain domain-specific jargon for topic classification.
  • Metadata: If available, e.g., the length of the text, time of writing (some tasks combine textual and non-textual features).

The R ecosystem allows this kind of feature engineering with recipes in tidymodels or manually with a dfm in quanteda (e.g., tokens_ngrams() to add n-gram features, as sketched below).
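
A brief quanteda sketch of adding bigram features alongside unigrams (the two toy sentences are chosen purely for illustration):

library(quanteda)
toks <- tokens(c("not good at all", "very good service"), remove_punct = TRUE)
toks_ng <- tokens_ngrams(toks, n = 1:2)   # keep unigrams and add bigrams such as "not_good"
dfm(toks_ng)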

Model training and evaluation: One typically splits data into training and test sets (or uses cross-validation). In R, caret (Classification And Regression Training) or the newer tidymodels (specifically parsnip, workflows, tune, etc.) can streamline trying different models with text data. For example:

# Using tidymodels for text classification:
library(tidymodels)
library(textrecipes)   # provides step_tokenize(), step_stopwords(), step_tfidf()
text_rec <- recipe(label ~ text, data = training_data) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tfidf(text)                       # converts tokens to TF-IDF features
lr_spec <- logistic_reg(penalty = 0.01, mixture = 1) %>% set_engine("glmnet")
text_wf <- workflow() %>% add_recipe(text_rec) %>% add_model(lr_spec)
text_fit <- text_wf %>% fit(data = training_data)
text_preds <- predict(text_fit, new_data = test_data) %>% bind_cols(test_data)
metrics(text_preds, truth = label, estimate = .pred_class)

This tidy pipeline handles all the steps in one go. Under the hood, step_tfidf() creates a sparse matrix of TF–IDF features, which glmnet then uses. One can easily swap logistic_reg() for, e.g., svm_rbf() for an SVM (given a suitable engine, such as kernlab).

Performance considerations: High-dimensional features (tens of thousands of terms) can slow down training for some models, but linear models scale well (glmnet, Liblinear SVM). It’s often crucial to trim the vocabulary (remove very rare terms) to reduce dimensionality with minimal impact on accuracy.

Example scenario: A classic dataset is the 20 Newsgroups dataset (posts from 20 different Usenet newsgroups, labeled by topic). A text classifier can be trained to predict which of the 20 newsgroups a given post belongs to; 80–90% accuracy is feasible with SVM or logistic regression on this data. Another is spam detection with the SMS Spam Collection dataset (SMS messages labeled ham or spam); a simple NB model can reach roughly 97% accuracy there. In R, one could implement this in a few lines using quanteda or tidytext to build a DTM and then e1071 (for SVM) or quanteda.textmodels, as sketched below.
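
A hedged sketch of the SMS spam case, assuming a hypothetical data frame sms with columns text and label ("ham"/"spam"); the exact accuracy will depend on the split and preprocessing:

library(quanteda)
library(quanteda.textmodels)
# assumed: a data frame 'sms' with columns 'text' and 'label' ("ham"/"spam")
dfm_sms <- dfm(tokens(sms$text, remove_punct = TRUE, remove_numbers = TRUE))
dfm_sms <- dfm_trim(dfm_sms, min_termfreq = 2)      # drop very rare terms
set.seed(1)
train_id <- sample(seq_len(ndoc(dfm_sms)), size = floor(0.8 * ndoc(dfm_sms)))
nb_fit <- textmodel_nb(dfm_sms[train_id, ], y = sms$label[train_id])
pred   <- predict(nb_fit, newdata = dfm_sms[-train_id, ])
mean(pred == sms$label[-train_id])                  # overall accuracy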

Using advanced models: Just like with sentiment, one can employ deep learning for classification if needed. Fine-tuning BERT or other transformers on multi-class tasks can yield excellent results. For example, classifying news articles by topic using BERT might reach very high accuracy and also allow use of pre-trained knowledge (like understanding that “stocks” and “Wall Street” imply Finance category, etc., even if “Wall Street” never appeared in the training set explicitly, because BERT knows it from pretraining).

The reticulate and text package approach: One could import a transformers pipeline for text classification:

library(reticulate)
transformers <- import("transformers")
classifier <- transformers$pipeline("sentiment-analysis")  # or another model type
classifier("The movie was absolutely wonderful!") 
# Returns something like {'label': 'POSITIVE', 'score': 0.98}

This example shows sentiment, but the same pattern applies to any classification task for which a suitable (or externally fine-tuned) model exists. The text package specifically provides textTrain() to train models on embeddings; its authors demonstrate using it for numeric prediction and classification tasks, all within R but leveraging Python under the hood for the heavy lifting.

Connected to structuring text data: When we successfully train a text classifier, we essentially have a function that maps unstructured input (text) to a structured label. This can be used to automatically tag large volumes of documents. For example, feeding a million customer reviews through a sentiment classifier adds a “sentiment_score” or “sentiment_label” column to that dataset, which can then be aggregated or correlated with other data (e.g., do certain products get more negative reviews? Did sentiment improve after a certain date?). Similarly, classifying support tickets by topic can route them to the appropriate department – here the model adds a structured topic category to each ticket.

It’s worth noting that in multi-class classification tasks with many classes (like topic labeling with dozens of possible labels), simpler models might struggle if classes have overlapping vocabularies, whereas more complex ones (like neural networks) might capture subtle patterns better. But the trade-off is complexity and the need for more data.

Accuracy considerations:

  • Naive Bayes, while often used, makes independence assumptions that might not hold, and can be less accurate than discriminative models like SVM or logistic regression. Yet, NB is extremely fast and has low variance (especially good when data is limited – it doesn’t overfit easily due to assumptions).
  • SVMs and logistic regression with regularization are robust choices; SVMs have frequently been used in academic text classification benchmarks (even beating some older neural networks pre-2012). The research cited earlier suggests SVM outperforms NB in most text classification settings, though NB remains widely used because of its simplicity.
  • Ensemble approaches (combining NB and SVM, as in the biomedical example mentioned earlier) can sometimes yield slight improvements: NB may get some documents right that the SVM misses, and vice versa.

To conclude, text classification is a powerful way to structure text by predictive labeling. R provides multiple paths: from quick lexicon-based classification to comprehensive machine learning pipelines. The structured outcome (predicted class labels or probabilities) can then be integrated into decision-making or further data analysis. For instance, one might find that support tickets classified as “Login Issue” have a higher resolution time – guiding process improvements. Or, classifying news articles by sentiment and topic can allow an economist to quantify “negativity in financial news” over time and correlate it with market indicators. By converting text into labels or scores, we simplify and condense the unstructured content into actionable information.

9.9 Integrating Advanced Language Models (BERT and Transformers in R)

Recent advances in NLP have been dominated by large pre-trained language models such as BERT, RoBERTa, GPT, and XLNet, based on the Transformer architecture (Vaswani et al., 2017). These models are trained on massive corpora and capture rich linguistic and world knowledge. By integrating them into our R workflow, we can significantly enhance the structuring and analysis of text data, especially for tasks that were traditionally challenging (like understanding context or detecting sarcasm). BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. (2019), achieved state-of-the-art results on a wide range of NLP tasks by using a deep bidirectional transformer that reads text contextually in both directions.

Why use transformers in R? Utilizing models like BERT can provide:

  • Better feature representations: Instead of simple bag-of-words, BERT can provide contextual embeddings for words or entire sentences. These embeddings often cluster by semantic meaning. For example, BERT might map “bank” in “river bank” vs “bank account” to different vector representations, whereas bag-of-words cannot disambiguate those.
  • Pre-trained knowledge: Transformers come pre-trained on enormous datasets (e.g., BERT was trained on Wikipedia + BooksCorpus). This knowledge can be fine-tuned to specific tasks with relatively smaller data. They effectively serve as off-the-shelf engines for tasks like question answering, NER, summarization, etc., with minimal additional training.
  • Improved accuracy on many tasks: For classification, sentiment, NER, etc., fine-tuned transformers often outperform traditional models by a large margin.

How to integrate in R:

  1. Reticulate to call Python libraries (Hugging Face Transformers, TensorFlow, PyTorch): This approach treats Python as a backend. For example, using reticulate you can import the transformers library and use pipelines as shown earlier. The RStudio TensorFlow team has shown examples of using BERT via keras and reticulate. In the “BERT from R” blog post, they demonstrate loading a Keras implementation of BERT, then calling reticulate::import('keras_bert') and constructing the model, then using it for classification. They essentially leveraged Python code within R to train a model on text classification.

    • One can also directly call a HuggingFace pipeline: e.g. classifier <- transformers$pipeline('sentiment-analysis'), then classifier(c("I love this.", "This is bad.")) returns labeled results. Similarly, a NER pipeline can be called. The bottleneck is moving data between R and Python, but for moderate sizes it’s fine.
  2. The text R package: As we explored, the text package by Kjell et al. (2023) provides a high-level R interface to Hugging Face transformer models through reticulate and torch. It allows one to transform texts to embeddings (using models like BERT) with a single function call textEmbed(), then do various analyses on those embeddings. For example:

    • textEmbed() will return a matrix of embeddings for each text (one row per input text, columns being embedding dimensions, typically 768 for base BERT).
    • One can then do textTrain() on those embeddings with a vector of outcomes (for regression or classification), which under the hood might train a simple neural network or other model to predict from embeddings.
    • textPredict() then can predict on new data with the trained model.
    • The package also includes textSimilarity() or textDistance() to compute semantic similarity between texts by their embeddings, which is useful for information retrieval or clustering.
    • There are visualization helpers like textProjectionPlot() that can help visualize how words or texts relate in the embedding space. The text package essentially wraps the heavy computation such that an R user doesn’t need to write Python code; it downloads models from Hugging Face and uses them. For instance, by default it might use “bert-base-uncased” model for English if asked.

    The abstract of the text-package paper highlights that it “provides user-friendly functions tailored to test hypotheses… for both relatively small and large data sets” and is both modular and end-to-end. It is designed for social scientists to easily leverage transformers without deep technical overhead. The core functions (textEmbed, textTrain, textSimilarity, etc.) encapsulate typical use-cases: embedding texts, building predictive models, and measuring semantic similarity/distance. The inclusion of these advanced methods in R greatly expands what we can do.

  3. ONNX or others: Another option is to export a pre-trained transformer model to ONNX (Open Neural Network Exchange) and run it from R via an ONNX runtime, avoiding Python entirely. This requires more setup (exporting the model and installing a runtime) and is not yet common.

  4. torch for R: The torch package (by MLVerse) is an R native interface to libtorch (PyTorch C++ backend). There’s now torchtransformers and related efforts to allow loading transformer models directly in R. This is a developing area. But conceptually, one could load a pre-trained BERT in torch in R and do predictions. It’s lower-level than the text package, which already does a lot for you, but it’s an emerging alternative for those who want to avoid Python.

Use cases in this chapter context:

  • If performing classification, instead of using TF–IDF + SVM, one could fine-tune BERT on the labeled data. Some R examples (on blogs and Kaggle) show how to do this with reticulate or keras. The text package's documentation indicates that textTrain() trains predictive models with embeddings as input; if those embeddings come from BERT (e.g., produced by textEmbed()), this uses BERT as a feature extractor and trains a smaller model on top, rather than fine-tuning BERT itself. Full fine-tuning (updating BERT's weights) would likely require dropping down to reticulate or keras.

  • For NER: Instead of spaCy’s static model, one could use Hugging Face’s pipeline("ner") which often is a BERT-based model for NER and might yield even better results or more fine-grained entity types (like distinguishing PERSON vs NORP (nationalities) etc.). The usage is similar via reticulate, and the output would be entity text with labels and confidence scores.

  • For similarity and clustering of documents: rather than relying on LDA or raw TF–IDF cosine similarity, using BERT’s sentence embeddings (like Sentence-BERT) can yield semantic similarities (e.g., “the cat sits on the mat” vs “a kitten is on the rug” are considered similar by embeddings, which TF–IDF might not realize if words differ). The text package’s textSimilarity() can compute this with BERT embeddings easily.

  • For multi-lingual text: Many transformers are multilingual (e.g., XLM-RoBERTa, mBERT). The same pipeline can then handle languages beyond English without separate models for each – a huge advantage if working with international data.

One must note the computational considerations: Transformers are heavy. In R, when using reticulate, it’s essentially Python doing the heavy work, which is fine if properly installed. Memory and possibly GPU usage (if configured) come into play for large data. But for moderate tasks or using cloud services, it’s become manageable. The CRAN text package likely relies on having Python ≥ 3.6 and the transformers library installed and will handle the rest, as indicated by its CRAN description requiring Python and torch.

A quick demonstration with the text package (assuming we have it and the required environment):

library(text)
# Suppose we have a data frame 'sentences_df' with a text column 'sentence' and an outcome column 'label' (e.g., a 1-5 sentiment rating)
embeddings <- textEmbed(sentences_df$sentence)
# 'embeddings' now contains the text embeddings (perhaps average of token embeddings or [CLS] token from BERT).
# We can then cluster or classify using these embeddings.
# If classification:
model <- textTrain(embeddings, y = sentences_df$label)  # textTrain will create a model (could be an MLP or so)
preds <- textPredict(model, new_data = textEmbed(new_sentences))

All internal details are abstracted away, which is convenient.

Lastly, the Heartbeat article mentioned earlier refers to an R package named transformers that wraps Hugging Face (likely a thin reticulate wrapper); it is listed as a requirement alongside reticulate and tokenizers and presumably simplifies access to the Python library.

The state of using BERT in R as of 2025: It’s quite feasible and increasingly user-friendly:

  • Data scientists can remain mostly in R, using recipes and known modeling infrastructure while plugging in embeddings from transformer models.
  • The results are state-of-the-art: e.g., using BERT for classification might boost accuracy significantly compared to classical methods, as shown by many research benchmarks.
  • The integration is endorsed by the R community (the RStudio/Posit team) with blog posts and package development, indicating it’s a recommended path for serious NLP tasks.

In summary, integrating advanced models like BERT enables R users to elevate their text analysis from counting words to truly understanding text in context. Tools like reticulate and the text package act as bridges to the sophisticated transformer models. This synergy allows one to benefit from Python’s NLP advancements while still conducting analysis, visualization, and reporting in R. By using these models, unstructured text can be structured and analyzed in ways that were not previously possible with older methods – capturing nuance, context, and deep semantic relationships.

The rest of this chapter includes examples demonstrating how these advanced techniques can be applied to real-world text datasets, illustrating the complete process of structuring unstructured text data in R from start to finish.

9.10 Real-World Examples and Datasets

To ground the concepts discussed, we consider a few real-world scenarios and datasets where structuring unstructured text in R proves invaluable. These examples demonstrate the end-to-end application of preprocessing, structuring, and analyzing text using the techniques covered:

1. Movie Reviews Sentiment Classification (IMDb Reviews Dataset): Dataset: The IMDb movie reviews dataset (Maas et al., 2011) contains 50,000 movie reviews labeled positive or negative, commonly used for sentiment analysis benchmarks. Task: Predict whether a review is positive or negative (binary classification). Procedure: Using a combination of approaches:

  • Preprocess reviews by removing HTML tags (the dataset has some), lowercasing, perhaps keeping negations (“not”, “never”) as is (or even bigram “not_good” approach).
  • Tokenize and remove stopwords (though some argue to keep negations and intensifiers for sentiment tasks).
  • Create a Document-Term Matrix with TF–IDF weighting.
  • Train a classifier: e.g., a logistic regression with regularization or an SVM. Evaluate on a held-out test set.
  • Result: In literature, a simple Bag-of-Words + logistic regression can exceed 85% accuracy. With tuning (bigrams, etc.) ~90%. Using BERT fine-tuning, one can achieve ~95%+. For instance, a fine-tuned BERT model on this dataset would involve using reticulate to load transformers and then training (which could be done in a few lines with the transformers Trainer API or using the text package to get embeddings and train a smaller model).
  • Outcome: The unstructured text (movie review content) gets structured into a sentiment label or even a probability of positivity for each review. Researchers can then analyze, say, which words were most indicative of positive vs negative in the model (using coefficients from logistic regression or SHAP values for complex models). One might also aggregate by movie to see overall reception or correlate with box office success.

2. News Topic Modeling and Trend Analysis (Reuters or New York Times Corpus): Dataset: The Reuters-21578 news articles (a classic dataset with articles labeled by topics like earn, acq, grain, etc.), or a more modern one like a large set of New York Times articles with categories. Reuters has multiple topics per article sometimes, but one can focus on single-label subset (like “ModApte” split). Task: Discover topics in news or classify news by topic. Procedure (Topic Modeling):

  • Compile the corpus of news articles. Preprocess by removing boilerplate, stopwords; possibly stem or not depending on needs (for topics, stem could help merge variants).
  • Use LDA to find, say, 10 topics. Examine top words per topic and assign labels (“Earnings/Business”, “Sports”, “Politics”, etc.).
  • Check if these topics align with known categories or if they reveal subtopics. For example, an LDA on NYT articles might separate international politics vs domestic politics, or tech business vs financial news.
  • One can plot the prevalence of each topic over time if timestamps are present. E.g., topic about “election” spikes during election years. This turns unstructured text into a structured time-series of topic proportions.
  • Procedure (Classification): If labels exist (like Reuters categories), one can train a multiclass classifier. For instance, classify each article into one of the top 5 Reuters categories. A one-vs-rest logistic regression or linear SVM could be used. Multi-label classification (if one article can have multiple topics) is more complex; one might train separate classifiers per label.
  • Outcome: Using LDA, we structured the text into topics per document. Using classification, we add a category label to each article. This makes it possible to, say, filter all articles about “grain” or to quantify that, e.g., 30% of Reuters news in a certain month was about earnings (business). The structured data can be used for downstream tasks like recommendation (suggest related articles of same topic) or content analysis in social sciences (e.g., measuring media attention on topics over time).

3. Customer Complaint Analysis (Twitter Airlines Sentiment or Consumer Complaints): Dataset: A well-known dataset is the Twitter US Airline Sentiment, where tweets directed at airlines are labeled positive, negative, or neutral. Another is the Consumer Complaint Database (US CFPB) where complaints are text describing issues with financial products, often categorized by issue and product. Task: Structure the tweets or complaints by sentiment and key issues. Procedure:

  • Sentiment: For tweets, one could directly apply a pretrained sentiment model (like a RoBERTa sentiment analyzer via text::textEmbed + textTrain or huggingface pipeline). Or train a model on the provided labels. Since tweets are short, lexicon methods might also be okay. Indeed, the Twitter airline data was often used to benchmark lexicon vs ML.
  • NER & Aspect Extraction: Perhaps use NER to find airline names, locations, etc. Many tweets mention flight numbers or airports (like “AA123”, “JFK”). A custom NER or regex could tag those.
  • Topic/Issue classification: For consumer complaints, one could use topic modeling to see emergent issues (e.g., “customer service”, “billing error”, “fraud”). Or use the provided structured categories as a supervised label.
  • The tidytext framework can be very helpful in analyzing common words by sentiment category (like what words are most associated with negative tweets? Possibly “delayed”, “cancelled”). Or bigrams analysis (common phrases).
  • Outcome: We transform a mass of tweets into a summary: e.g., 60% negative, top complaint topics are “late flight”, “lost luggage”. For the Consumer Complaints, one can output something like: complaints about “mortgage” often involve the topic “loan modification” and have a certain polarity. This structured insight can help companies prioritize issues.
  • On the modeling side, one might achieve, say, ~80% accuracy in classifying tweet sentiment with a simple model, and higher with transformer. The structured output can be visualized (like bar charts of sentiment counts or time-series of sentiment over a day).

4. Academic Literature Mining (Research Paper Abstracts – NER + Classification): Dataset: Consider a collection of scientific paper abstracts (like from arXiv or PubMed). Suppose we want to structure them by identifying key entities and classifying their research area. Task: Extract named entities like chemical names, diseases, or author affiliations from abstracts, and classify papers into fields or detect if they propose a new method vs application. Procedure:

  • Use NER (maybe domain-specific, e.g., sciSpacy or BioBERT NER via reticulate) to extract entities: gene names, chemical compounds, algorithms, etc.
  • Use a classification model to label each abstract by field (perhaps using journal info or an existing taxonomy).
  • Alternatively, use unsupervised topic modeling to see groupings of research topics (maybe topics correspond to subfields).
  • Example: On PubMed abstracts, an LDA might separate topics like “cancer research”, “cardiology”, “neurology”, “bioinformatics”, etc. Entities extracted would include specific proteins, gene IDs, etc., which are structured data that could be linked to databases.
  • Outcome: Unstructured abstracts become enriched data: each abstract has a set of entities (structured as, say, key = ‘Protein’, value = ‘TP53’), and possibly a predicted category. This could feed into a knowledge base or search index. For instance, one can query “papers that mention BRCA1 gene” easily after this structuring. Or track how many papers per year mention a given technique.

5. Multi-language Social Media Analytics: Dataset: Suppose we have product reviews in English, Spanish, and French. We want an overall sentiment score for each, but don’t have labeled data in all languages. Approach: Use a multilingual transformer (like multilingual BERT or XLM-R) via the text package or reticulate, to predict sentiment for all reviews, regardless of language. Because these models handle multiple languages, we don’t need separate pipelines. Alternatively, translate all to English (using an API) then use an English model, but that introduces errors and cost. The multilingual model approach keeps it unified.

  • Implementation: the Hugging Face model nlptown/bert-base-multilingual-uncased-sentiment predicts 1-5 star sentiment across many languages. With reticulate, you could load it via transformers$pipeline("sentiment-analysis", model = "nlptown/bert-base-multilingual-uncased-sentiment") and run it on all texts, as sketched after this list.
  • The output is a rating or class per text, which is the structured result.
  • Then one could compare sentiment distribution by language or region, etc.
  • This shows the power of advanced models combined with R’s data handling: ingest raw text in different languages, output a structured numeric sentiment score or label for each.
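
A minimal reticulate sketch of this multilingual approach (the example sentences are made up; the first call downloads the model named above from the Hugging Face hub):

library(reticulate)
transformers <- import("transformers")
multi_sent <- transformers$pipeline(
  "sentiment-analysis",
  model = "nlptown/bert-base-multilingual-uncased-sentiment")
multi_sent(c("Ce produit est excellent",           # French
             "El servicio fue terrible",           # Spanish
             "Average product, nothing special"))  # English
# each result carries a star-rating label (e.g. "5 stars") and a confidence score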

Evaluation and References: Throughout these examples, we see academic references and evidence for the effectiveness of methods:

  • The notion that “Naïve Bayes classifier is widely used for text classification due to its simplicity, efficiency, and speed”, but SVM often yields higher accuracy, supports why one might start with NB for a quick baseline and then move to SVM or BERT for better performance.
  • The introduction of quanteda and tidytext in academic work gives credence to their use in scholarly analysis, e.g., political science research analyzing big text corpora using these tools.
  • The text-package publication in Psychological Methods highlights how transformers can be applied for social science text analysis, validating our advanced integration approach.

By applying the strategies from this chapter, researchers and practitioners can convert raw text datasets into structured forms that yield insights. The combination of classical methods (DTMs, TF–IDF, LDA) and modern deep learning (BERT, transformers) provides a comprehensive toolkit. Importantly, these techniques do not exist in isolation: an analysis might begin with exploratory lexicon sentiment analysis to get a quick sense, then proceed to train a more robust model for deployment; or use topic modeling to suggest categories which are then validated or used as features in a supervised model.

In conclusion, structuring unstructured text data in R involves a pipeline of theoretical understanding (knowing why to tokenize, normalize, etc.) and practical application (using R packages to implement each step). With the knowledge from this chapter, one can take virtually any collection of text – be it social media posts, customer feedback, academic articles, or news – and transform it into structured data: tokens, entities, sentiment scores, topic assignments, and more. This structured data can then be analyzed like any other quantitative data, unlocking the vast information contained in text for insights and decision-making.

9.11 Conclusion

Unstructured text data, ubiquitous in today’s digital age, contains a wealth of information that can be unlocked through careful processing and analysis. In this chapter, we have detailed a comprehensive roadmap for structuring unstructured text data in R, bridging theoretical foundations with hands-on techniques and tools. We began by discussing the nature of unstructured text and the challenges it poses – high volume, noise, context sensitivity – underlining why structuring is necessary for meaningful analysis.

We then delved into preprocessing, the critical first stage where text is cleaned and standardized. Techniques such as tokenization were defined and demonstrated, highlighting how breaking text into tokens provides the units for further analysis. We saw that R’s ecosystem (e.g., tokenizers, tidytext, quanteda, spacyr) offers robust solutions to handle tokenization across languages and contexts. Normalization steps like lowercasing, removing punctuation, and stopword removal were explained as methods to reduce noise. We also contrasted stemming and lemmatization – two paths to reduce words to base forms – noting their pros and cons and how they can be applied in R with packages like SnowballC or textstem. This preprocessing toolkit is crucial: by converting raw text into a cleaner, standardized token list, we set the stage for reliable structuring.

Building on tokens, we introduced the Document-Term Matrix (DTM), the primary structured representation for text. We explained how a DTM enumerates term frequencies across documents, and discussed sparsity and weighting. The concept of TF–IDF was highlighted for its ability to score term importance by balancing local frequency with global rarity. We showed how quanteda and tidytext in R make it straightforward to compute a DTM and apply TF–IDF weighting, turning texts into numerical feature vectors ready for analysis or modeling.

Recognizing that not all analyses require leaving the tidy paradigm, we examined the tidytext framework in detail, demonstrating how unstructured text can be transformed into a tidy one-token-per-row format. This allows integration with dplyr, ggplot2, and other tidyverse tools, making text analysis more accessible and legible. We emphasized how tidytext enables fluid conversion between raw text, tidy tokens, and matrix representations, embodying the principle that text data is just data – amenable to standard data science workflows when structured properly.

The chapter then explored higher-level structuring techniques that extract or impose additional structure on text:

  • Named Entity Recognition (NER): We defined named entities and illustrated how tools like spacyr in R can tag entities in text (PERSON, ORG, etc.), effectively adding a layer of structured information about “who/what is mentioned” in the text. This transforms unstructured text into databases of entities and their occurrences.
  • Topic Modeling: We explained Latent Dirichlet Allocation (LDA) and how it discovers latent topics in document collections, treating each document as a mixture of topics and each topic as a mixture of words. Using examples and tidytext integration, we showed how LDA structures a corpus into interpretable topics – a powerful unsupervised structuring that reveals thematic organization without prior labels.
  • Sentiment Analysis: We discussed both lexicon-based methods and machine learning approaches for determining sentiment, effectively structuring text by its emotional or opinion dimension. The use of sentiment lexicons in tidytext was exemplified along with cautionary notes on context (negation, sarcasm). We also touched on more advanced models that can be brought in for sentiment tasks.
  • Text Classification: Extending beyond sentiment, we covered general text categorization using supervised learning, where the structured outcome is a predicted label for each text (e.g., spam vs ham, topic categories, etc.). We pointed out that algorithms like Naive Bayes and SVM have been especially popular and effective for text, and that R’s quanteda.textmodels or tidymodels make implementing these models straightforward. The result of classification is another form of structured data extracted from text – a label or class probability that can be analyzed or used in decision processes.

Crucially, we acknowledged the shifting landscape of NLP by discussing integration of advanced models like BERT. Transformers have arguably redefined what it means to structure text because they can derive deep contextual embeddings and perform tasks like NER, sentiment, or Q&A with minimal task-specific data. We described how R users can harness these through packages such as reticulate (to interface with HuggingFace or TensorFlow) and the text package (providing a high-level R interface to transformers). By linking the latest research advances to R workflows, we equip readers with tools to push the frontier of text structuring – enabling, for instance, cross-lingual sentiment analysis or highly accurate classification by leveraging pre-trained knowledge. This not only improves performance but broadens the range of text that can be structured (e.g., low-resource languages, domain-specific jargon) because the models carry over learning from massive training corpora.

Throughout the chapter, real-world examples were interwoven to illustrate how these methods come together in practice: analyzing social media sentiment, mining customer reviews, classifying news, extracting information from research literature, and more. Each example demonstrated a flow from raw text to structured insights (like sentiment trends, key entity extraction, or topic distributions), highlighting the value gained by structuring the text.

We also underscored the importance of evaluation and validation at each step. Whatever the application, one should not only perform these transformations but also assess their quality: Does the tokenizer handle edge cases in our corpus? Is the stopword list appropriate, or did we remove important domain terms? How coherent are the topics produced by LDA (which can be checked by inspecting their top words or computing topic coherence metrics)? What is the accuracy of our classification models on held-out data? By asking these questions, readers can ensure that the structuring process yields reliable and valid structures that genuinely reflect the underlying text.

In conclusion, structuring unstructured text data in R is a multi-step journey – from cleaning and tokenizing, through feature representation (DTMs, embeddings), to higher-level structure discovery (entities, topics, sentiment, categories). It combines techniques from linguistics, statistics, and computer science, drawing on both the seminal works and the recent advances cited throughout. The R environment, enriched with packages like tidytext (Silge & Robinson, 2016) and quanteda (Benoit et al., 2018), provides an efficient and user-friendly platform for implementing these techniques, while the integration of transformer models brings state-of-the-art capabilities into the toolkit. The end result is that unstructured text – once opaque and unanalyzed – becomes structured data: a form we can aggregate, correlate, model, and draw inferences from, using all the strengths of quantitative analysis. This empowers researchers and practitioners across disciplines to include textual evidence in their analyses, enriching insights and enabling data-driven decisions based on information that was previously locked in qualitative form.

By mastering the content of this chapter, readers should be equipped to handle a wide array of text datasets. Whether the goal is to build a sentiment dashboard for product reviews, extract policy-relevant information from legislative documents, or perform a content analysis for academic research, the principles and techniques outlined here provide a solid foundation. The structured results – be it a tidy table of tokens, a matrix of features, or a set of predicted labels – form a bridge between human language and quantitative analysis, allowing the wealth of knowledge embedded in text to be systematically harnessed.

References

Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., & Matsuo, A. (2018). quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30), 774.

Devlin, J., Chang, M.‑W., Lee, K., & Toutanova, K. (2019). BERT: Pre‑training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 4171–4186). Association for Computational Linguistics.

Doan, A., Ramakrishnan, R., & Vaithyanathan, S. (2006). Managing information extraction: State of the art and research directions. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (pp. 799–800). Association for Computing Machinery.

Explosion AI. (2022). spaCy 3 documentation: Linguistic features. Retrieved July 14, 2025, from https://spacy.io/usage/linguistic-features

Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137–144.

Kjell, O., Giorgi, S., & Schwartz, H. A. (2023). The text package: An R package for analyzing and visualizing human language using natural language processing and transformers. Psychological Methods, 28(6), 1478–1498.

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.

Silge, J., & Robinson, D. (2016). tidytext: Text mining and analysis using tidy data principles in R. Journal of Open Source Software, 1(3), 37.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems, 30 (pp. 5998–6008). Curran Associates.