10  Structuring Audio Data for Qualitative Social Science Analysis (with R and Python via Reticulate)

In the social sciences, researchers often collect audio data such as interview recordings, focus group discussions, oral histories, and public speeches. These audio recordings capture not only what is said (the verbal content) but also how it is said – the tone of voice, pauses, emphasis, and emotions conveyed. Traditionally, qualitative analysts transcribe audio to text and then analyze the transcripts, effectively discarding the rich aural information contained in the recordings. As Ashley Barnwell (2025) observes, transcription is an “almost hallowed step” in qualitative research, yet by relying only on text, researchers “miss rich layers of sensory, emotional, and embodied data” present in the audio. In other words, crucial cues like a speaker’s emotion, tone, and pauses – which can influence interpretation – are often lost when we treat transcripts as the sole data.

Recent developments in technology and methodology are prompting a re-examination of how audio data can be used in social research. Advances in automatic speech recognition (ASR) (e.g. OpenAI’s Whisper model and cloud speech-to-text services) and computational audio analysis have made it feasible to efficiently extract structured information from audio. With these tools, researchers can now transcribe large volumes of audio quickly, and also quantify acoustic features (like pitch or loudness) that reflect how something was spoken. This opens the door to incorporating paralinguistic elements (how words are delivered) alongside the content of speech in qualitative analysis. For example, political scientists Lucas and Knox (2021) demonstrated that vocal tone in Supreme Court oral arguments conveyed skepticism and information not evident from transcripts alone. Such findings underscore the potential of audio data: in many social settings – from therapy sessions to political debates – how people speak can be as meaningful as what they say.

In this chapter, we provide a comprehensive guide to structuring audio data into analyzable formats using R and Python (via the reticulate bridge between them). We focus on applications in the social sciences and qualitative research, where the goal is often to integrate narrative content with vocal characteristics. We will cover the process from start to finish: recording and transcribing speech, cleaning and preparing transcripts, extracting quantitative audio features (such as pitch, intensity, speaking rate, and Mel-frequency cepstral coefficients (MFCCs)), and finally, combining these structured audio features with text-based analyses. Throughout, we use an academic tone and provide examples in both R and integrated Python code. By the end, readers (even those without advanced computer science training) should appreciate both the why and how of incorporating audio data into qualitative research workflows.

10.1 Challenges in Handling and Analyzing Audio Data

Working with audio data poses several challenges, especially for researchers trained primarily in text-based or quantitative methods. Before diving into solutions, it is important to recognize these key challenges:

  • Unstructured, High-Dimensional Format: Raw audio is essentially an unstructured time-series of sound waves. Unlike survey data or text transcripts, which can be neatly arranged into tables or documents, audio recordings do not naturally arrive in a format that is easy to analyze statistically. One must first convert or encode audio into some structured form (text or numeric features) before analysis. This conversion process can be complex.

  • Need for Transcription: Human speech in audio form is not directly searchable or analyzable for content without transcription. Transcribing audio into text (either manually or automatically) is a necessary step to analyze what was said. Manual transcription is extremely time-consuming and prone to human error or bias. Automated transcription with ASR tools is much faster, but can have accuracy issues (e.g. mis-recognized words, especially for technical terms, multiple speakers, or poor audio quality). We often must balance speed and accuracy in choosing a transcription method.

  • Quality and Noise: Audio recordings vary in quality. Background noise, overlapping speech (e.g. in focus groups), and recording artifacts can all complicate analysis. Noisy audio can reduce transcription accuracy and interfere with acoustic feature extraction. Researchers may need to apply noise reduction or ensure high-quality recording practices to get usable data.

  • Volume of Data: Audio files (especially if recorded in high fidelity or for long durations) are large in size and harder to skim through than text. Hours of recordings can quickly become overwhelming to manually process. This “data overload” makes automated structuring and summarization techniques practically important.

  • Skill and Tooling Gap: Social scientists historically have less training in audio signal processing. Methods for parsing and analyzing audio (e.g. using FFTs for spectra or extracting MFCCs) come from engineering and linguistics fields. Without user-friendly tools, these methods can seem daunting. However, modern libraries in R and Python (which we will explore) have lowered the barrier by providing high-level functions for common audio analyses.

  • Context and Privacy: Audio data can carry identifiable information (a person’s voice or accent) which raises ethical and privacy concerns, especially in qualitative research on sensitive topics. Analyzing audio might require considering how to anonymize speakers (e.g. voice alteration) or secure consent for using voice data, more so than text data. This is a practical challenge to address during data handling.

Despite these challenges, the rewards of including audio in analysis are significant. By structuring audio into quantitative and text-based formats, we can ask new research questions that were previously infeasible when working with transcripts alone. The next sections will outline workflows to meet these challenges – starting with transcription, the gateway to any audio analysis.

10.2 Transcription Workflows Using R and Python

Transcription is the process of converting spoken audio into written text. It is a critical first step because it yields a text corpus that can be read, searched, and coded for themes. In qualitative social science, transcription has traditionally been done by hand (researchers or hired transcribers listen and type what was said). Manual transcription provides full control over notation (including nonverbals or pauses), but it is extremely labor-intensive: one hour of speech can take 4–6 hours or more to transcribe manually with high fidelity. Given this cost, researchers often end up transcribing selectively or using rough summaries. Fortunately, advances in speech-to-text technology now offer faster alternatives through automation.

Automated transcription tools can quickly generate a “first draft” transcript, which the researcher can then edit for accuracy. For instance, Google Cloud Speech-to-Text API is a widely used service that recognizes over 80 languages and variants. Using Google’s API in R is made convenient by packages like googleLanguageR (Edmondson, 2020). With this package, one can send an audio file to Google’s cloud and receive a transcript (plus word-level timestamps) in return. The code example below demonstrates a simple transcription of an audio file using Google’s API in R:

# Install and load googleLanguageR (if not already installed)
install.packages("googleLanguageR")
library(googleLanguageR)

# Authenticate with Google Cloud (requires a service account JSON key)
gl_auth("my-google-auth-key.json")

# Transcribe an audio file using Google Speech-to-Text
result <- gl_speech("interview1.wav", languageCode = "en-US")

# Extract the transcript text and confidence
transcript <- result$transcript
cat(transcript$transcript)  # print the transcribed text

In this snippet, gl_speech sends the file interview1.wav to Google’s speech recognition engine. The result contains both a transcript tibble (with the full transcription and confidence score) and a timings tibble that provides timestamps for each recognized word. Having word-level timestamps is valuable; it allows the researcher to align the text with the original audio timeline, which is useful for tasks like speaker diarization (distinguishing who spoke when) and calculating speaking rate or silence gaps.

Google’s API can also perform speaker diarization during transcription. By specifying a custom configuration, we can request the service to identify different speakers in the audio. For example, setting enableSpeakerDiarization = TRUE in the request will instruct the model to label segments by speaker (Speaker 1, Speaker 2, etc.), which is extremely useful for transcribing focus groups or conversations where multiple people talk. Automated diarization isn’t perfect, but it can greatly speed up the process of separating each speaker’s contributions in a group interview.
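A minimal sketch of such a request is shown below; it uses gl_speech()’s customConfig argument, and the nested diarization field names (and the hypothetical focusgroup1.wav file) are assumptions to check against the current Google Cloud Speech-to-Text documentation rather than a drop-in recipe.

# Hedged sketch: request speaker diarization via a custom configuration.
# The field names below follow the Google Cloud Speech-to-Text API and may
# differ across API versions; verify against the current documentation.
diarized <- gl_speech(
  "focusgroup1.wav",
  languageCode = "en-US",
  customConfig = list(
    diarizationConfig = list(
      enableSpeakerDiarization = TRUE,
      minSpeakerCount = 2,
      maxSpeakerCount = 4
    )
  )
)

# Word-level output should then carry speaker tags alongside the timestamps
head(diarized$timings)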

Another powerful ASR tool is OpenAI’s Whisper (released in 2022). Whisper is an open-source model trained on 680,000 hours of multilingual data, making it robust to accents, background noise, and technical language. A major advantage of Whisper is that it can be run fully offline on your own machine, which is beneficial for sensitive data or when internet connectivity is an issue. Whisper supports multiple languages and even direct translation of speech to English. OpenAI has open-sourced Whisper’s code and models, so they can be integrated into R workflows via Python. Indeed, one can use the reticulate package in R to call Python’s Whisper library for transcription. For example:

# Use reticulate to import and run Python's OpenAI Whisper for transcription
library(reticulate)
whisper <- import("whisper")                     # import the whisper Python module
model <- whisper$load_model("small")             # load a small pre-trained Whisper model
result <- model$transcribe("interview1.wav")     # transcribe the audio file
transcript_text <- result$text
cat(transcript_text)

In this code, we leverage reticulate to load the Python whisper module directly into R. We select a model size (here "small" for demonstration; Whisper’s models range from "tiny" and "base" up to "medium" and "large-v2", with larger models offering higher accuracy at the cost of speed and memory). The transcribe() method returns a list containing the transcribed text along with segment-level timing and confidence information. We then extract the text and print it. The ability to mix R and Python in one script (thanks to reticulate) means we can use cutting-edge Python tools like Whisper without leaving the R environment. This is highly convenient in practice – for instance, one could write an R script that automatically transcribes a folder of interview recordings by looping over files and calling model$transcribe on each, as sketched below.
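To make that batch workflow concrete, here is a minimal sketch that loops over a hypothetical interviews/ folder, reusing the model object loaded above, and collects the results into one data frame.

# Minimal sketch: transcribe every WAV file in a folder (hypothetical
# "interviews" directory) with the Whisper model loaded above.
audio_files <- list.files("interviews", pattern = "\\.wav$", full.names = TRUE)

transcripts <- lapply(audio_files, function(path) {
  res <- model$transcribe(path)
  data.frame(file = basename(path),
             text = res$text,
             stringsAsFactors = FALSE)
})
transcripts_df <- do.call(rbind, transcripts)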

Accuracy considerations: Automated transcripts are usually not 100% accurate. However, many studies have found them to be “good enough” as a starting point, dramatically reducing the manual effort needed. Bokhove and Downey (2018) compared automated captioning services across different contexts (interview, public hearing, classroom) and found that while machine transcripts contain errors, the majority of errors are easy to spot and correct in a subsequent review. They conclude that these methods can produce “good enough” transcripts for first versions, yielding significant time and cost advantages in qualitative research. For example, an automated transcript might incorrectly hear a proper name or jumble a technical term, but a researcher listening through can quickly fix these without having to type the entire transcript from scratch. The net effect is a hybrid workflow: use ASR to get an initial transcript, then clean and edit it manually to produce a high-quality final transcript. We discuss cleaning steps next.

Before moving on, it’s worth noting that whichever transcription method one uses, the output can be structured further. Many ASR tools (like Whisper and Google) can output transcripts in formats like JSON or XML, with timestamps, speaker labels, and even alternative recognition hypotheses. These structured outputs can be parsed in R or Python to create rich data frames – for instance, a table where each row is a speech segment with columns for speaker ID, start time, end time, and the text spoken. Such structured transcripts become very powerful when combined with qualitative coding or quantitative text analysis.
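For instance, Whisper’s result object includes a segments list; a minimal sketch of flattening it into a tidy data frame (reusing the result object from the transcription above) could look like this:

# Minimal sketch: turn Whisper's segment list into a data frame, one row per
# recognized segment with start and end times in seconds.
segments_df <- do.call(rbind, lapply(result$segments, function(seg) {
  data.frame(start = seg$start,
             end   = seg$end,
             text  = seg$text,
             stringsAsFactors = FALSE)
}))
head(segments_df)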

10.3 Cleaning and Preprocessing Transcribed Data

Once you have raw transcripts (either from ASR or manual typing), the next step is cleaning and preprocessing the textual data for analysis. Transcribed speech tends to be messy. It often includes false starts, filler words (e.g., “um”, “you know”), colloquialisms, and idiosyncratic punctuation or capitalization (depending on the transcriber). Some of this “messiness” carries meaning – for example, pauses or hesitations might indicate uncertainty – so the decision to clean must align with your analytic goals. However, many analytical techniques (like text mining or natural language processing) benefit from a cleaned, normalized text input. Here are common preprocessing steps:

  • Remove or Mark Filler Words: Spoken language is full of fillers and interjections (“um”, “uh”, “er”, “like”, etc.). If our analysis is focused on content (topics, keywords), we might remove these fillers as they don’t contribute semantic meaning. For instance, we could use a simple find-and-replace or regex in R to eliminate “um” and “uh” from transcripts (a minimal sketch follows this list). On the other hand, if analyzing conversation dynamics or speech patterns, fillers might be relevant (they could indicate hesitation or cultural speech norms). In such cases, we might keep them or even annotate them specially (e.g., mark all fillers with a tag).

  • Standardize Text: Transcripts can be converted to a consistent case (usually lowercase) to avoid treating “Health” and “health” as different words in analysis. Also, we often remove extraneous punctuation that is not needed for analysis (e.g., stutters like “I - I - I think…” might be edited to “I think…”). If transcripts contain nonverbal descriptions in brackets (e.g., “[laughter]” or “[pause]”), one must decide whether to keep these. They can be valuable annotations for context (e.g., laughter indicating humor or sarcasm), so a common practice is to retain them or even convert them into explicit markers (like a column indicating segments with laughter).

  • Correct Obvious Errors: Automated transcripts may have mis-recognized words that are apparent from context. During the cleaning phase, one can correct these. For example, if an interviewee mentioned “PEI” and it was transcribed as “P.E.”, we’d fix the transcript to the correct term. It’s wise to do a quick quality check by listening to the audio while reading through the transcript, at least for key sections, to catch and correct such errors.

  • Annotate Speaker Turns: If not already done by the transcription process, it’s important to label who is speaking in each segment (especially for multi-speaker data). For a one-on-one interview, you might label segments as “Interviewer” vs “Participant”. For a focus group, you might number speakers (Speaker 1, Speaker 2, etc.) or use actual names/pseudonyms if known. Structuring the transcript by speaker turn allows analysis of each speaker’s contributions (e.g., one could analyze if certain themes are predominantly discussed by a particular participant). Many transcription tools provide diarization labels as discussed; if not, one might do this manually or with the help of diarization algorithms post hoc.

  • Convert to Analysis Format: After cleaning the raw text, researchers often convert transcripts into formats suitable for analysis. For qualitative coding, one might keep the transcript as a text document (with speaker labels) to import into coding software (like NVivo or ATLAS.ti). For quantitative text analysis in R, a tidy format is useful: for example, a data frame where each row is a sentence or a speaker turn, with columns for speaker, time, and text. One can use the tidytext package in R to further tokenize this text (e.g., one word per row for word-frequency analysis, or one sentence per row for sentiment analysis).

  • Synchronize with Audio (if needed): In some projects, you may want to preserve the alignment between the transcript and the original audio. Tools like ELAN or Praat allow you to time-align transcripts and export them as TextGrid files (Praat’s annotation format). If your analysis will examine timing (pauses, overlaps, etc.), you might keep an aligned version. There are R packages (e.g., readtextgrid or textgRid) that can read Praat TextGrid files so you can work with time-aligned transcript data in R. This is an advanced step but worth mentioning for completeness.
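As promised above, here is a minimal sketch of the filler-removal and standardization steps; it assumes a hypothetical transcript_df with a text column and deliberately leaves bracketed annotations such as “[laughter]” untouched.

# Minimal sketch (hypothetical transcript_df with a text column): lowercase
# the text, strip common fillers, and tidy whitespace, keeping bracketed
# annotations such as [laughter] intact.
library(dplyr)
library(stringr)

transcript_df <- transcript_df %>%
  mutate(text_clean = str_to_lower(text),
         text_clean = str_remove_all(text_clean, "\\b(um+|uh+|erm*)\\b"),
         text_clean = str_squish(text_clean))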

Cleaning is often iterative and depends on the analysis goals. A thematic qualitative analysis might tolerate more disfluencies in the text (since a human coder can interpret around them), whereas a topic modeling algorithm would need a cleaner input (since it treats every token literally). Throughout cleaning, document your decisions (perhaps in a codebook or metadata file) – for example, note if you removed all instances of “[laughter]” or if you standardized certain terms – to maintain transparency in your qualitative data preparation.

10.4 Feature Extraction from Audio Signals

Beyond transcripts, a key advantage of working with audio is the ability to extract acoustic features – structured numerical measures that quantify how something was said. While transcripts capture the content of speech, acoustic features capture the delivery. In this section, we introduce several important features (pitch, intensity, speaking rate, MFCCs, etc.) and demonstrate how to compute them using R and Python libraries. Converting audio into these features is essentially translating the raw waveform into informative variables that can be analyzed quantitatively or correlated with other data (like text or participant attributes).

What are acoustic features? They are measurements derived from the audio waveform or spectrum. Some features summarize an entire audio file or segment (e.g. the average pitch of a speaker across an interview), while others are time-varying (e.g. pitch at each moment, forming a contour). Choosing features depends on the research question. Common acoustic features in social science applications include:

  • Fundamental Frequency (Pitch): This is the frequency of the speaker’s voice pitch, usually measured in Hertz (cycles per second). It corresponds to the vibration rate of the vocal folds. Pitch is perceived as how “high” or “low” a voice sounds. In speech, pitch carries intonation and prosody (e.g., raising pitch at the end of a sentence may indicate a question in English). It’s also associated with speaker characteristics like gender and emotion – for instance, excitement or nervousness can raise pitch, while authority might be conveyed with a steady, lower pitch. We often use the symbol F0 to denote fundamental frequency. Tools can extract an F0 contour over time or summary stats like mean F0, minimum and maximum F0 for a segment.

  • Intensity (Loudness): Intensity refers to the amplitude or energy of the audio signal, which correlates with how loud the sound is. In technical terms, intensity can be measured in decibels (dB). In conversation, variations in intensity might correspond to emphasis (speaking louder to stress a point) or emotion (shouting in anger, speaking softly when sad or confidential). We can extract an intensity envelope (volume over time) or compute summary measures like average intensity of a speaker, or the ratio of loud segments to quiet segments. Intensity is influenced by both the speaker (how forcefully they speak) and recording conditions (microphone distance), so care is needed in interpretation.

  • Speaking Rate: Speaking rate is how fast a person speaks, often expressed in words per minute (WPM) or syllables per second. It reflects the tempo of speech. A higher speaking rate might indicate excitement, urgency, or perhaps nervousness (rushing through words), whereas a slower rate might indicate a calm, deliberative style or sadness. In research on speech tempo, two related measures are distinguished: speaking rate (including pauses) and articulation rate (excluding pauses). If someone speaks 150 words in a 1-minute segment (including silences), their speaking rate is 150 WPM. If we remove the silent pauses and find they actually spoke those 150 words in 45 seconds of talking, the articulation rate would be 200 WPM (since pauses are excluded). For many purposes, words-per-minute (with pauses) is a straightforward measure. Indeed, studies show that words per minute correlates very strongly (r ≈ 0.91) with listeners’ perception of speech tempo. Tools to compute speaking rate typically rely on having a transcript with timestamps: for example, counting words spoken divided by the duration of the segment yields WPM. Alternatively, one could detect syllable nuclei in the audio automatically (a more technical approach) to estimate syllables per second.

  • Voice Quality Features: These include measures like jitter (small fluctuations in pitch), shimmer (fluctuations in amplitude), and spectral tilt. Such features are often used in phonetics to quantify voice characteristics (breathy, creaky voice, etc.). They can indicate affect or health of the speaker’s voice. For example, a “trembling” voice might show higher jitter. While not explicitly listed in our chapter outline, it’s worth noting that specialized R packages (like wrassp or voice packages) can compute these if needed for a study (e.g., analyzing emotional arousal might use jitter as an indicator of a shaky voice).

  • Mel-Frequency Cepstral Coefficients (MFCCs): MFCCs are a set of features widely used in speech and audio analysis, especially in speech recognition systems. They are a numeric representation of the audio spectrum that approximates how humans perceive sound frequencies. In simple terms, MFCCs capture the timbre or tone quality of the audio in a series of coefficients (often the first 12 or 13 coefficients are used). They are derived by taking a short window of audio (~20–40 ms), computing the frequency spectrum (via FFT), warping it onto the Mel scale (which spaces frequencies according to human pitch perception), taking the logarithm of the power at those Mel-spaced frequency bands, and then applying a discrete cosine transform (DCT) to decorrelate and compress the information. The result is a small set of coefficients that succinctly describe the spectral shape of the audio frame. Each coefficient can be thought of as capturing certain frequency characteristics (e.g., the first MFCC roughly corresponds to overall spectral energy, others capture balance of low vs high frequencies, etc.). While individual MFCCs are not easily interpretable in human terms, collectively they serve as an acoustic “fingerprint” of the sound. They are extremely useful as input features for machine learning (e.g., classifying speakers or emotions) or for comparing similarity of audio segments. MFCCs are commonly used features for representing audio in quantitative analyses, including social science studies where one might want to incorporate voice characteristics into statistical models. For example, one study found that using basic phonetic features including MFCC-based measures of tone could predict political ideology better than chance, demonstrating that tone of voice contains signals about the speaker’s profile.

  • Other Prosodic Features: We should also mention duration of pauses, frequency of pauses, speech rhythm, and intonation patterns (like rises and falls in pitch). Duration of pauses between speaker turns or within a monologue can indicate hesitation or cognitive processing. Intonation patterns (e.g., rising pitch at sentence end for a question, or a particular rhythm) could be relevant for discourse analysis. These are higher-level features often derived from combinations of the basic ones above (pitch contour analysis, etc.). Advanced tools or manual annotation is sometimes used to capture them.

Extracting these features can be done with various software. In R, the packages tuneR, seewave, and wrassp are very useful. In Python, the librosa library is a powerful tool for audio analysis. We can also interface with specialized tools (like the open-source Praat software for phonetic analysis) via scripting if needed. Below, we demonstrate how to extract a few key features using R and reticulate (for Python integration).

Reading Audio Data in R

First, we need to read in an audio file for analysis. The tuneR package provides a convenient readWave() function for WAV files (and similar functions exist for other formats or via packages like av for mp3). For example:

library(tuneR)
wave_obj <- readWave("interview1.wav")

This creates a Wave object in R containing the audio sample data. We can inspect its structure:

str(wave_obj)

This will show fields such as the sample rate (wave_obj@samp.rate), bit depth, number of channels, and the waveform data itself (in wave_obj@left and wave_obj@right for stereo channels). For instance, if wave_obj has a sample rate of 44,100 Hz and length of 441,000 samples in the left channel, we know the audio is 10 seconds long (since 441000/44100 = 10 seconds).
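That arithmetic can be done directly from the Wave object:

# Duration in seconds = number of samples divided by the sampling rate
duration_sec <- length(wave_obj@left) / wave_obj@samp.rate
duration_sec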

The waveform can be plotted to visualize the raw sound amplitude over time. For example:

# Create a time axis in seconds
time <- seq(0, length(wave_obj@left)-1) / wave_obj@samp.rate
plot(time, wave_obj@left, type="l", xlab="Time (s)", ylab="Amplitude")

Figure: Waveform of an audio segment (amplitude over time). Peaks indicate louder moments, and the oscillation frequency relates to the pitch of the sound.

Such a time-domain plot is useful for seeing speaking turns or loudness at a glance (e.g., you might see silence vs speech regions). However, to extract more informative features like pitch or MFCCs, we typically move to frequency-domain analysis or use specialized functions.

Extracting Pitch (Fundamental Frequency)

Pitch extraction involves analyzing the signal for its fundamental frequency (F0) at each moment. One approach is via the autocorrelation method or the cepstral method on short frames of the signal. The seewave package has a function fund() that estimates the fundamental frequency track using cepstral analysis. The wrassp package (an R wrapper around the libassp speech signal-processing library) also provides functions like ksvF0 for pitch tracking. For simplicity, we will demonstrate using seewave::fund:

library(seewave)
# Estimate fundamental frequency contour
# Assuming our audio is reasonably short; for longer audio, you might do this in chunks
pitch_track <- seewave::fund(wave_obj, f = wave_obj@samp.rate,
                              ovlp = 50,            # 50% overlap between analysis frames
                              wl = 2048,            # window length for analysis (in samples)
                              fmax = 500,           # limit the F0 search to 0-500 Hz (human voice range)
                              plot = FALSE)         # return the contour rather than plotting it

Here, f is the sampling rate, ovlp is the overlap between successive frames (to obtain a smoother contour), wl is the window length for the FFT (2048 samples is roughly 46 ms at 44.1 kHz), and fmax restricts the search range for the fundamental. With plot = FALSE, the result pitch_track is a two-column matrix: time (s) and fundamental frequency, which fund() reports in kHz. We can analyze or plot this (converting to Hz for readability):

plot(pitch_track[,1], pitch_track[,2] * 1000, type="l", xlab="Time (s)", ylab="Fundamental Frequency (Hz)")

This plot would show the pitch contour over time for the speaker. From it, we could compute summary statistics or notice patterns (e.g., rising intonation at certain points). In our qualitative context, we might not need extremely precise pitch extraction for every millisecond; even average pitch per sentence or range of pitch per speaker could be informative.

Another method: tuneR itself provides a function FF() for fundamental frequency estimation; it operates on periodogram (Wspec) objects and is oriented more toward music and tonal analysis (e.g., identifying notes). For speech, a time-varying pitch contour like the one above is usually more informative, since pitch varies within utterances.

When interpreting pitch data, remember it can be influenced by speaker physiology (e.g., adult males generally have lower F0 than adult females, due to vocal fold length). So differences in mean pitch across speakers could reflect demographics rather than emotional or situational factors. To use pitch meaningfully, it’s often best to look at changes or patterns for the same speaker (e.g., did their pitch increase when talking about a sensitive topic compared to a neutral topic?).
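One simple way to operationalize this within-speaker comparison is to express each segment’s pitch relative to the speaker’s own baseline. The sketch below assumes a hypothetical segment_features data frame with one row per segment and columns speaker and mean_pitch.

# Minimal sketch: within-speaker z-scores, so each segment's pitch is judged
# against that speaker's own baseline rather than across speakers.
library(dplyr)

segment_features <- segment_features %>%
  group_by(speaker) %>%
  mutate(pitch_z = (mean_pitch - mean(mean_pitch, na.rm = TRUE)) /
                    sd(mean_pitch, na.rm = TRUE)) %>%
  ungroup()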

Extracting Intensity (Loudness)

To get intensity, we can compute the root-mean-square (RMS) energy in short windows across the audio. seewave has an env() function that computes the amplitude envelope, and wrassp provides rmsana() for RMS contour. For example, using seewave::env:

# Compute the amplitude envelope (absolute amplitude, smoothed)
env_vals <- seewave::env(wave_obj, f = wave_obj@samp.rate, envt = "abs",
                         msmooth = c(50, 0),  # 50-point mean sliding window, no overlap
                         plot = FALSE)        # return the envelope values instead of plotting

With plot = FALSE, env() returns the envelope as a one-column matrix of amplitude values. These can be converted to dB if needed (20*log10 of the normalized amplitude). For decibels relative to the loudest point in the file:

env_vals <- as.numeric(env_vals)
env_db <- 20 * log10(env_vals / max(env_vals))

We could then summarize intensity, for example with mean(env_db) or by examining how often the signal approaches its peak. With this normalization, 0 dB corresponds to the loudest point in the file and everything else is negative relative to it. If we had calibrated recordings, we could obtain absolute dB SPL, but relative measures are usually sufficient within a single study.

Another approach: using tuneR, we can obtain the waveform data (as numeric) and compute RMS by chunking the vector. For example, to get average volume per second:

y <- wave_obj@left / (2^(wave_obj@bit-1))  # convert to numeric range -1 to 1
sec <- wave_obj@samp.rate
# compute RMS for each 1-second block
rms_per_sec <- sapply(split(y, (seq_along(y)-1) %/% sec), function(frame) sqrt(mean(frame^2)))

This would yield a vector of RMS values for each second of audio. Converting to dB: 20*log10(rms_per_sec).

The main point is that intensity features can be extracted at various granularities. For a qualitative interview, one might simply compute the overall average intensity of each speaker as a single number (which could, for example, indicate who tends to speak louder). Or, one could mark sections of transcript with noticeably higher volume (perhaps indicating emotional exclamation) by thresholding the intensity contour.
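Building on the per-second RMS values computed above, a simple thresholding rule (the 90th percentile cutoff here is an arbitrary illustration) can flag candidate “raised voice” moments for closer listening:

# Minimal sketch: flag seconds whose RMS exceeds the file's 90th percentile,
# as candidate moments of raised volume to review against the transcript.
loud_threshold <- quantile(rms_per_sec, 0.90)
loud_seconds <- as.integer(names(rms_per_sec)[rms_per_sec > loud_threshold])
loud_seconds  # 0-based second offsets into the recording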

Extracting Speaking Rate

Speaking rate can be obtained by leveraging the transcript and timestamps. If we have word-level timestamps (as from Google’s API or Whisper with word_timestamps=True), we could calculate words per minute over any span. For example, suppose our transcript data frame has columns: speaker, start_time, end_time, text. We can do:

library(dplyr)
library(stringr)

transcript_df %>%
  group_by(speaker) %>%
  summarize(total_words = sum(str_count(text, "\\S+")),
            total_time_min = (max(end_time) - min(start_time)) / 60,
            speaking_rate_wpm = total_words / total_time_min)

This gives an average WPM for each speaker over the whole session. We could refine this by excluding long pauses if needed (for articulation rate) – e.g., by summing only durations of segments when the person is talking.

If word timestamps are not available, one could approximate speaking rate using speech segments. For instance, if an interview question answer spans 30 seconds and contains 100 words, that’s 200 WPM in that answer. Doing this for each answer yields a distribution of speaking rates for that speaker.

There are also algorithms to detect syllables or phonetic events from audio which can infer speech rate without a transcript, but these are beyond our scope and less accurate for conversational speech. Since we usually will have transcripts, the text-based approach is simpler and sufficient.

One must also consider pauses. If analyzing conversational dynamics, it might be interesting to calculate not just raw WPM but how long each speaker pauses before responding, or how often they pause mid-sentence. Those require timestamp analysis too (e.g., difference between one speaker’s end time and the other’s start time gives pause length between turns).
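A minimal sketch of that latency calculation, assuming the time-stamped transcript_df described earlier (one row per turn, with start_time and end_time in seconds):

# Minimal sketch: gap between the end of one turn and the start of the next.
# Positive values are pauses between turns; negative values indicate overlap.
library(dplyr)

transcript_df <- transcript_df %>%
  arrange(start_time) %>%
  mutate(response_latency = start_time - lag(end_time))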

Extracting MFCCs and Other Spectral Features

For MFCCs, the tuneR package conveniently provides a function melfcc(). This function will take a Wave object and return MFCC features. Let’s use it:

# Compute MFCCs for the audio (using tuneR's melfcc)
mfcc_matrix <- tuneR::melfcc(wave_obj, numcep = 13, wintime = 0.025, hoptime = 0.01)
dim(mfcc_matrix)

The result mfcc_matrix is a matrix with each row corresponding to one analysis frame (observation) and each column to one of the MFCC features (here 13 coefficients). The wintime and hoptime arguments specify a 25 ms window length and a 10 ms step between frames, which are standard settings in speech analysis. (Setting frames_in_rows = FALSE transposes the arrangement, placing coefficients in rows.) The documentation notes that melfcc can also return additional attributes like the spectrogram or filter bank if requested.

We could summarize MFCCs by taking averages or variances across time, but often they are used as input into further analysis (like clustering or classification). For example, to compare voices, one might take the mean MFCC vector for each speaker and compute distances. However, interpreting MFCCs directly is not intuitive, so they shine when fed into algorithms that can learn patterns from them.
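As a rough illustration, a voice-to-voice comparison could average the MFCC frames for each speaker and take the distance between the resulting vectors; the sketch below assumes hypothetical mfcc_speakerA and mfcc_speakerB matrices with frames in rows (as returned by melfcc above).

# Minimal sketch: summarize each voice by its mean MFCC vector and compare
# the two profiles by Euclidean distance.
profile_A <- colMeans(mfcc_speakerA)
profile_B <- colMeans(mfcc_speakerB)
dist(rbind(profile_A, profile_B))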

Using Python’s librosa via reticulate: Alternatively, many researchers leverage Python’s librosa library for audio features, as it is very feature-rich. With reticulate, we can do the same MFCC extraction in Python and bring the result into R. For instance:

library(reticulate)
librosa <- import("librosa")

# Load audio using librosa (it returns waveform array and sample rate)
y_sr <- librosa$load("interview1.wav", sr=NULL)  # sr=NULL means use original sample rate
y <- y_sr[[1]]    # waveform as a numeric vector
sr <- y_sr[[2]]   # sample rate

# Compute MFCCs using librosa
mfcc <- librosa$feature$mfcc(y = y, sr = sr, n_mfcc = 13L)  # 13L: pass an integer, not a double, through reticulate
dim(mfcc)

The librosa$feature$mfcc function returns a matrix of shape (n_mfcc, n_frames). So if dim(mfcc) returns 13 x 1000, that means 13 coefficients over 1000 frames. We could transpose it to have each row as a frame if needed. Librosa offers many other features too: for example, librosa$feature$rms for energy (RMS) per frame, librosa$feature$zero_crossing_rate (which can correlate with noisiness), and functions for spectral centroid, bandwidth, etc. It even has librosa$beat$tempo which can estimate tempo (for music, but in speech it might catch speaking rate roughly).

For an illustration, librosa can also compute a Mel spectrogram (the step before MFCC):

mel_spec <- librosa$feature$melspectrogram(y = y, sr = sr, n_mels = 96L)  # 96L: integer expected
mel_db <- librosa$power_to_db(mel_spec, ref = 1.0)  # convert to log scale (dB)

We could visualize this spectrogram in R:

image(t(mel_db), col=terrain.colors(50), xlab="Time frames", ylab="Mel frequency bins")  # low frequencies at the bottom

However, a more straightforward way is to use R’s plotting or even save the matrix as an image. Spectrograms are great for visual analysis: they show time on one axis, frequency on the other, and intensity by color (as in the figure below).

Figure: Example spectrogram of a recorded sound (time on the horizontal axis, frequency on the vertical axis up to 3 kHz). Warmer colors indicate higher energy at those frequencies. Such visualizations help in qualitative interpretation of audio (e.g., identifying high-pitched vs low-pitched sections, or seeing pauses as blank vertical gaps).

With MFCCs and related features extracted, we now have a structured numerical representation of the audio. For instance, suppose we have an interview with two speakers; we could compute features like average pitch, pitch range, average intensity, speaking rate, etc. for each speaker. We could tabulate these in a data frame: each speaker (or each turn, or each thematic segment) as a unit, with columns for these features. This transforms the audio into a quantitative dataset that can be merged or correlated with other data (such as survey results, text analytics, or demographic information).

To summarize this section: feature extraction moves us from raw audio into meaningful numbers. These features can feed into statistical analyses or be used to augment qualitative insights. For example, one might find that a participant’s speaking rate drastically slowed down when discussing a traumatic memory – a measurable indicator of emotional impact that complements the content of their narrative. Or one could observe that interviews conducted over the phone had lower average intensity (perhaps due to softer voices or microphone differences) than those in person, suggesting a need to control for recording conditions in analysis. The possibilities are broad, and choosing the right features is guided by theory and the research questions at hand.

10.5 Use Cases for Structured Audio Data in Qualitative Research

With transcripts and audio features in hand, what can we do in a social science context? This section explores several use cases where structured audio data enriches qualitative and mixed-methods research. We will discuss thematic content analysis, speaker diarization and interaction analysis, and emotion or paralinguistic analysis – highlighting how combining audio-derived measures with traditional qualitative data can yield deeper insights.

Enhancing Thematic Analysis with Audio Features

Thematic analysis is a staple in qualitative research, where researchers code transcripts to identify patterns and themes in what participants say. Typically, this relies solely on the textual content. However, incorporating audio features can add a new dimension: understanding how topics are discussed. For example, imagine a study of counseling sessions where one theme is “anxiety about the future.” Through content coding, you identify when clients talk about future anxieties. By bringing in audio data, you could also examine vocal markers of anxiety – perhaps the client’s pitch rises and volume falls when discussing those fears, indicating a nervous tone. If you consistently find, say, that when Theme X is discussed the speaker’s speaking rate speeds up and sentences trail off (softer intensity), that pairing of content and delivery is a finding in itself (e.g., a sign of emotional arousal or hesitation around that topic).

Concretely, one could do the following: code the transcripts for themes in the usual way (using a qualitative data analysis software or manual coding). Then, use the time-stamped transcript to segment the audio by theme. For each thematic segment, compute average acoustic features. This results in a dataset where each theme occurrence has both qualitative code and quantitative tone measures. Analysis might reveal patterns like “Participants consistently spoke with higher pitch and intensity when discussing experiences of discrimination compared to when discussing everyday events.” This could be evidenced by statistically higher mean pitch (say 20 Hz higher on average) in segments coded as Discrimination versus Daily Life. It provides an empirical backing (beyond the researcher’s subjective impression) that certain topics were indeed accompanied by a change in vocal delivery, suggesting emotional salience.

Another use case: in policy deliberation research, one might be interested in persuasiveness or emphasis. Suppose you have transcripts of political debates. A thematic analysis identifies when speakers express key messages or arguments. By measuring audio features, you might find that speakers tend to slow down and lower their volume when saying something they want to appear thoughtful, or conversely raise volume for key rally points. Linking these to outcomes (did the argument land? did it get applause?) could be insightful. In fact, scholars have noted that “auditory cues that convey emotion, signal positions, and establish reputation” are embedded in how political speech is delivered. Including those cues in analysis can confirm or challenge assumptions that are based only on transcript words.

Practically, to combine theme coding with audio features, one can use a spreadsheet or statistical tool to align them. For instance, export coded segments with their time ranges, then use R to calculate features for those time ranges (possibly using the indices from word timings that fall in that segment). The result can be analyzed via cross-tabulations (theme by vocal quality) or even visualized (one could imagine a chart showing average speaking rate for each theme category).
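A minimal sketch of that alignment step is shown below; it assumes a hypothetical coded_segments data frame (columns theme, start_time, and end_time in seconds) together with the wave_obj loaded earlier, and uses RMS intensity as the example feature.

# Minimal sketch: cut the audio to each coded segment's time range, compute a
# simple intensity measure (RMS), and compare averages across themes.
library(dplyr)
library(tuneR)

segment_features <- coded_segments %>%
  rowwise() %>%
  mutate(rms = {
    seg <- extractWave(wave_obj, from = start_time, to = end_time,
                       xunit = "time", interact = FALSE)
    amp <- seg@left / (2^(seg@bit - 1))
    sqrt(mean(amp^2))
  }) %>%
  ungroup()

segment_features %>%
  group_by(theme) %>%
  summarize(mean_rms = mean(rms))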

Speaker Diarization and Interaction Analysis

In focus groups, community meetings, or multi-party interviews, who speaks when and how much is often a research question. Speaker diarization – identifying segments of the audio by speaker – is the key to unlocking this type of analysis. Once you have structured audio with speaker labels and time stamps, you can quantify interaction patterns:

  • Speaking Time and Turn-Taking: You can calculate how long each speaker talked (total seconds or as a percentage of the session). This can address questions of dominance or engagement: did one participant monopolize the discussion? For example, in a focus group of 5 people, you might find Participant A spoke 40% of the time while Participant E only 5%. Combined with qualitative notes, this could contextualize whose voices are most heard and whose are marginalized. If Participant E’s few contributions were also high in intensity (perhaps indicating they struggled to interject), that might be a sign of group dynamics at play.

  • Number of Turns and Interruptions: By analyzing the sequence of speaker turns (possible if diarization includes the timeline), one can count how many times each person spoke and how often interruptions occurred. For instance, overlapping segments (where two speakers’ speech overlaps in the audio) can be automatically detected by some diarization tools or by energy-based detection. Frequent interruptions of one speaker by another could indicate power dynamics or conflict in the group. Conversely, long uninterrupted monologues might indicate one person lecturing or others deferring.

  • Interaction Sequences: In conversation analysis, the order and timing of responses matter. With time-coded data, one can measure latency – how long does it take for someone to respond after another stops talking. If in a counseling session the client is consistently pausing 5 seconds before answering certain questions, that might signal reluctance or the need to formulate thoughts. If in a meeting, people jump in immediately (short latencies), it could signal a lively debate or conversely, impatience.

Modern diarization algorithms like pyannote.audio (a Python toolkit) can automatically separate speakers with reasonable accuracy. We could leverage that via reticulate if needed, but using the Google API’s built-in diarization as earlier may suffice for many cases. It’s noteworthy that Google’s API, as shown, can output a transcript with speaker tags (Speaker 1, Speaker 2, etc. with each sentence tagged). We might need to clean those labels (e.g., map “Speaker 1” to an actual name if known).

Once the audio is structured by speaker, we can create an interaction summary. For example:

# Summarize speaker interaction from a diarized transcript data frame
library(dplyr)

transcript_df %>%
  mutate(duration = end_time - start_time) %>%
  group_by(speaker) %>%
  summarize(turns = n(),
            total_time = sum(duration),
            avg_turn_duration = mean(duration))

Here duration is the length of each segment (end_time - start_time). This yields how many speaking turns each speaker had, their total speaking time, and their average turn length.

Combining with features, one could also examine how each speaker talks: e.g., Speaker A’s average pitch, intensity, rate as discussed. This could tie into qualitative observations (maybe Speaker A, who dominates time, also speaks loudly and quickly – a particular communicative style).

In qualitative interpretation, such quantitative interaction data can be triangulated with participant feedback or observer notes. If a certain participant’s low contribution was observed qualitatively as a sign of discomfort, the hard data on speaking time backs that up, lending credibility. Or if unexpected, it can prompt a re-examination of the dynamic (e.g., “we assumed Person X was quiet, but actually, they spoke more than some others – perhaps their influence was subtle”).

Emotion and Paralinguistic Analysis

One of the most compelling reasons to analyze audio is to gain insight into emotions and attitudes that are not fully captured by words. The paralinguistic elements of speech – tone, pace, emphasis, intonation – carry emotional information. Social science researchers are increasingly interested in sentiment and emotion analysis not just from text (what was said) but from voice (how it was said). The structured audio features we extracted enable such analysis.

For instance, sentiment analysis on text might tell you that a person’s words are positive, neutral, or negative in sentiment. But sentiment analysis alone can miss sarcasm or emotional tone. Audio features can complement this by indicating, for example, that although the transcript’s words are neutral, the tone was sarcastic (perhaps detected by a certain intonation pattern or a combination of high pitch with a particular rhythm). As a hypothetical example, an interviewee might say the words “I’m totally fine with that” – text sentiment would rate this positively – but if the audio reveals the phrase was spoken in a flat, low pitch, with a long pause before it, a human listener might detect resignation or discomfort. By quantifying aspects of that delivery (e.g., monotonic low pitch, slower rate), we could train or use models to detect such mismatches between textual and vocal sentiment.

There are already computational models for emotion recognition from speech (a field often called speech emotion recognition, SER). These often utilize features like MFCCs, pitch, intensity, and voice quality as inputs to classify emotions (happy, sad, angry, etc.). While building an emotion classifier may be beyond a typical social science project, researchers can use simpler approaches: for example, measure cues that correlate with arousal (pitch variability, intensity) and valence (tone breathiness, etc.), or even use off-the-shelf tools. One accessible tool is the openSMILE feature extractor (from audEERING), which can output a large set of predefined features known to be useful for emotion detection; its output could be analyzed in R statistically. Another approach is to use commercial cloud APIs that advertise tone or emotion analysis, though those are proprietary and less transparent.

A practical example in qualitative research is given by Cottingham and Erickson (2020), who used audio diaries to capture emotion in participants’ own voices. They argue that audio diaries allow participants to record their feelings in real-time with the expressive nuances of voice, overcoming some limitations of written diaries. In their study, they analyzed not just the content of diary entries but how those entries were spoken, to understand emotional labor and authenticity. One could imagine coding those diary entries for emotion (maybe using a combination of human coders and acoustic indicators). They found that the “slippery” nature of emotion can be better captured when the actual voice is analyzed, since feelings like hesitation, excitement, or sadness often manifest in tone and pacing (e.g., a trembling voice when recalling a painful memory, or a bright, animated tone when describing something joyful).

Another use case: Speaker identity and impression. Research by Fasoli & Maass (2020) looked at “sounding gay” – they found that simply hearing someone’s voice led listeners to infer sexual orientation and then apply stigma. This shows that people pick up on subtle voice characteristics. As researchers, we could attempt to quantify those characteristics (perhaps certain pitch ranges, intonational patterns, or sibilance frequencies). If doing a study on bias, one could correlate participants’ vocal features with the impressions they make on others. Similarly, sociolinguists examine how dialect or accent is perceived; structured audio data could feed into such analysis by measuring, say, the vowel formant frequencies to quantify accent.

In sum, emotion and paralinguistic analysis with structured audio might involve: correlating acoustic features with questionnaire measures of emotion (e.g., does a person who self-reports high stress speak with a higher pitch on average?), identifying changes in delivery when sensitive topics arise (e.g., as a qualitative analyst might note “her voice grew softer when discussing her family”), or even performing exploratory data analysis to see if different categories of speech cluster in feature space (one could do a PCA or UMAP on extracted features to see if, say, all the excited speech segments group separately from calm segments). These approaches connect the quantitative dots to support qualitative interpretations about emotional expression.
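As a minimal sketch of that kind of exploratory check, assume a hypothetical segment-level features_df containing acoustic summaries (mean_pitch, pitch_sd, mean_intensity, wpm) plus a qualitative code column:

# Minimal sketch: PCA on standardized acoustic features, plotted with points
# colored by qualitative code to see whether delivery styles cluster.
acoustic <- features_df[, c("mean_pitch", "pitch_sd", "mean_intensity", "wpm")]
pca <- prcomp(scale(acoustic))

plot(pca$x[, 1], pca$x[, 2],
     col = as.factor(features_df$code), pch = 19,
     xlab = "PC1", ylab = "PC2")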

10.6 Combining Audio Features with Text Analytics Workflows

A strength of structuring audio data is that it enables multi-modal analysis – linking the quantitative acoustic measures with traditional text or categorical data. Many social science projects can benefit from this combination. Let us consider a few concrete integrative analyses one can do:

  • Tone-Topic Analysis: After performing a topic modeling on transcripts (e.g., using LDA to find common topics discussed across interviews), one could examine whether the acoustic tone differs by topic. For example, in community forums, perhaps discussions on “school funding” tend to have more heated exchanges (higher volume, higher interrupt rate) compared to discussions on “community events” which might be calmer. If one has topic labels for each segment of audio, it’s straightforward to compute feature averages per topic and compare. This could yield statements like: “Topic A was discussed with significantly faster speech and higher pitch than Topic B, suggesting Topic A sparked more animated conversation.” Such findings give texture to pure content analysis.

  • Sentiment-Tone Consistency: If performing sentiment analysis on the transcript text (using, say, a lexicon approach or machine learning sentiment classifier), one can cross-check it with acoustic indicators of sentiment. One might find, for instance, that segments identified as “negative sentiment” in text also correlate with lower pitch and slower pace (which could be markers of sadness or seriousness), whereas “positive sentiment” segments had higher pitch variability and louder volume (markers of excitement). Alternatively, mismatches could be insightful: parts where text sentiment is positive but tone is flat might indicate politeness or masking true feelings (important in contexts like customer service calls or diplomatic speech). By quantifying both, we can systematically identify such mismatches for closer qualitative review.

  • Demographic or Individual Differences: Structured audio data allows exploring how different speakers or groups speak. For example, do youth and elders differ in speaking rate or pitch use in an interview context? Perhaps older adults speak more slowly on average (articulation rate differences) or use a narrower pitch range. Gender differences in pitch are well-known (biologically influenced), but differences in prosodic style (like intonational patterns or uptalk) could also be analyzed. One could correlate audio features with demographic attributes of participants (age, gender, cultural background) to see if there are patterns that warrant interpretation. This must be done carefully and contextually (avoiding stereotypes), but it can enrich sociolinguistic aspects of analysis. For instance, a study might note “Female participants in our study had higher median pitch as expected, but interestingly, they also used greater pitch variation when discussing personal experiences compared to male participants, who tended to keep a more monotone delivery.” That kind of observation, backed by data, can lead to discussions about social conditioning in communication styles.

  • Predictive Modeling and Triangulation: In mixed-methods projects, one might even build simple predictive models – e.g., does vocal tone predict an outcome of interest when controlling for text? Lucas and Knox’s MASS model is an advanced version of this, where they used vocal tone features to predict things like a justice’s level of skepticism and showed it improved prediction of judicial behavior over text-only models. On a simpler level, one could do a regression with, say, interviewee satisfaction (perhaps rated on a survey) as the outcome, and use both what topics were mentioned (text-based features) and how the interviewee spoke (audio features) as predictors. This might reveal, for example, that speaking in a more animated tone (higher energy) was associated with higher expressed satisfaction, controlling for the content of what was said. The interplay of modalities can provide robustness: if both the words and the tone align to suggest a certain emotional state, we have stronger evidence of it.

From a workflow perspective, combining audio and text features typically means merging datasets. One would have the text analytics outputs (like topic tags, sentiment scores, or code frequencies) in one table, and the audio features (pitch, intensity, etc.) in another table keyed by segment or speaker. Joining them on the segment ID or speaker ID allows integrated analysis. Many of the analysis techniques (t-tests, ANOVAs, correlations, clustering) from quantitative methods can then be applied, but always to inform qualitative understanding, not as an end in themselves. Visualizations such as scatter plots can help: for instance, plotting sentiment score (text-derived) against mean pitch (audio-derived) for segments could visually show whether they correlate (e.g., more negative statements often had lower pitch).
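A minimal sketch of that merge, assuming hypothetical text_features and audio_features data frames that share a segment_id key:

# Minimal sketch: join text-derived and audio-derived features on segment_id,
# then check one simple cross-modal association.
library(dplyr)

combined <- inner_join(text_features, audio_features, by = "segment_id")

# e.g., do more negative segments tend to have lower pitch?
cor(combined$sentiment_score, combined$mean_pitch, use = "complete.obs")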

It is important in such integration to remain mindful of context. Numbers can complement qualitative interpretations but shouldn’t be divorced from them. A high speaking rate in one scenario might mean something different in another (excited vs. simply explaining technical info quickly). Thus, the researcher should iterate between quantitative findings and listening to actual audio or reading transcripts to interpret why a certain pattern might be occurring.

10.7 Case Example: Analyzing Interview Data with Audio and Text in Tandem

To illustrate the entire process, let’s walk through a hypothetical case example. Imagine a study in sociology on job interview experiences among recent college graduates. You conduct semi-structured interviews with 20 participants, each roughly an hour long, about their experiences and feelings during job interviews. You want to analyze not only the content of what they say (e.g., themes like “anxiety,” “preparation,” “self-presentation”) but also how they say it (do they become nervous or excited when recounting certain incidents?).

Data Collection: You record audio of each interview (with consent). The audio files are named and stored securely. You also take field notes on notable moments (e.g., “Participant voice shook when discussing first job fair”).

Transcription: Using R and Python tools as described, you decide to use OpenAI Whisper for transcription because of privacy (keeping data local) and its strong accuracy for conversational English. In R, you loop through each audio file with a reticulate call to Whisper, producing a transcript for each. Whisper conveniently also provides timestamps for each sentence. After getting the automated transcripts, you or a research assistant quickly review them while listening to the audio at 1.5x speed, correcting any misheard words (especially company names or jargon) and inserting labels for the interviewer’s questions vs. the participant’s answers.

Transcript Cleaning: You remove filler words like “um” and “uh” from the participant responses for easier reading, but you keep some strategic pauses marked (e.g., “[long pause]”) when the participant fell silent for a while – since that might indicate a moment of emotion or thoughtfulness worth noting. You format the transcripts in a table with columns: participant_id, segment_id, speaker (Interviewer or Participant), start_time, end_time, text. Each row is roughly a turn of speech (the interviewer’s question or the participant’s answer, which might be a long monologue).
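A sketch of this cleaning step, assuming the segment table produced by the transcription loop plus the columns added during the manual review pass (`participant_id`, `segment_id`, `speaker`); the filler-word pattern is illustrative and deliberately conservative.

```r
library(dplyr)

clean_transcripts <- transcripts %>%
  mutate(
    # Remove common fillers on word boundaries; bracketed markers such as
    # "[long pause]" that were inserted by hand are left untouched
    text = gsub("\\b(um+|uh+|erm*)\\b,?\\s*", "", text, ignore.case = TRUE),
    text = trimws(gsub("\\s+", " ", text))
  ) %>%
  select(participant_id, segment_id, speaker, start_time, end_time, text)
```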

Qualitative Coding: You read through the transcripts and use a qualitative coding approach (perhaps thematic coding). You identify several recurring themes: e.g., Confidence (moments participants sounded confident), Anxiety (moments they expressed or presumably felt anxiety), Preparation (discussing how they prepared), Uncertainty (expressing uncertainty or hesitation about what to do), etc. You apply these codes to relevant segments of the text. Notably, you realize some codes are actually reflected in tone – for instance, whenever a participant says “I was really nervous,” in the audio they laugh nervously or their voice wavers. You annotate those observations.

Audio Feature Extraction: Now you extract audio features for each participant’s speaking segments. Using tuneR and seewave (or librosa), you calculate for each participant:

  • Mean and median pitch of their voice (across all their speech in the interview).
  • Pitch variability (e.g., standard deviation of F0, or range: max–min).
  • Mean intensity of their speech (e.g., in dB relative to full scale, dBFS).
  • Speaking rate in their answers (words per minute, calculated from transcript timestamps for their segments).
  • A measure of voice shakiness: you compute a frame-level F0 track with wrassp (e.g., wrassp::ksvF0) and derive a simple jitter-like measure (the mean % period-to-period irregularity in F0) as an indicator of tremor in the voice.
  • Additionally, you get MFCC means for each participant just in case you want to explore clustering by vocal profile.

You compile these into a participant-level feature table. For example, Participant 7 might have: mean pitch 210 Hz, pitch SD 30 Hz, mean intensity -18 dBFS, speaking rate 180 WPM, jitter 1.5%.
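A condensed sketch of how such a table could be assembled with tuneR, wrassp, and the transcript timestamps appears below. For brevity the acoustic summaries are computed over each whole recording; in practice one would first cut to the participant’s segments using the timestamps. The jitter-style measure here is a simple frame-to-frame F0 irregularity rather than Praat’s period-based jitter, and the file naming and column names are illustrative.

```r
library(tuneR)
library(wrassp)
library(dplyr)

wav_files <- list.files("data/interviews", pattern = "\\.wav$", full.names = TRUE)

feature_table <- bind_rows(lapply(wav_files, function(f) {
  wav <- readWave(f)
  if (wav@stereo) wav <- mono(wav, "left")

  # Fundamental frequency track in Hz; unvoiced frames are returned as 0
  f0 <- ksvF0(f, toFile = FALSE)$F0[, 1]
  f0 <- f0[f0 > 0]

  # Overall level in dBFS, from samples normalised to [-1, 1]
  samples <- wav@left / (2^(wav@bit - 1))
  db_fs   <- 20 * log10(sqrt(mean(samples^2)))

  # Crude jitter-like measure: mean % frame-to-frame change in F0
  jitter_pct <- mean(abs(diff(f0)) / head(f0, -1)) * 100

  # 13 MFCCs averaged over the recording as a vocal-profile summary
  mfcc_means <- colMeans(melfcc(wav, numcep = 13))
  names(mfcc_means) <- paste0("mfcc", seq_along(mfcc_means))

  data.frame(file = basename(f),
             mean_pitch_hz = mean(f0), median_pitch_hz = median(f0),
             pitch_sd_hz = sd(f0), mean_intensity_dbfs = db_fs,
             jitter_pct = jitter_pct, as.list(mfcc_means))
}))

# Speaking rate (words per minute) from the transcript timestamps,
# counting participant speech only
wpm_table <- clean_transcripts %>%
  filter(speaker == "Participant") %>%
  group_by(participant_id) %>%
  summarise(wpm = sum(lengths(strsplit(text, "\\s+"))) /
                  (sum(end_time - start_time) / 60))
```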

Integration: Now you integrate the qualitative and quantitative. First, you notice a pattern during coding: participants who self-reported being very anxious during real job interviews (say you asked them to rate their anxiety) tended to have more disfluencies in speech and seemed to speak faster when recounting those stories. You verify this quantitatively: you have their anxiety self-rating from a survey, and you correlate that with their speaking rate during the interview. Indeed, you find a moderate positive correlation (r ~ 0.5) – those who said they were very anxious in real job interviews also spoke faster in your research interview. Perhaps anxious personalities have a higher baseline speech rate. You mention this finding in your report as an interesting link between trait anxiety and communication style, citing the correlation as evidence (with caution about causal direction).
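In code, that check is essentially a one-liner, assuming a participant-level table that merges the survey rating with the speaking rate computed above (the table and column names are illustrative).

```r
# participant_features is assumed to merge the survey data (anxiety_rating)
# with the transcript-derived speaking rate (wpm) by participant
cor.test(participant_features$anxiety_rating, participant_features$wpm)
```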

Next, consider the themes Anxiety and Confidence that you coded in the narratives. For each participant, you identify segments where they talked about feeling anxious versus segments where they talked about feeling confident. You compute the average pitch and volume in those respective sets of segments. You discover that for almost all participants, when talking about anxiety experiences, their pitch was higher and their volume slightly lower than when talking about confidence experiences. For one participant, when describing a particularly anxious moment, their pitch rose roughly half an octave above their norm (you see a spike in the pitch contour to around 300 Hz, whereas their median is 200 Hz). You double-check by listening – indeed their voice was trembling and high. This pattern – higher pitch during anxiety – is consistent with general psychophysiological responses (stress can tighten the vocal folds, raising pitch). So, in your analysis, you integrate this: “Participants’ paralinguistic cues underscored their described emotions. When narrating anxious interview moments, their voices often rose in pitch and sometimes quivered, aligning with the content of their stories. In contrast, when describing moments of confidence or success, their voices were steady and relatively lower in pitch.” You support this with the data, e.g., “On average, fundamental frequency was 15 Hz higher during ‘anxiety’ narratives than during ‘confidence’ narratives.”
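One way to run that comparison, assuming a segment-level table `coded_segments` in which each coded segment carries its theme label and an acoustic summary (all names illustrative):

```r
library(dplyr)
library(tidyr)

# coded_segments: participant_id, code ("Anxiety"/"Confidence"), mean_pitch_hz, ...
theme_pitch <- coded_segments %>%
  filter(code %in% c("Anxiety", "Confidence")) %>%
  group_by(participant_id, code) %>%
  summarise(mean_pitch = mean(mean_pitch_hz), .groups = "drop") %>%
  pivot_wider(names_from = code, values_from = mean_pitch)

# Paired, within-participant comparison of pitch across the two themes
t.test(theme_pitch$Anxiety, theme_pitch$Confidence, paired = TRUE)
```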

Another angle: you use the MFCC features to see if there are clusters of speaking style. Perhaps you run a quick principal components analysis on the participants’ MFCC means and find two main clusters. When you look at who is in each cluster, you realize one cluster consists mostly of participants who grew up in a particular region (say, they have a slight Southern U.S. accent), while the other does not. The MFCCs, picking up accent differences, effectively clustered them. While this might not be central to your research question, it is a neat example of how audio features can reveal background differences. If relevant, you might report that “audio-derived features even captured accent or dialect variations between participants, which could be further analyzed in relation to their interview experiences (though in this study, no clear pattern emerged linking accent to interview outcomes).”
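A quick sketch of that exploration, using the MFCC columns from the feature-extraction sketch above (the `mfcc` prefix is the illustrative naming used there):

```r
# Principal components of the participant-level MFCC means
mfcc_cols <- grep("^mfcc", names(feature_table), value = TRUE)
pca <- prcomp(feature_table[, mfcc_cols], scale. = TRUE)
summary(pca)                      # variance explained per component

# Exploratory two-cluster solution on the first two components
set.seed(1)
cl <- kmeans(pca$x[, 1:2], centers = 2)
table(cl$cluster)                 # cluster sizes; cross-tabulate with background info
```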

Finally, you compile a holistic interpretation. For each major theme, you write a narrative that includes both what was said and how it was said. For example: Theme: Self-Doubt. In the text, you note that participants often used phrases like “I’m not sure if I was what they wanted.” Paralinguistically, these admissions of doubt were accompanied by quieter speech and downward intonation. You might quote a piece of transcript and then describe the audio: “When saying ‘I wasn’t confident at all,’ the participant’s volume dropped by about 5 dB and her pitch lowered, as if conveying resignation.” This enriches the qualitative analysis, making the description almost multimodal – the reader can imagine the voice thanks to the data-driven account you provide.

In the discussion/conclusion of that case study, you would highlight how integrating audio analysis provided evidence for interpretations (such as confirming participants’ reported emotions through vocal cues) and uncovered subtleties that might have been overlooked (like the correlation between trait anxiety and speaking rate). You might reflect that without audio, one might misinterpret some statements – e.g., sarcasm or uncertainty that were clear in tone but not in text.

This hypothetical case demonstrates in a concrete way how to apply the techniques covered: transcription with Whisper (Python in R), cleaning transcripts, extracting features with R packages, and then melding qualitative coding with quantitative audio measures to arrive at richer insights.

10.8 Conclusion

Audio data in social science research offers a profound opportunity to enrich analysis by re-incorporating the aural dimension of human communication. Structuring audio into text (via transcription) and quantitative features allows researchers to systematically examine aspects of speech that have traditionally been left to intuition or anecdote. As we have shown, modern tools in R (e.g., tuneR, seewave, googleLanguageR) and Python (e.g., Whisper, librosa), combined through the reticulate interface, make it feasible even for non-engineers to perform advanced audio processing and analysis within a reproducible workflow.

By following a clear process – record high-quality audio, transcribe (with the help of ASR to save time), clean and organize transcripts, extract features capturing voice characteristics, and then integrate those with qualitative analysis – researchers can add a new layer of evidence to their studies. This integrated approach yields insights that are both qualitative (rich, contextual, narrative) and quantitative (measurable, comparable, statistically analyzable). In our examples, we saw how vocal cues corroborated participants’ described emotions, how speaker interaction patterns could be quantified in a group discussion, and how topics might be spoken about differently in tone. These findings would be difficult or impossible to obtain from transcripts alone, affirming the sentiment that by “adhering to only the transcript, researchers miss a layer of data”.

Importantly, working with audio need not undermine the human-centered spirit of qualitative research – instead, it augments it. The goal is not to reduce complex narratives to numbers, but to use numbers to illuminate new facets of the narratives. As Lucas and Knox (2021) demonstrated with their MASS model in political speech, incorporating vocal tone can “open the door to new research questions” and even challenge existing theories built only on text. For qualitative researchers, similar potential awaits: one can ask novel questions like “How do people sound when they recount empowering versus disempowering experiences?” or “Do interviewees who engage more energetically (as measured by voice) provide more detail in their responses?”, and actually have data to explore them.

Looking forward, as speech technology continues to advance (e.g., real-time transcription, emotion detection algorithms, multilingual models), the integration of audio in social science research is likely to become more common and more seamless. We may see qualitative analysis software integrating acoustic analysis modules, or the use of voice data in computational social science at scale (e.g., analyzing thousands of hours of legislative speech for sentiment and persuasion tactics). Social scientists should stay attuned to these developments, as they offer powerful extensions to our methodological toolkit.

In conclusion, structuring audio data into usable formats bridges the gap between listening and analyzing. It empowers researchers to systematically “listen” to their data, ensuring that the voices of participants are not only heard but also measured and represented in the analysis. By combining R and Python, we can achieve this with flexibility and precision, leveraging the strengths of both ecosystems. The end result is a more holistic understanding of qualitative material – one that honors both the content of speech and its music, providing insight into the human experience that is both intellectually rigorous and true to the lived reality captured on tape.

References

Barnwell, A. (2025). Listening to interviews: Attending to aurality, emotions, and atmospheres in qualitative analysis. Forum: Qualitative Social Research, 26(1).

Bokhove, C., & Downey, C. (2018). Automated generation of “good enough” transcripts as a first step to transcription of audio‑recorded data. Methodological Innovations, 11(2), 1–14.

Cottingham, M. D., & Erickson, R. J. (2020). Capturing emotion with audio diaries. Qualitative Research, 20(5), 549–564.

Fasoli, F., & Maass, A. (2020). The social costs of sounding gay: Voice‑based impressions of adoption applicants. Journal of Language and Social Psychology, 39(1), 112–131.

Iwarsson, J., Næss, J., & Hollen, R. (2023). Measuring speaking rate: How do objective measurements correlate with audio‑perceptual ratings? Logopedics Phoniatrics Vocology, 48(2), 57–66.

Lucas, C., & Knox, D. (2021). A dynamic model of speech for the social sciences. American Political Science Review, 115(2), 649–666.

OpenAI. (2022, September 21). Introducing Whisper [Blog post]. OpenAI. https://openai.com/research/whisper

Sueur, J. (2018). Sound analysis and synthesis with R. Springer.