Learning from Unstructured Data

The first three parts of this book showed how quantitative insights emerge when variables arrive neatly as columns in a spreadsheet. Yet most information generated by—and about—human activity is unstructured: free‑form text, images and video, audio streams, social graphs, sensor traces. Part IV equips you to transform these high‑dimensional, highly correlated data into structured representations that the modelling techniques of Parts II and III can exploit—or, when necessary, into direct inputs for specialised deep‑learning architectures.

We begin with natural‑language processing (NLP). Chapters on text preprocessing, tokenisation, and word embeddings (word2vec, GloVe, fastText) set the stage for classical bag‑of‑words models, topic modelling with Latent Dirichlet Allocation, and modern transformer‑based encoders such as BERT. You will learn to fine‑tune large language models for downstream tasks—sentiment analysis, stance detection, entity recognition—while addressing the domain shift and data sparsity that often plague social‑science corpora.
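To make the idea of "text becoming columns" concrete before the full chapters, here is a minimal bag‑of‑words sketch in plain Python. The two example sentences and helper names are illustrative, not drawn from the book's labs; production pipelines would use a library such as spaCy for tokenisation.

```python
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; real tokenisers also handle
    # punctuation, lemmatisation, and stop words.
    return text.lower().replace(".", "").split()

def bag_of_words(docs):
    # Build a shared vocabulary, then map each document to a
    # fixed-length count vector over that vocabulary.
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([counts.get(tok, 0) for tok in vocab])
    return vocab, vectors

docs = ["The senate passed the bill.", "The bill failed in the senate."]
vocab, vectors = bag_of_words(docs)
# vocab   -> ['bill', 'failed', 'in', 'passed', 'senate', 'the']
# vectors -> [[1, 0, 0, 1, 1, 2], [1, 1, 1, 0, 1, 2]]
```

Every document, whatever its length, becomes a vector of the same dimension—exactly the tabular shape the models of Parts II and III expect.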

Next we turn to computer vision, starting with the fundamentals of digital imagery, convolutional filters, and pooling operations. Building upward through the ResNet and EfficientNet families, we demonstrate transfer‑learning workflows that make state‑of‑the‑art vision models accessible on modest GPU budgets. Case studies include socioeconomic inference from satellite imagery and frame‑level coding of protest videos. We emphasise annotation strategies, data‑augmentation pipelines, and interpretability tools such as Grad‑CAM.
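The convolution and pooling operations mentioned above can be sketched in a few lines of plain Python. The tiny image and edge‑detecting kernel below are illustrative toys (real layers operate on tensors via torch or keras), but the arithmetic is the same.

```python
def conv2d(image, kernel):
    # Valid (no-padding) 2-D cross-correlation of a grayscale image
    # with a small filter, as in a CNN's convolutional layer.
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]

def max_pool(fmap, size=2):
    # Non-overlapping max pooling: keep the strongest activation
    # in each size x size window, shrinking spatial resolution.
    return [[max(fmap[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

# A vertical-edge detector applied to a 4x4 image whose right half
# is bright: the filter fires exactly at the dark-to-bright boundary.
image = [[0, 0, 9, 9]] * 4
edge_kernel = [[-1, 1], [-1, 1]]
fmap = conv2d(image, edge_kernel)   # each row: [0, 18, 0]
pooled = max_pool(fmap)
```

Stacks of such filters, learned rather than hand‑designed, are what ResNet‑style networks compose into hierarchical visual features.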

A dedicated chapter tackles time‑series and sequential signals—audio clips, IoT sensor feeds, longitudinal behavioural traces—introducing recurrence (LSTM/GRU), dilated convolutions, and the attention mechanisms that power modern sequence models. You will experiment with tasks ranging from speech emotion recognition to anomaly detection in electricity‑consumption logs, and see how sequence‑to‑sequence architectures underpin translation and summarisation systems.
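The attention mechanism that powers these sequence models reduces to a small computation: score a query against every key, normalise the scores, and average the values accordingly. This single‑query, pure‑Python sketch uses made‑up two‑dimensional vectors for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    # Scaled dot-product attention for one query: score each key,
    # normalise with softmax, return the weighted mean of values.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The query aligns with the first key, so the output leans toward
# the first value vector.
out = attention([1.0, 0.0],
                keys=[[1.0, 0.0], [0.0, 1.0]],
                values=[[10.0, 0.0], [0.0, 10.0]])
```

In a transformer, this computation runs in parallel for every position in the sequence, letting each time step attend directly to any other—no recurrence required.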

Because unstructured datasets often lack explicit labels, we devote substantial space to unsupervised and self‑supervised representation learning. Techniques such as contrastive learning, autoencoders, masked‑language objectives, and graph neural networks enable you to mine structure without exhaustive human annotation. We connect these ideas to transfer learning and to the dimensionality‑reduction methods (PCA, t‑SNE, UMAP) that aid exploratory analysis.
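As a taste of the dimensionality‑reduction chapter, the sketch below finds the first principal component of a tiny made‑up dataset using power iteration on the covariance matrix—the core computation behind PCA, stripped of library machinery.

```python
import math

def first_principal_component(data, iters=200):
    # Centre the data, then power-iterate on the covariance matrix
    # to find the direction of maximum variance.
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1)
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Points scattered along y = x: the leading component should point
# along (1, 1) / sqrt(2), up to sign.
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]]
pc = first_principal_component(data)
```

t‑SNE and UMAP pursue the same goal—low‑dimensional structure—but optimise for preserving local neighbourhoods rather than global variance, which is why they dominate exploratory visualisation of embeddings.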

Throughout Part IV we weave four cross‑cutting themes:

  1. Vectorisation as a bridge – how raw artefacts become fixed‑length numeric embeddings;
  2. Scalability and storage – best practices for sharding, streaming, and hardware acceleration when datasets run into terabytes;
  3. Evaluation beyond accuracy – BLEU, ROUGE, perception‑based metrics, retrieval precision, and human‑in‑the‑loop validation;
  4. Ethical and legal considerations – privacy in text corpora, bias in face recognition, and the environmental cost of large‑scale pre‑training.
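To ground theme 3, here is the kernel of BLEU‑1—clipped unigram precision—in a few lines. The candidate/reference sentences are the standard illustrative pair, not book data, and full BLEU additionally combines higher‑order n‑grams with a brevity penalty.

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    # Fraction of candidate tokens that also appear in the reference,
    # with each reference token usable at most as many times as it
    # occurs there ("clipping" prevents gaming by repetition).
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, ref[tok]) for tok, count in cand.items())
    return overlap / sum(cand.values())

score = clipped_unigram_precision(
    "the cat sat on the mat",
    "the cat is on the mat")
# 5 of 6 candidate tokens match under clipping -> 5/6
```

ROUGE inverts the perspective—recall against the reference—and neither substitutes for human‑in‑the‑loop validation when outputs feed substantive research claims.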

Every chapter combines conceptual exposition with runnable R/Python notebooks that leverage libraries such as spaCy, transformers, torch, and keras. By completing the step‑by‑step labs—sentiment analysis of political speeches, poverty mapping from night‑lights, audio‑based stress detection—you will gain the practical fluency needed to incorporate unstructured evidence into rigorous data‑science workflows.

Armed with these skills, you can extend the predictive and inferential techniques of earlier parts to the messy, multimodal data that increasingly define contemporary research and industry practice.