[R Course] PDF Text Extraction

Automatically pull out tables and text from PDFs.

Thierry Warin (https://warin.ca/aboutme.html), HEC Montréal and CIRANO, Canada (https://www.hec.ca/en/profs/thierry.warin.html)

Course objectives

This course shows you how to pull tables and text out of PDFs automatically. It presents two R packages for the task: pdftools and tabulizer.

Course plan

1. pdftools: Getting started

Extracting text from PDF.

2. pdftools: Utilities

Extracting the table of contents, PDF author, version and PDF fonts.

3. pdftools: Tables

Extracting tables from PDF.

4. pdftools: Scanned text

5. tabulizer: Getting started

Extracting text from PDF.

6. tabulizer: Utilities

Splitting up a PDF by its pages.

Merging a collection of PDFs.

Getting the number of pages in a PDF.

Getting metadata associated with a PDF.

7. tabulizer: Tables

Extracting tables from PDF.
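
As a rough preview of the plan above, the sketch below touches several of the listed steps on a throwaway PDF. It is an illustration under stated assumptions, not the course material itself: the sample file, its text, and the variable names are made up for this example, and the whitespace-splitting approach to tables is only one simple way to turn extracted lines into fields. The tabulizer calls run only if that package (which depends on rJava) is installed. OCR of scanned text is not covered here.

```r
# Minimal sketch: build a small PDF with base R, then read it back.
library(pdftools)

# Hypothetical one-page sample PDF so the example is self-contained.
sample_pdf <- tempfile(fileext = ".pdf")
pdf(sample_pdf, title = "Sample report")
plot.new()
text(0.5, 0.6, "Quarterly sales report")
text(0.5, 0.4, "North 120")
dev.off()

# pdftools, getting started: pdf_text() returns one string per page.
pages <- pdf_text(sample_pdf)
page_lines <- strsplit(pages[1], "\n")[[1]]

# pdftools, utilities: metadata, fonts, and table of contents.
info  <- pdf_info(sample_pdf)   # list: pages, version, keys (Title, ...), ...
fonts <- pdf_fonts(sample_pdf)  # data frame describing the fonts
toc   <- pdf_toc(sample_pdf)    # bookmarks, if the PDF has any

# pdftools, tables: split a text line into fields on runs of whitespace.
data_line <- grep("North", page_lines, value = TRUE)[1]
fields <- strsplit(trimws(data_line), "\\s+")[[1]]

# tabulizer: only attempted when the package can be loaded.
if (requireNamespace("tabulizer", quietly = TRUE)) {
  tab_text <- tabulizer::extract_text(sample_pdf)      # text
  n_pages  <- tabulizer::get_n_pages(sample_pdf)       # page count
  meta     <- tabulizer::extract_metadata(sample_pdf)  # metadata
  parts    <- tabulizer::split_pdf(sample_pdf, outdir = tempdir())
  merged   <- tabulizer::merge_pdfs(parts, outfile = tempfile(fileext = ".pdf"))
  tables   <- tabulizer::extract_tables(sample_pdf)    # list of tables
}
```

On real documents, replace sample_pdf with the path to your own file; table extraction usually needs more care than a whitespace split, which is where extract_tables() comes in.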



For attribution, please cite this work as

Warin (2020, March 24). Thierry Warin: [R Course] PDF Text Extraction. Retrieved from https://warin.ca/posts/pdf-text-extraction/

BibTeX citation

  author = {Warin, Thierry},
  title = {Thierry Warin: [R Course] PDF Text Extraction},
  url = {https://warin.ca/posts/pdf-text-extraction/},
  year = {2020}