[R Course] PDF Text Extraction


Automatically extract tables and text from PDFs.

Thierry Warin (https://warin.ca/aboutme.html), HEC Montréal and CIRANO (Canada) (https://www.hec.ca/en/profs/thierry.warin.html)
03-24-2020

Course objectives

This course shows you how to automatically extract tables and text from PDFs in R. It presents two packages for the task: pdftools and tabulizer.
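As a first taste of pdftools, here is a minimal sketch of text extraction. It assumes the package is installed from CRAN. The sample PDF is generated on the fly with base R's pdf() device so that the snippet is self-contained; the file and its contents are illustrative only and are not part of the course material.

```r
# Minimal sketch: extract text and basic metadata with pdftools.
# Assumes install.packages("pdftools") has already been run.
library(pdftools)

# Create a small one-page PDF to work with (illustrative sample only).
sample_pdf <- tempfile(fileext = ".pdf")
pdf(sample_pdf)
plot.new()
text(0.5, 0.5, "Hello PDF")
dev.off()

# pdf_text() returns a character vector with one element per page.
pages <- pdf_text(sample_pdf)
cat(pages[1])

# pdf_info() returns a list of document metadata (page count, PDF version, ...).
info <- pdf_info(sample_pdf)
info$pages
```

In practice you would replace sample_pdf with the path or URL of your own document.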

Course plan

1. pdftools: Getting started

Extracting text from a PDF.

2. pdftools: Utilities

Extracting the table of contents, PDF author, version and PDF fonts.

3. pdftools: Tables

Extracting tables from a PDF.

4. pdftools: Scanned text

5. tabulizer: Getting started

Extracting text from a PDF.

6. tabulizer: Utilities

Splitting up a PDF by its pages.

Merging a collection of PDFs.

Getting the number of pages in a PDF.

Getting metadata associated with a PDF.

7. tabulizer: Tables

Extracting tables from a PDF.
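The tabulizer topics in the plan can be sketched as follows. This is a hedged outline rather than the course's own code: it assumes tabulizer and its Java dependency (rJava) are installed, and it uses the example PDF that ships with the package, as in the package README. Output locations under tempdir() are arbitrary choices for illustration.

```r
# Minimal sketch of the tabulizer workflow (requires Java and rJava).
# Installation of tabulizer is assumed.
library(tabulizer)

# Example PDF bundled with the package (assumed present, as in its README).
pdf_file <- system.file("examples", "data.pdf", package = "tabulizer")

n_pages <- get_n_pages(pdf_file)     # number of pages in the PDF
meta <- extract_metadata(pdf_file)   # list of document metadata
txt <- extract_text(pdf_file)        # document text as a character string
tables <- extract_tables(pdf_file)   # list with one element per detected table

# Split the PDF into one file per page, then merge the pages back together.
page_files <- split_pdf(pdf_file, outdir = tempdir())
merged <- merge_pdfs(page_files, outfile = file.path(tempdir(), "merged.pdf"))
```

Each element of tables is a matrix by default, which can be converted with as.data.frame() for further cleaning.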



Citation

For attribution, please cite this work as

Warin (2020, March 24). Thierry Warin, PhD: [R Course] PDF Text Extraction. Retrieved from https://warin.ca/posts/rcourse-pdf-text-extraction/

BibTeX citation

@misc{warin2020rcourse,
  author = {Warin, Thierry},
  title = {Thierry Warin, PhD: [R Course] PDF Text Extraction},
  url = {https://warin.ca/posts/rcourse-pdf-text-extraction/},
  year = {2020}
}