[R Course] PDF Text Extraction

R Courses

Automatically extract tables and text from PDFs.

Published

March 24, 2020

DOI

10.6084/m9.figshare.11744013.v2

Course objectives

This course shows you how to automatically extract tables and text from PDFs. It presents two R packages for the task: pdftools and tabulizer.

Course plan

1. pdftools: Getting started

Extracting text from a PDF.

2. pdftools: Utilities

Extracting a PDF's table of contents, author, version, and fonts.

3. pdftools: Tables

Extracting tables from a PDF.

4. pdftools: Scanned text

5. tabulizer: Getting started

Extracting text from a PDF.

6. tabulizer: Utilities

Splitting up a PDF by its pages.

Merging a collection of PDFs.

Getting the number of pages in a PDF.

Getting metadata associated with a PDF.

7. tabulizer: Tables

Extracting tables from a PDF.
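
To give a feel for the tasks listed in the plan, here is a minimal sketch in R. It is not taken from the course material. It assumes pdftools is installed, and it only runs the tabulizer calls if that package (which needs Java through rJava) is available. The one-page PDF and the variable names are made up for illustration.

```r
# Minimal sketch; assumes the pdftools package is installed.
library(pdftools)

# Create a small one-page PDF so the example is self-contained.
# (Hypothetical sample file; replace pdf_path with your own PDF.)
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path)
plot.new()
text(0.5, 0.5, "Hello PDF")
dev.off()

# pdftools: text extraction returns one character string per page.
pages_text <- pdf_text(pdf_path)

# pdftools: document-level utilities.
info  <- pdf_info(pdf_path)   # list with, among others, $pages, $version, $keys
toc   <- pdf_toc(pdf_path)    # table of contents as a nested list
fonts <- pdf_fonts(pdf_path)  # data frame describing the fonts in the file

# tabulizer depends on Java, so only call it when the package is available.
if (requireNamespace("tabulizer", quietly = TRUE)) {
  n_pages    <- tabulizer::get_n_pages(pdf_path)       # number of pages
  metadata   <- tabulizer::extract_metadata(pdf_path)  # metadata as a list
  text_all   <- tabulizer::extract_text(pdf_path)      # text of the document
  tables     <- tabulizer::extract_tables(pdf_path)    # list of detected tables (may be empty)
  page_files <- tabulizer::split_pdf(pdf_path, outdir = tempdir())  # one file per page
  merged     <- tabulizer::merge_pdfs(page_files,
                                      outfile = tempfile(fileext = ".pdf"))
}
```

For scanned documents, pdftools also provides `pdf_ocr_text()`, which relies on the tesseract package being installed. It is left out of the sketch because the sample file above already contains real text.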






Footnotes

    Citation

    For attribution, please cite this work as

    Warin (2020, March 24). Thierry Warin, PhD: [R Course] PDF Text Extraction. Retrieved from https://warin.ca/posts/rcourse-pdf-text-extraction/

    BibTeX citation

    @misc{warin2020r,
      author = {Warin, Thierry},
      title = {Thierry Warin, PhD: [R Course] PDF Text Extraction},
      url = {https://warin.ca/posts/rcourse-pdf-text-extraction/},
      year = {2020}
    }