pdftools: Getting started

The pdftools manual page gives a brief overview of the main utilities. The most important function is pdf_text(), which returns a character vector with one element per page of the PDF. Each string in the vector contains a plain-text version of the text on that page.

library(pdftools)

download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
txt <- pdf_text("1403.2805.pdf")

# first page text
cat(txt[1])
##                                               The jsonlite Package: A Practical and Consistent Mapping
##                                                                    Between JSON Data and R Objects
##                                                                                     Jeroen Ooms
## arXiv:1403.2805v1 [stat.CO] 12 Mar 2014
##                                                                               UCLA Department of Statistics
##                                                                                              Abstract
##                                                   A naive realization of JSON data in R maps JSON arrays to an unnamed list, and JSON objects to a
##                                                named list. However, in practice a list is an awkward, inefficient type to store and manipulate data.
##                                                Most statistical applications work with (homogeneous) vectors, matrices or data frames. Therefore JSON
##                                                packages in R typically define certain special cases of JSON structures which map to simpler R types.
##                                                Currently there exist no formal guidelines, or even consensus between implementations on how R data
##                                                should be represented in JSON. Furthermore, upon closer inspection, even the most basic data structures
##                                                in R actually do not perfectly map to their JSON counterparts and leave some ambiguity for edge cases.
##                                                These problems have resulted in different behavior between implementations and can lead to unexpected
##                                                output. This paper explicitly describes a mapping between R classes and JSON data, highlights potential
##                                                problems, and proposes conventions that generalize the mapping to cover all common structures. We
##                                                emphasize the importance of type consistency when using JSON to exchange dynamic data, and illustrate
##                                                using examples and anecdotes. The jsonlite R package is used throughout the paper as a reference
##                                                implementation.
##                                           1    Introduction
##                                           JavaScript Object Notation (JSON) is a text format for the serialization of structured data (Crockford, 2006a).
##                                           It is derived from the object literals of JavaScript, as defined in the ECMAScript Programming Language
##                                           Standard, Third Edition (ECMA, 1999). Design of JSON is simple and concise in comparison with other
##                                           text based formats, and it was originally proposed by Douglas Crockford as a “fat-free alternative to XML”
##                                           (Crockford, 2006b). The syntax is easy for humans to read and write, easy for machines to parse and generate
##                                           and completely described in a single page at http://www.json.org. The character encoding of JSON text
##                                           is always Unicode, using UTF-8 by default (Crockford, 2006a), making it naturally compatible with non-
##                                           latin alphabets. Over the past years, JSON has become hugely popular on the internet as a general purpose
##                                           data interchange format. High quality parsing libraries are available for almost any programming language,
##                                           making it easy to implement systems and applications that exchange data over the network using JSON. For
##                                           R (R Core Team, 2013), several packages that assist the user in generating, parsing and validating JSON
##                                           are available through CRAN, including rjson (Couture-Beil, 2013), RJSONIO (Lang, 2013), and jsonlite
##                                           (Ooms et al., 2014).
##                                           The emphasis of this paper is not on discussing the JSON format or any particular implementation for using
##                                                                                                  1
# second page text
cat(txt[2])
## JSON with R. We refer to Nolan and Temple Lang (2014) for a comprehensive introduction, or one of the
## many tutorials available on the web. Instead we take a high level view and discuss how R data structures are
## most naturally represented in JSON. This is not a trivial problem, particulary for complex or relational data
## as they frequently appear in statistical applications. Several R packages implement toJSON and fromJSON
## functions which directly convert R objects into JSON and vice versa. However, the exact mapping between
## the various R data classes JSON structures is not self evident. Currently, there are no formal guidelines,
## or even consensus between implementations on how R data should be represented in JSON. Furthermore,
## upon closer inspection, even the most basic data structures in R actually do not perfectly map to their
## JSON counterparts, and leave some ambiguity for edge cases. These problems have resulted in different
## behavior between implementations, and can lead to unexpected output for certain special cases. To further
## complicate things, best practices of representing data in JSON have been established outside the R community.
## Incorporating these conventions where possible is important to maximize interoperability.
## 1.1     Parsing and type safety
## The JSON format specifies 4 primitive types (string, number, boolean, null) and two universal structures:
##     • A JSON object : an unordered collection of zero or more name/value pairs, where a name is a string and
##        a value is a string, number, boolean, null, object, or array.
##     • A JSON array: an ordered sequence of zero or more values.
## Both these structures are heterogeneous; i.e. they are allowed to contain elements of different types. There-
## fore, the native R realization of these structures is a named list for JSON objects, and unnamed list for
## JSON arrays. However, in practice a list is an awkward, inefficient type to store and manipulate data in R.
## Most statistical applications work with (homogeneous) vectors, matrices or data frames. In order to give
## these data structures a JSON representation, we can define certain special cases of JSON structures which get
## parsed into other, more specific R types. For example, one convention which all current implementations
## have in common is that a homogeneous array of primitives gets parsed into an atomic vector instead of a
## list. The RJSONIO documentation uses the term “simplify” for this, and we adopt this jargon.
## txt <- "[12, 3, 7]"
## x <- fromJSON(txt)
## is(x)
## [1] "numeric" "vector"
## print(x)
## [1] 12     3   7
## This seems very reasonable and it is the only practical solution to represent vectors in JSON. However the
## price we pay is that automatic simplification can compromise type-safety in the context of dynamic data.
## For example, suppose an R package uses fromJSON to pull data from a JSON API on the web, similar to
## the example above. However, for some particular combination of parameters, the result includes a null
## value, e.g: [12, null, 7]. This is actually quite common, many APIs use null for missing values or unset
## fields. This case makes the behavior of parsers ambiguous, because the JSON array is technically no longer
##                                                         2
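Because each element returned by pdf_text() is a whole page in a single string, a common next step is to break a page into lines. The sketch below is only an illustration: page_text is a short made-up stand-in for one page (not the real contents of the paper), and everything is base R so it runs without downloading anything.

```r
# Hypothetical stand-in for one element of the vector returned by pdf_text()
page_text <- "      The jsonlite Package\n\n   Jeroen Ooms\n        1"

# Split the page on newlines, trim the layout padding, and drop empty lines
page_lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
page_lines <- trimws(page_lines)
page_lines <- page_lines[nzchar(page_lines)]

print(page_lines)
```

With a real document, page_text would be replaced by an element such as txt[1].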

pdftools: Utilities

In addition, the pdftools package has some utilities to extract other data from the PDF file.

How to extract the table of contents

The pdf_toc function shows the table of contents, i.e. the section headers which pdf readers usually display in a menu on the left. It looks pretty in JSON:

# Table of contents
toc <- pdf_toc("1403.2805.pdf")

# Show as JSON
jsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)
## {
##   "title": "",
##   "children": [
##     {
##       "title": "1 Introduction",
##       "children": [
##         {
##           "title": "1.1 Parsing and type safety",
##           "children": []
##         },
##         {
##           "title": "1.2 Reference implementation: the jsonlite package",
##           "children": []
##         },
##         {
##           "title": "1.3 Class-based versus type-based encoding",
##           "children": []
##         },
##         {
##           "title": "1.4 Scope and limitations",
##           "children": []
##         }
##       ]
##     },
##     {
##       "title": "2 Converting between JSON and R classes",
##       "children": [
##         {
##           "title": "2.1 Atomic vectors",
##           "children": [
##             {
##               "title": "2.1.1 Missing values",
##               "children": []
##             },
##             {
##               "title": "2.1.2 Special vector types: dates, times, factor, complex",
##               "children": []
##             },
##             {
##               "title": "2.1.3 Special cases: vectors of length 0 or 1",
##               "children": []
##             }
##           ]
##         },
##         {
##           "title": "2.2 Matrices",
##           "children": [
##             {
##               "title": "2.2.1 Matrix row and column names",
##               "children": []
##             }
##           ]
##         },
##         {
##           "title": "2.3 Lists",
##           "children": [
##             {
##               "title": "2.3.1 Unnamed lists",
##               "children": []
##             },
##             {
##               "title": "2.3.2 Named lists",
##               "children": []
##             }
##           ]
##         },
##         {
##           "title": "2.4 Data frame",
##           "children": [
##             {
##               "title": "2.4.1 Column based versus row based tables",
##               "children": []
##             },
##             {
##               "title": "2.4.2 Row based data frame encoding",
##               "children": []
##             },
##             {
##               "title": "2.4.3 Missing values in data frames",
##               "children": []
##             },
##             {
##               "title": "2.4.4 Relational data: nested records",
##               "children": []
##             },
##             {
##               "title": "2.4.5 Relational data: nested tables",
##               "children": []
##             }
##           ]
##         }
##       ]
##     },
##     {
##       "title": "3 Structural consistency and type safety in dynamic data",
##       "children": [
##         {
##           "title": "3.1 Classes, types and data",
##           "children": []
##         },
##         {
##           "title": "3.2 Rule 1: Fixed keys",
##           "children": []
##         },
##         {
##           "title": "3.3 Rule 2: Consistent types",
##           "children": []
##         }
##       ]
##     },
##     {
##       "title": "Appendices",
##       "children": []
##     },
##     {
##       "title": "A Public JSON APIs",
##       "children": [
##         {
##           "title": "A.1 No authentication required",
##           "children": []
##         },
##         {
##           "title": "A.2 Free registration required",
##           "children": []
##         },
##         {
##           "title": "A.3 OAuth2 authentication",
##           "children": []
##         }
##       ]
##     },
##     {
##       "title": "B Simple JSON RPC with OpenCPU",
##       "children": []
##     }
##   ]
## }
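The JSON above mirrors a nested list in which every node has a title and a list of children. As a rough sketch of how such a structure could be flattened into an indented outline, the example below walks a small hand-built list of the same shape; the toc object here is a shortened, hypothetical stand-in, and toc_titles is a helper name made up for this illustration.

```r
# Hand-built list in the same shape as the JSON shown above
# (a hypothetical, shortened table of contents)
toc <- list(
  title = "",
  children = list(
    list(title = "1 Introduction", children = list(
      list(title = "1.1 Parsing and type safety", children = list())
    )),
    list(title = "Appendices", children = list())
  )
)

# Recursively collect titles, indenting each nesting level by two spaces.
# The root node (depth 0) has an empty title and is skipped.
toc_titles <- function(node, depth = 0) {
  indent <- strrep("  ", max(depth - 1, 0))
  own <- if (nzchar(node$title)) paste0(indent, node$title) else character(0)
  kids <- unlist(lapply(node$children, toc_titles, depth = depth + 1))
  c(own, kids)
}

outline <- toc_titles(toc)
cat(outline, sep = "\n")
```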

How to extract PDF author, version, etc.

Other functions provide information about fonts, attachments and metadata such as the author, creation date or tags.

# Author, version, etc
info <- pdf_info("1403.2805.pdf")
info
## $version
## [1] "1.4"
## 
## $pages
## [1] 29
## 
## $encrypted
## [1] FALSE
## 
## $linearized
## [1] FALSE
## 
## $keys
## $keys$Producer
## [1] "dvips + GPL Ghostscript GIT PRERELEASE 9.08"
## 
## $keys$Creator
## [1] "LaTeX with hyperref package"
## 
## $keys$Title
## [1] ""
## 
## $keys$Subject
## [1] ""
## 
## $keys$Author
## [1] ""
## 
## $keys$Keywords
## [1] ""
## 
## 
## $created
## [1] "2014-03-12 21:00:25 EDT"
## 
## $modified
## [1] "2014-03-12 21:00:25 EDT"
## 
## $metadata
## [1] "<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>\n<?adobe-xap-filters esc=\"CRLF\"?>\n<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 1.6'>\n<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>\n<rdf:Description rdf:about='uuid:68dde9e4-e267-11ee-0000-a5c788a95450' xmlns:pdf='http://ns.adobe.com/pdf/1.3/'><pdf:Producer>dvips + GPL Ghostscript GIT PRERELEASE 9.08</pdf:Producer>\n<pdf:Keywords>()</pdf:Keywords>\n</rdf:Description>\n<rdf:Description rdf:about='uuid:68dde9e4-e267-11ee-0000-a5c788a95450' xmlns:xmp='http://ns.adobe.com/xap/1.0/'><xmp:ModifyDate>2014-03-12T21:00:25-04:00</xmp:ModifyDate>\n<xmp:CreateDate>2014-03-12T21:00:25-04:00</xmp:CreateDate>\n<xmp:CreatorTool>LaTeX with hyperref package</xmp:CreatorTool></rdf:Description>\n<rdf:Description rdf:about='uuid:68dde9e4-e267-11ee-0000-a5c788a95450' xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' xapMM:DocumentID='uuid:68dde9e4-e267-11ee-0000-a5c788a95450'/>\n<rdf:Description rdf:about='uuid:68dde9e4-e267-11ee-0000-a5c788a95450' xmlns:dc='http://purl.org/dc/elements/1.1/' dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li xml:lang='x-default'>()</rdf:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>()</rdf:li></rdf:Seq></dc:creator><dc:description><rdf:Seq><rdf:li>()</rdf:li></rdf:Seq></dc:description></rdf:Description>\n</rdf:RDF>\n</x:xmpmeta>\n                                                                        \n                                                                        \n<?xpacket end='w'?>"
## 
## $locked
## [1] FALSE
## 
## $attachments
## [1] FALSE
## 
## $layout
## [1] "no_layout"
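Since the result is an ordinary named list, individual fields can be picked out with $. The sketch below uses a shortened, hand-built list in the same shape as the output above (so it runs without the PDF) and shows one way to keep only the document keys that are actually filled in.

```r
# Hypothetical, shortened list in the same shape as the pdf_info() output above
info <- list(
  version = "1.4",
  pages = 29L,
  keys = list(
    Producer = "dvips + GPL Ghostscript GIT PRERELEASE 9.08",
    Creator = "LaTeX with hyperref package",
    Title = "",
    Author = ""
  )
)

# Individual fields are reached with the usual $ accessor
n_pages <- info$pages
producer <- info$keys$Producer

# Keep only the document keys that are not empty strings
filled_keys <- Filter(nzchar, info$keys)
print(names(filled_keys))
```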

How to extract PDF fonts

# Table with fonts
fonts <- pdf_fonts("1403.2805.pdf")
fonts

Rendering pdf

A bonus feature on most platforms is rendering of PDF files to bitmap arrays. The poppler library provides all functionality to implement a complete PDF reader, including graphical display of the content. In R we can use pdf_render_page to render a page of the PDF into a bitmap, which can be stored as e.g. png or jpeg.

# renders pdf to bitmap array
bitmap <- pdf_render_page("1403.2805.pdf", page = 1)

# save bitmap image
png::writePNG(bitmap, "page.png")
jpeg::writeJPEG(bitmap, "page.jpeg")
webp::write_webp(bitmap, "page.webp")

This feature is still experimental and currently does not work on Windows.

pdftools: Tables

Data scientists are often interested in data from tables. Unfortunately, the PDF format is pretty dumb and does not have a notion of a table (unlike, for example, HTML). Tabular data in a PDF file is nothing more than strategically positioned lines and text, which makes it difficult to extract the raw data with pdftools.

With some creativity, it is still possible to use pdftools to parse tables from PDF documents. The subsections below show how.

About PDF textboxes

A pdf document may seem to contain paragraphs or tables in a viewer, but this is not actually true. PDF is a printing format: a page consists of a series of unrelated lines, bitmaps, and textboxes with a given size, position and content. Hence a table in a pdf file is really just a large unordered set of lines and words that are nicely visually positioned. This makes sense for printing, but makes extracting text or data from a pdf file extremely difficult. Because the pdf format has little semantic structure, the pdf_text() function in pdftools has to render the PDF to a text canvas, in order to create the sentences or paragraphs. It does so pretty well, but some users have asked for something more low level.

Low-level text extraction

This section uses an example PDF file from the rOpenSci tabulizer package. The file contains a few standard datasets that have been printed as PDF tables. First let’s try the pdf_text() function, which returns a character vector of length equal to the number of pages in the file.

library(pdftools)
pdf_file <- "https://github.com/ropensci/tabulizer/raw/master/inst/examples/data.pdf"
txt <- pdf_text(pdf_file)
cat(txt[1])
##                     mpg  cyl    disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6   160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6   160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4   108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6   258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8   360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6   225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8   360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4   146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4   140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6   167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6   167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8   275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8   275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8   275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8   472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8   460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8   440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4    78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4    75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4    71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4   120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8   318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8   304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8   350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8   400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4    79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4   120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4    95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8   351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6   145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8   301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4   121.0 109 4.11 2.780 18.60  1  1    4    2
##                              1

So pdf_text() converts all text on a page to one large string, which works pretty well. However, if you want to parse this text into a data frame (using e.g. read.table), you run into a problem: the first column contains spaces within the values. Therefore we can’t use whitespace as the column delimiter (as is the default in read.table). To write a proper PDF table extractor, we have to infer the column from the physical position of each textbox rather than rely on delimiting characters. The pdf_data() function provides exactly this. It returns a list with one data frame per page, each holding all textboxes on that page, including their width, height, and (x,y) position:

# All textboxes on page 1
test <- pdf_data(pdf_file)[[1]]
test

Converting this pdf data into the original data frame is left as an exercise for the reader!
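One possible way to approach that exercise is sketched below. It is not the pdftools authors' method, just an illustration of inferring columns from x positions. The x, y and text column names follow the pdf_data() output, but the coordinates are invented, and the sketch assumes that each cell starts at or to the right of its header, which real right-aligned tables may violate.

```r
# Made-up textboxes mimicking a small part of pdf_data() output:
# one header row and two data rows (coordinates are invented)
boxes <- data.frame(
  x = c(154, 174, 30, 60, 154, 174, 30, 154, 174),
  y = c(10, 10, 24, 24, 24, 24, 38, 38, 38),
  text = c("mpg", "cyl", "Mazda", "RX4", "21.0", "6", "Valiant", "18.1", "6"),
  stringsAsFactors = FALSE
)

# The top line holds the column headers; everything below is table body
header <- boxes[boxes$y == min(boxes$y), ]
header <- header[order(header$x), ]
body <- boxes[boxes$y > min(boxes$y), ]
body <- body[order(body$y, body$x), ]

# Assign each textbox to a column from its x position:
# 0 means "left of the first header", i.e. the row label
body$col <- findInterval(body$x, header$x)

# Paste together the words that share a row (y) and a column
grid <- tapply(body$text, list(body$y, body$col), paste, collapse = " ")

# First grid column becomes the row names, the rest the data columns
tbl <- data.frame(grid[, -1, drop = FALSE], row.names = unname(grid[, 1]),
                  stringsAsFactors = FALSE)
names(tbl) <- header$text
tbl[] <- lapply(tbl, type.convert, as.is = TRUE)
print(tbl)
```

Grouping on the exact y value is the simplest choice; with real documents, textboxes on the same visual line can differ by a pixel or two, so some rounding or tolerance would likely be needed.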

pdftools: Scanned text

If you want to extract text from scanned text present in a pdf, you’ll need to use OCR (optical character recognition). Please refer to the rOpenSci tesseract package that provides bindings to the Tesseract OCR engine. In particular read the section of its vignette about reading from PDF files using pdftools and tesseract.

tabulizer: Getting started

Let's load the package tabulizer and define a variable referencing an example PDF.

library(tabulizer)
 
site <- "http://www.sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf"

The PDFs you manipulate with this package don’t have to be located on your machine: you can use tabulizer to reference a PDF by a URL. For this first example, we use the sample PDF file at the URL stored in site above.

Scraping text from our sample PDF can be done using extract_text():

text <- extract_text(site)

# print text
cat(text)
## Tutoring to Enhance Science Skills
## Tutoring Two: Learning to Make Data Tables
## . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## Sample Data for Data Tables
## ����������� �������� �������
## NATIONAL PARTNERSHIP FOR QUALITY AFTERSCHOOL LEARNING
## www.sedl.org/afterschool/toolkits
## Use these data to create data tables following the Guidelines for Making a Data Table and 
## Checklist for a Data Table.
## Example 1: Pet Survey (GR 2–3)
## Ms. Hubert’s afterschool students took a survey of the 600 students at Morales Elementary 
## School. Students were asked to select their favorite pet from a list of eight animals. Here 
## are the results. 
## Lizard 25, Dog 250, Cat 115, Bird 50, Guinea pig 30, Hamster 45, Fish 75, 
## Ferret 10 
## Example 2: Electromagnets—Increasing Coils (GR 3–5)
## The following data were collected using an electromagnet with a 1.5 volt battery, a switch, 
## a piece of #20 insulated wire, and a nail. Three trials were run. Safety precautions in 
## repeating this experiment include using safety goggles or safety spectacles and avoiding 
## short circuits.  
##   Number of Coils         Number of Paperclips
##  5 3, 5, 4
##  10        7, 8, 6
##  15  11, 10, 12
##  20  15, 13, 14
##     
## Example 3: pH of Substances (GR 5–10)
## The following are pH values of common household substances taken by three different 
## teams using pH probes. Safety precautions in repeating this experiment include hooded 
## ventilation, chemical-splash safety goggles, gloves, and apron. Do not use bleach, 
## ammonia, or strong acids with children.
## Lemon juice 2.4, 2.0, 2.2; Baking soda (1 Tbsp) in Water (1 cup) 8.4, 8.3, 8.7; 
## Orange juice 3.5, 4.0, 3.4; Battery acid 1.0, 0.7, 0.5; Apples 3.0, 3.2, 3.5; 
## Tomatoes 4.5, 4.2, 4.0; Bottled water 6.7, 7.0, 7.2; Milk of magnesia 10.5, 10.3, 
## 10.6; Liquid hand soap 9.0, 10.0, 9.5; Vinegar 2.2, 2.9, 3.0; Household bleach 
## 12.5, 12.5, 12.7; Milk 6.6, 6.5, 6.4; Household ammonia 11.5, 11.0, 11.5;
## Lye 13.0, 13.5, 13.4; and Sodium hydroxide 14.0, 14.0, 13.9; Anti-freeze 10.1, 
## 10.9, 9.7; Windex 9.9. 10.2, 9.5; Liquid detergent 10.5, 10.0, 10.3; and 
## Cola 3.0, 2.5, 3.2
## Teaching tip: The pH scale is from 0 to 14. Have students make two data tables, one 
## with the data as given and one with the pH scale 0 to 14 with the substances’ average 
## pH in rank order on the scale (Battery acid at the lower end and Sodium hydroxide at 
## the upper end) or create a pH graphic organizer.
## 1
## © 2006 WGBH Educational Foundation. All rights reserved.
## Example 4: Automobile Land Speed Records (GR 5-10)
## In the first recorded automobile race in 1898, Count Gaston de Chasseloup-Laubat of 
## Paris, France, drove 1 kilometer in 57 seconds for an average speed of 39.2 miles per hour 
## (mph) or 63.1 kilometers per hour (kph). In 1904, Henry Ford drove his Ford Arrow across 
## frozen Lake St. Clair, MI, at an average speed of 91.4 mph. Now, the North American 
## Eagle is trying to break a land speed record of 800 mph. The Federation International de 
## L’Automobile (FIA), the world’s governing body for motor sport and land speed records, 
## recorded the following land speed records. (Retrieved on February 5, 2006, from 
## http://www.landspeed.com/lsrinfo.asp.)
## Speed (mph)
## 407.447
## 413.199
## 434.22
## 468.719
## 526.277
## 536.712
## 555.127
## 576.553
## 600.601
## 622.407
## 633.468
## 763.035
## Driver
## Craig Breedlove
## Tom Green 
## Art Arfons
## Craig Breedlove
## Craig Breedlove
## Art Arfons
## Craig Breedlove
## Art Arfons
## Craig Breedlove
## Gary Gabelich
## Richard Noble 
## Andy Green
## Car
## Spirit of America 
## Wingfoot Express 
## Green Monster 
## Spirit of America
## Spirit of America
## Green Monster 
## Spirit of America, Sonic 1 
## Green Monster 
## Spirit of America, Sonic 1
## Blue Flame 
## Thrust 2 
## Thrust SSC
## Engine
## GE J47
## WE J46  
## GE J79 
## GE J79 
## GE J79 
## GE J79  
## GE J79 
## GE J79 
## GE J79 
## Rocket 
## RR RG 146 
## RR Spey
## Date
## 8/5/63
## 10/2/64
## 10/5/64
## 10/13/64
## 10/15/65
## 10/27/65
## 11/2/65 
## 11/7/65 
## 11/15/65 
## 10/23/70  
## 10/4/83  
## 10/15/97
## Example 5: Distance and Time (GR 8-10)
## The following data were collected using a car with a water clock set to release a drop in 
## a unit of time and a meter stick. The car rolled down an inclined plane. Three trials were 
## run. Create a data table with an average distance column and an average velocity column, 
## create an average distance-time graph, and draw the best-fit line or curve. Estimate the 
## car’s distance traveled and velocity at six drops of water. Describe the motion of the car. Is 
## it going at a constant speed, accelerating, or decelerating? How do you know?
##    Time (drops of water)           Distance (cm)
##  1  10,11,9
##  2  29, 31, 30
##  3  59, 58, 61
##  4  102, 100, 98
##  5  122, 125, 127   
##      
## 2

Note: This package only works if the PDF’s text is selectable (i.e. the text was typed rather than scanned). It won’t work for scanned-in PDFs or image files converted to PDFs.
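Once the text is extracted, ordinary string tools can turn parts of it into data. As a small sketch, the example below parses the pet survey sentence from the output above into a data frame; the string is copied by hand (with the line break removed) so the code runs without tabulizer, and the variable names are made up for this illustration.

```r
# Pet survey line copied from the extracted text above (line break removed)
pets <- "Lizard 25, Dog 250, Cat 115, Bird 50, Guinea pig 30, Hamster 45, Fish 75, Ferret 10"

# Split into "name count" items, then separate the trailing number from the name
items <- trimws(strsplit(pets, ",", fixed = TRUE)[[1]])
survey <- data.frame(
  pet = sub("[[:space:]]+[0-9]+$", "", items),
  count = as.integer(sub("^.*[[:space:]]+([0-9]+)$", "\\1", items)),
  stringsAsFactors = FALSE
)

print(survey)
sum(survey$count)
```

The counts add up to the 600 students mentioned in the text, which is a quick sanity check on the parsing.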

tabulizer: Utilities

How to split up a PDF by its pages

tabulizer can also create separate files for the pages in a PDF. This can be done using the split_pdf() function:

# split PDF referenced above
# output separate page files to current directory
split_pdf(site, getwd())

# or output to different directory
split_pdf(site, "C:/path/to/other/folder")

The first argument of split_pdf() is the filename or URL of your PDF; the second argument is the directory where you want the individual pages to be output.

How to merge a collection of PDFs

What if we want to reverse what we just did? We can use the merge_pdfs function, which takes as input a vector of file names and the name of the output file that will hold the merged result.

pdf_files <- list.files("C:/path/to/pdf/files", pattern = "\\.pdf$", full.names = TRUE)
merge_pdfs(pdf_files, "C:/path/to/merged_result.pdf")

How to get the number of pages in a PDF

Getting the number of pages in a PDF is made easy with the get_n_pages function, which you can call like this:

get_n_pages(site)
## [1] 2

How to get metadata associated with a PDF

You can get metadata associated with our PDF using extract_metadata:

extract_metadata(site)
## $pages
## [1] 2
## 
## $title
## [1] "Sample Data for Data Tables"
## 
## $author
## NULL
## 
## $subject
## NULL
## 
## $keywords
## NULL
## 
## $creator
## [1] "Adobe InDesign 2.0.2"
## 
## $producer
## [1] "Adobe PDF Library 5.0"
## 
## $created
## [1] "Tue Nov 08 10:20:02 EST 2005"
## 
## $modified
## [1] "Thu Jul 06 16:38:57 EDT 2006"
## 
## $trapped
## [1] "False"

This function returns a list containing information showing the number of pages, title, created / modified dates, and more.

tabulizer: Tables

The tabulizer package is dedicated to extracting tables from PDF, and includes interactive tools for selecting tables. However, tabulizer depends on rJava and therefore requires additional setup steps or may be impossible to use on systems where Java cannot be installed.

You can extract tables from this PDF using the aptly-named extract_tables() function, like this:

# default call with no parameters changed
matrix_results <- extract_tables(site)
matrix_results
## [[1]]
##      [,1]              [,2]                  
## [1,] "Number of Coils" "Number of Paperclips"
## [2,] "5"               "3, 5, 4"             
## [3,] "10"              "7, 8, 6"             
## [4,] "15"              "11, 10, 12"          
## [5,] "20"              "15, 13, 14"          
## 
## [[2]]
##       [,1]          [,2]              [,3]                         [,4]       
##  [1,] "Speed (mph)" "Driver"          "Car"                        "Engine"   
##  [2,] "407.447"     "Craig Breedlove" "Spirit of America"          "GE J47"   
##  [3,] "413.199"     "Tom Green"       "Wingfoot Express"           "WE J46"   
##  [4,] "434.22"      "Art Arfons"      "Green Monster"              "GE J79"   
##  [5,] "468.719"     "Craig Breedlove" "Spirit of America"          "GE J79"   
##  [6,] "526.277"     "Craig Breedlove" "Spirit of America"          "GE J79"   
##  [7,] "536.712"     "Art Arfons"      "Green Monster"              "GE J79"   
##  [8,] "555.127"     "Craig Breedlove" "Spirit of America, Sonic 1" "GE J79"   
##  [9,] "576.553"     "Art Arfons"      "Green Monster"              "GE J79"   
## [10,] "600.601"     "Craig Breedlove" "Spirit of America, Sonic 1" "GE J79"   
## [11,] "622.407"     "Gary Gabelich"   "Blue Flame"                 "Rocket"   
## [12,] "633.468"     "Richard Noble"   "Thrust 2"                   "RR RG 146"
## [13,] "763.035"     "Andy Green"      "Thrust SSC"                 "RR Spey"  
##       [,5]      
##  [1,] "Date"    
##  [2,] "8/5/63"  
##  [3,] "10/2/64" 
##  [4,] "10/5/64" 
##  [5,] "10/13/64"
##  [6,] "10/15/65"
##  [7,] "10/27/65"
##  [8,] "11/2/65" 
##  [9,] "11/7/65" 
## [10,] "11/15/65"
## [11,] "10/23/70"
## [12,] "10/4/83" 
## [13,] "10/15/97"
## 
## [[3]]
##      [,1]                    [,2]           
## [1,] "Time (drops of water)" "Distance (cm)"
## [2,] "1"                     "10,11,9"      
## [3,] "2"                     "29, 31, 30"   
## [4,] "3"                     "59, 58, 61"   
## [5,] "4"                     "102, 100, 98" 
## [6,] "5"                     "122, 125, 127"
# get back the tables as data frames, keeping their headers
df_results <- extract_tables(site, output = "data.frame", header = TRUE)
df_results
## [[1]]
##   Number.of.Coils Number.of.Paperclips
## 1               5              3, 5, 4
## 2              10              7, 8, 6
## 3              15           11, 10, 12
## 4              20           15, 13, 14
## 
## [[2]]
##    Speed..mph.          Driver                        Car    Engine     Date
## 1      407.447 Craig Breedlove          Spirit of America    GE J47   8/5/63
## 2      413.199       Tom Green           Wingfoot Express    WE J46  10/2/64
## 3      434.220      Art Arfons              Green Monster    GE J79  10/5/64
## 4      468.719 Craig Breedlove          Spirit of America    GE J79 10/13/64
## 5      526.277 Craig Breedlove          Spirit of America    GE J79 10/15/65
## 6      536.712      Art Arfons              Green Monster    GE J79 10/27/65
## 7      555.127 Craig Breedlove Spirit of America, Sonic 1    GE J79  11/2/65
## 8      576.553      Art Arfons              Green Monster    GE J79  11/7/65
## 9      600.601 Craig Breedlove Spirit of America, Sonic 1    GE J79 11/15/65
## 10     622.407   Gary Gabelich                 Blue Flame    Rocket 10/23/70
## 11     633.468   Richard Noble                   Thrust 2 RR RG 146  10/4/83
## 12     763.035      Andy Green                 Thrust SSC   RR Spey 10/15/97
## 
## [[3]]
##   Time..drops.of.water. Distance..cm.
## 1                     1       10,11,9
## 2                     2    29, 31, 30
## 3                     3    59, 58, 61
## 4                     4  102, 100, 98
## 5                     5 122, 125, 127

By default, this function returns a list of matrices, one per table, as in the first call above. However, as in the second call, we can set output = "data.frame" and header = TRUE to get back a list of data frames corresponding to the tables in the PDF.
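If you start from the matrix output instead, the header row can also be promoted by hand, which keeps the original column labels rather than the dotted names shown above. The sketch below uses a hand-built matrix in the same shape as the third table (shortened, so it runs without tabulizer); it is only one possible way to do this.

```r
# Hypothetical matrix in the same shape as the third table shown above
m <- matrix(
  c("Time (drops of water)", "1", "2",
    "Distance (cm)", "10,11,9", "29, 31, 30"),
  ncol = 2
)

# Use the first row as column names and the remaining rows as data
tbl <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(tbl) <- m[1, ]

# Convert columns that hold plain numbers; other columns stay character
tbl[] <- lapply(tbl, type.convert, as.is = TRUE)
str(tbl)
```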

Once we have the results back, we can refer to any individual PDF table like any data frame we normally would in R.

first_df <- df_results[[1]]
 
first_df$Number.of.Coils
## [1]  5 10 15 20
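Note that the second column of this table still holds several trial values per cell as text (for example "3, 5, 4"). A minimal sketch of splitting those cells and averaging the trials is shown below; first_df is rebuilt by hand from the output above so the code runs without tabulizer, and mean_paperclips is a column name chosen for this illustration.

```r
# First table rebuilt by hand from the output above
first_df <- data.frame(
  Number.of.Coils = c(5L, 10L, 15L, 20L),
  Number.of.Paperclips = c("3, 5, 4", "7, 8, 6", "11, 10, 12", "15, 13, 14"),
  stringsAsFactors = FALSE
)

# Split each cell on commas and convert the pieces to numbers
trials <- lapply(strsplit(first_df$Number.of.Paperclips, ",", fixed = TRUE),
                 function(x) as.numeric(trimws(x)))

# Average the three trials for every row
first_df$mean_paperclips <- vapply(trials, mean, numeric(1))
print(first_df)
```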

Acknowledgments

To cite this course:

Warin, Thierry. 2020. “Covid-19 Simulation: A Data Science Perspective.” doi:10.6084/m9.figshare.12020994.v1.