Description

Read Portable Document Format (PDF) Files.

Provides an interface to 'PDFMiner' <https://github.com/pdfminer/pdfminer.six>, a 'Python' package for extracting information from 'PDF' files. 'PDFMiner' aims to extract all information available in a 'PDF' file, including the position of each character, the font type, the font size and information about lines. This makes it a good starting point for extracting tables from 'PDF' files. More information can be found in the package 'README' file.

pdfminer

The R package pdfminer provides an interface to the low-level functionality of the Python package pdfminer.

Installation

Python

pip install pdfminer.six
pip install pandas

R

install.packages("pdfminer")

Basic usage

library("pdfminer")

args(read.pdf)
#R> function (file, pages = integer(), method = c("csv", "sqlite", 
#R>     "PythonInR"), laycntrl = layout_control(), encoding = "utf8", 
#R>     strip_control = FALSE, password = "", caching = TRUE, maxpages = Inf, 
#R>     rotation = 0L, image_dir = "", pyexe = "python3") 
file <- system.file("pdfs/cars.pdf", package = "pdfminer")
d <- read.pdf(file, method = "csv")
#R> A pdf document with 2 pages and
#R>   metainfo text line rect curve figure textline textbox textgroup image
#R> 1        2  469    0    0     0      0      155      10         8     0
#R> elements.

The function read.pdf() returns an object of class pdf_document, which is a list of data frames. Each object of class pdf_document contains the following elements:

  • "metainfo"
  • "text"
  • "line"
  • "rect"
  • "curve"
  • "figure"
  • "textline"
  • "textbox"
  • "textgroup"
  • "image"

The elements can be accessed like the elements of any other list.

head(d[["text"]])
#R>   pid block text         font size colorspace     color    x0      y0    x1      y1
#R> 1   1     1    s Courier-Bold   12 DeviceGray [0, 0, 0]  77.2 751.272  84.4 763.272
#R> 2   1     1    p Courier-Bold   12 DeviceGray [0, 0, 0]  84.4 751.272  91.6 763.272
#R> 3   1     1    e Courier-Bold   12 DeviceGray [0, 0, 0]  91.6 751.272  98.8 763.272
#R> 4   1     1    e Courier-Bold   12 DeviceGray [0, 0, 0]  98.8 751.272 106.0 763.272
#R> 5   1     1    d Courier-Bold   12 DeviceGray [0, 0, 0] 106.0 751.272 113.2 763.272
#R> 6   1    NA                     NA                         NA      NA    NA      NA

The R package pdfminer only returns the raw data extracted from the PDF file. To refine this raw data into a format usable for data analysis, the pdfmole package can be used.
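
As a rough illustration of what such a refinement step involves, the following sketch rebuilds text lines from per-character rows using base R only. It does not use pdfmole. The data frame chars is a stand-in for d[["text"]]: its first five rows are copied from the output above, while the remaining rows, the name chars and the rounding of y0 to one decimal are assumptions made for this example.

```r
# Toy data in the shape of d[["text"]] (only the columns needed here).
# Rows 1-5 come from the output above; rows 6-9 are invented for illustration.
chars <- data.frame(
  pid  = 1L,
  text = c("s", "p", "e", "e", "d", "d", "i", "s", "t"),
  x0   = c(77.2, 84.4, 91.6, 98.8, 106.0, 77.2, 84.4, 91.6, 98.8),
  y0   = c(rep(751.272, 5), rep(739.272, 4)),
  stringsAsFactors = FALSE
)

# Drop rows without a position (such as the NA row in the output above).
chars <- chars[!is.na(chars$x0) & !is.na(chars$y0), ]

# PDF coordinates grow upwards, so sort by page, then top to bottom (larger y0
# first), then left to right.
chars <- chars[order(chars$pid, -chars$y0, chars$x0), ]

# Characters on the same page that share a (rounded) y0 are treated as one line.
key <- paste(chars$pid, round(chars$y0, 1))
key <- factor(key, levels = unique(key))  # keep the sorted line order

text_lines <- as.vector(tapply(chars$text, key, paste, collapse = ""))
print(text_lines)
#R> [1] "speed" "dist"
```

Real documents usually need a tolerance that depends on the font size, and table extraction additionally requires grouping by x0 into columns.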

Details on the data exchange

The data exchange between Python and R can be performed by one of the methods "csv", "sqlite" or "PythonInR". The methods "csv" and "sqlite" call Python via the system2 command and write the data out to temporary files. The Python executable called by system2 can be changed via the pyexe argument, for example to use a specific conda environment (in this example, the pdf environment). To do so, obtain the path to the Python executable from within that environment

import sys
sys.executable
#Py> '/home/f/anaconda3/envs/pdf/bin/python'

and specify it via the pyexe argument.

pyexe <- '/home/f/anaconda3/envs/pdf/bin/python'
d <- read.pdf(file, method = "sqlite", pyexe=pyexe)
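
The general pattern behind the "csv" method (call Python with system2 and exchange the data through a temporary file) can be sketched as follows. This is a generic illustration and not the code used inside pdfminer: the function name exchange_via_csv, the embedded Python script and the toy columns are all assumptions made for this example.

```r
# Generic "call Python, read back a temporary CSV file" round trip.
# NOT the pdfminer implementation; all names here are hypothetical.
exchange_via_csv <- function(pyexe = "python3") {
  # Sys.which() returns "" if the executable cannot be found.
  exe <- unname(Sys.which(pyexe))
  if (!nzchar(exe)) return(NULL)

  out    <- tempfile(fileext = ".csv")
  script <- tempfile(fileext = ".py")
  on.exit(unlink(c(out, script)), add = TRUE)

  # A tiny Python script that writes two rows to the file given as argument.
  writeLines(c(
    "import csv, sys",
    "with open(sys.argv[1], 'w', newline='') as f:",
    "    w = csv.writer(f)",
    "    w.writerow(['pid', 'text'])",
    "    w.writerow([1, 's'])",
    "    w.writerow([1, 'p'])"
  ), script)

  # system2() returns the exit status of the command (0 on success).
  status <- system2(exe, args = c(shQuote(script), shQuote(out)))
  if (status != 0 || !file.exists(out)) return(NULL)

  read.csv(out, stringsAsFactors = FALSE)
}

res <- exchange_via_csv()  # NULL if no Python executable is available
```

Passing a full path such as the conda executable above as pyexe selects the interpreter in the same way as the pyexe argument of read.pdf().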

Metadata

Version

1.0

License

Unknown

Platforms (75)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
  • aarch64-darwin
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-darwin
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-darwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows