What is pdfsplitter?

Working with PDF files for a machine learning project involves a lot of repetitive tasks. I wanted a tool that could handle the most common of them and, not finding anything suitable, built one myself.

Features

downloading all the PDF files linked on a web page
exporting a single image file for each page of a PDF
generating statistics that give an overview of your PDFs, including the total page count
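
To illustrate the first feature, here is a minimal sketch of how PDF links can be collected from a page's HTML using only the standard library. This is not pdfsplitter's own implementation: the names PdfLinkParser and find_pdf_links are hypothetical, and the sketch assumes that a PDF link is any anchor whose URL path ends in .pdf.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Hypothetical helper: collects absolute URLs of <a href> targets ending in .pdf."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the URL of the page they came from.
        absolute = urljoin(self.base_url, href)
        # Compare only the path, so query strings do not hide the extension.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.links.append(absolute)


def find_pdf_links(html: str, base_url: str) -> List[str]:
    """Return the PDF links found in `html`, in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

The sketch works on HTML you have already fetched, so it stays independent of any particular HTTP client.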

Install

pip install --upgrade pdfsplitter

How to use

The highest-level function for exporting image files from a series of PDFs is extract_images_from_pdfs, which takes all the PDF files inside a source directory and writes the extracted images to a destination directory. You can optionally specify the image file type for the exported images, as in this example:

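from pathlib import Path

# NOTE: the pdfsplitter functions used below also need to be imported from the package

source = Path("./tryout/")
destination = Path("./tryout/processed")

# download every PDF linked from the given web page into ./tryout
download_pdf_files(
    get_pdf_links("https://open.defense.gov/Transparency/FOIA.aspx"), "./tryout"
)

# export the pages of the downloaded PDFs as JPG images into the destination directory
extract_images_from_pdfs(source, destination, "jpg")

# print a stats table for the PDFs in the source directory, including the total page count
display_stats(get_stats(source))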

                                  Stats for your PDF Files                                   
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ PageCou… ┃ Filename                                      ┃ ocr_lay… ┃ pdf_fil… ┃ author   ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│       27 │ 2014_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 236655   │ Stephan… │
│          │                                               │          │          │ Carr     │
│        3 │ 7-26-2013_Determination.pdf                   │ False    │ 214683   │          │
│        2 │ DA Determination-DCRIT Hawaii Water Wells.pdf │ False    │ 115574   │          │
│        3 │ 12-18-14_Determination.pdf                    │ False    │ 50925    │          │
│        4 │ 6-1-2012_Determination.pdf                    │ False    │ 463902   │          │
│        2 │ 8-19-2021_Determination.pdf                   │ False    │ 350438   │          │
│       15 │ 2012_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 242305   │ CarrS    │
│        3 │ 2-12-2014_Determination.pdf                   │ False    │ 23823    │ timothy… │
│        2 │ DA%20Determination%20DoD%20Flights.pdf        │ False    │ 111521   │          │
│       22 │ 2013_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 258462   │ CarrS    │
│        2 │ 2-15-2018_Determination.pdf                   │ False    │ 342195   │          │
│       49 │ DoDFY2020AnnualFOIA_Report.pdf                │ False    │ 1247446  │          │
│        3 │ 7-5-2019_Determination.pdf                    │ False    │ 204453   │          │
│       30 │ 2017_DoD_Chief_FOIA_Officer_Report.pdf        │ False    │ 4810077  │          │
│       28 │ 2021_DoD_Chief_FOIA_Officer_Report.pdf        │ False    │ 1131474  │          │
│       10 │ 2011_DoD_Chief_FOIA_OfficerReport.pdf         │ False    │ 113387   │ CarrS    │
│       27 │ 2018_DoD_Chief_FOIA_Officer_Report.pdf        │ False    │ 788227   │ brandoct │
│        2 │ 8-3-15_Determination.pdf                      │ False    │ 105563   │          │
│        3 │ 1-21-2016_Determination.pdf                   │ False    │ 122706   │          │
│        2 │ 12-6-2017_Determination.pdf                   │ False    │ 189563   │ deleonv  │
│        2 │ 12-18-2018_Determination.pdf                  │ False    │ 153675   │          │
│       30 │ 2016_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 1108008  │          │
│        2 │ 11-29-2017_Determination.pdf                  │ False    │ 369290   │          │
│        2 │ DoD SAP IT DCRIT Determination.pdf            │ False    │ 127858   │          │
│        3 │ 10-19-2018_Determination.pdf                  │ False    │ 70088    │ JAMES    │
│          │                                               │          │          │ HOGAN    │
│       30 │ 2015_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 287445   │ Stephan… │
│          │                                               │          │          │ Carr     │
│        3 │ 7-31-2020_Determination.pdf                   │ False    │ 88447    │ Dziecic… │
│          │                                               │          │          │ Gerald J │
│          │                                               │          │          │ Jr CIV   │
│          │                                               │          │          │ OSD OGC  │
│          │                                               │          │          │ (USA)    │
└──────────┴───────────────────────────────────────────────┴──────────┴──────────┴──────────┘

TOTAL PAGECOUNT: 311
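
Since one image is exported per page, a quick sanity check is to count the exported files and compare the result with the total page count reported above. The following is a rough sketch rather than part of pdfsplitter: count_exported_images is a hypothetical helper, and it assumes the images are written somewhere under the destination directory with the file extension you requested.

```python
from pathlib import Path


def count_exported_images(destination: Path, extension: str = "jpg") -> int:
    """Hypothetical helper: count files under `destination` with the given extension."""
    # Accept both "jpg" and ".jpg", and match the suffix case-insensitively.
    wanted = "." + extension.lower().lstrip(".")
    return sum(
        1
        for path in destination.rglob("*")
        if path.is_file() and path.suffix.lower() == wanted
    )
```

For the run above, count_exported_images(Path("./tryout/processed"), "jpg") would be expected to match the total of 311 if every page was exported and the assumptions about the output layout hold.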