A simple way to extract and parse images for machine learning workflows.
What is pdfsplitter?
Working with PDF files for a machine learning project involves a lot of repetitive tasks. I wanted a tool that could handle the most common ones and, not finding anything suitable, built one for myself.
Features
- downloading all the PDF files linked from a web page
- exporting a single image file for each page of a PDF
- generating statistics that give an overview of the total page count of the PDFs
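To illustrate the first feature, here is a minimal sketch of how PDF links might be collected from a page's HTML using only the Python standard library. This is not pdfsplitter's own implementation: `PdfLinkParser` and `find_pdf_links` are hypothetical names, and fetching the page itself is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href="..."> targets whose path ends in .pdf."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL before checking the path.
        absolute = urljoin(self.base_url, href)
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.links.append(absolute)


def find_pdf_links(html: str, base_url: str) -> list[str]:
    """Return the PDF links found in `html`, in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Checking the parsed URL path rather than the raw `href` means links with query strings (for example `report.pdf?download=1`) are still recognised.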
pip install --upgrade pdfsplitter
The highest-level function for exporting image files from a series of PDFs is extract_images_from_pdfs, which takes all the PDF files inside a source directory and extracts their page images to a destination directory. You can also specify the image file type for the exported images, as in this example:
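from pathlib import Path

# get_pdf_links, download_pdf_files, extract_images_from_pdfs, get_stats and
# display_stats also need to be imported from the pdfsplitter package.
source = Path("./tryout/")
destination = Path("./tryout/processed")

# download all the PDFs linked from a particular web page
download_pdf_files(
    get_pdf_links("https://open.defense.gov/Transparency/FOIA.aspx"), "./tryout"
)

# extract an image for each page of the downloaded PDFs and save them to the destination directory
extract_images_from_pdfs(source, destination, "jpg")

# display summary statistics for the PDFs in the source directory
display_stats(get_stats(source))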
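After a run like the one above, it can be handy to check what actually landed in a directory. The following is a small sketch using only the standard library; `count_files_by_suffix` is a hypothetical helper, not part of pdfsplitter, and it makes no assumption about how the exported files are named.

```python
from collections import Counter
from pathlib import Path


def count_files_by_suffix(directory: Path) -> Counter:
    """Count regular files directly inside `directory`, keyed by lowercase suffix."""
    return Counter(p.suffix.lower() for p in directory.iterdir() if p.is_file())
```

For example, `count_files_by_suffix(Path("./tryout/processed"))` would give the number of files per extension in the destination directory, which can be compared against the page count reported by the statistics.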