pdfsplitter

A simple way to split PDF files into per-page images for machine learning workflows.


The base functionality is fairly simple. It must be able to do the following:

open a PDF file
iterate through the pages of the file
for each page, save that page as an image file, with a page counter appended to the file name
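
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the library's actual implementation: it assumes a recent PyMuPDF release (imported as `fitz`), and the names `page_image_name` and `pdf_pages_to_images`, the `zoom` parameter, and the `<stem>_<page>.<ext>` naming scheme are all hypothetical.

```python
from pathlib import Path


def page_image_name(pdf_path, page_number, img_type):
    """Build a name such as 'report_3.png' (hypothetical naming scheme)."""
    return f"{Path(pdf_path).stem}_{page_number}.{img_type}"


def pdf_pages_to_images(pdf_path, destination_path, img_type="png", zoom=2.0):
    """Save every page of one PDF as an image and return the written paths."""
    # Imported here so the naming helper above works without PyMuPDF installed.
    import fitz  # PyMuPDF

    destination = Path(destination_path)
    destination.mkdir(parents=True, exist_ok=True)
    written = []
    # Step 1: open the PDF file.
    with fitz.open(str(pdf_path)) as doc:
        # Step 2: iterate through the pages, counting from 1.
        for page_number, page in enumerate(doc, start=1):
            # Render the page; the matrix scales the default 72 dpi output.
            pixmap = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            # Step 3: save the page with the counter at the end of the name.
            target = destination / page_image_name(pdf_path, page_number, img_type)
            pixmap.save(str(target))
            written.append(target)
    return written
```

Keeping the file naming in its own small function makes the counter format easy to change (for example, to zero-padded numbers) without touching the rendering loop.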

`pdf_to_img`[source]

pdf_to_img(pdf_path:PathLike[Any], destination_path:PathLike[Any], img_type:str, export_quality_factor=2.0)

Converts a PDF file into a series of image files.

Each image file is labelled with its page number.
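
To get a feel for what `export_quality_factor` might do to the output, here is a small arithmetic sketch. It assumes the factor is applied as a uniform zoom on top of the 72 dpi default rendering resolution that PyMuPDF uses; that interpretation is an assumption, not something stated above, and the helper names are hypothetical.

```python
POINTS_PER_INCH = 72  # PDF page sizes are expressed in points (1/72 inch)


def effective_dpi(export_quality_factor=2.0):
    """Resolution of the rendered image, assuming the factor is a plain zoom."""
    return POINTS_PER_INCH * export_quality_factor


def rendered_size(width_pt, height_pt, export_quality_factor=2.0):
    """Approximate pixel size of a page given its size in points."""
    return (
        round(width_pt * export_quality_factor),
        round(height_pt * export_quality_factor),
    )


# A US Letter page measures 612 x 792 points.
print(effective_dpi())          # 144.0
print(rendered_size(612, 792))  # (1224, 1584)
```

Under this assumption, doubling the factor quadruples the pixel count, so larger values trade disk space and processing time for sharper images.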

`extract_images_from_pdfs`[source]

extract_images_from_pdfs(source_folder:PathLike[Any], destination_folder:PathLike[Any], img_type:str, export_quality_factor=2.0)

Converts all PDF files inside a particular source folder into individual image files.

Each PDF file produces one image per page. You can specify the image type you want. See https://pymupdf.readthedocs.io/en/latest/faq.html#how-to-convert-images for a full list of supported export options.
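
The folder-level behaviour can be pictured with the sketch below. It is an illustration under assumptions rather than the actual implementation: `find_pdfs` and `convert_folder` are hypothetical names, only files directly inside the source folder are considered, and the per-file conversion is passed in as a callable so that the example does not depend on PyMuPDF.

```python
from pathlib import Path


def find_pdfs(source_folder):
    """Return the PDF files directly inside source_folder, sorted by name."""
    folder = Path(source_folder)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def convert_folder(source_folder, destination_folder, convert):
    """Call convert(pdf_path, destination) for every PDF in source_folder."""
    destination = Path(destination_folder)
    # Create the output folder up front so every conversion can write to it.
    destination.mkdir(parents=True, exist_ok=True)
    pdfs = find_pdfs(source_folder)
    for pdf in pdfs:
        convert(pdf, destination)
    return pdfs
```

In practice, `convert` could be a small wrapper such as `lambda pdf, dest: pdf_to_img(pdf, dest, "png")`, which keeps the folder traversal separate from the rendering.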

`say_name`[source]

say_name(name='alex')

from pathlib import Path

# Convert every PDF in ./tryout/ into PNG files under ./tryout/processed
source = Path("./tryout/")
destination = Path("./tryout/processed")
extract_images_from_pdfs(source, destination, "png")
