A simple way to extract and parse images for machine learning workflows.

Our base functionality is fairly simple. It must be able to do the following:

  • open a PDF file
  • iterate through the pages of the file
  • for each page, save the page as an image file whose name ends with a counter string
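
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the library's actual source: it assumes PyMuPDF (imported as `fitz`) does the rendering, the name `pdf_to_img_sketch` is hypothetical, and the `<stem>-<page>.<ext>` file name pattern is a guess at how the counter is appended.

```python
from pathlib import Path


def pdf_to_img_sketch(pdf_path, destination_path, img_type="png", export_quality_factor=2.0):
    """Hypothetical sketch: render every page of one PDF to its own image file."""
    import fitz  # PyMuPDF (assumed backend); imported lazily so the module loads without it

    destination = Path(destination_path)
    destination.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    written = []

    doc = fitz.open(str(pdf_path))  # step 1: open the PDF file
    try:
        # A zoom matrix scales the render; 2.0 doubles the resolution on both axes.
        zoom = fitz.Matrix(export_quality_factor, export_quality_factor)
        for page_number, page in enumerate(doc, start=1):  # step 2: iterate through the pages
            pix = page.get_pixmap(matrix=zoom)
            # step 3: save the page with the counter at the end of the name (assumed pattern)
            out_file = destination / f"{stem}-{page_number}.{img_type}"
            pix.save(str(out_file))
            written.append(out_file)
    finally:
        doc.close()
    return written
```

Returning the list of written paths is a choice made for this sketch only; it makes the result easy to inspect or pass to a later processing step.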

pdf_to_img[source]

pdf_to_img(pdf_path:PathLike[Any], destination_path:PathLike[Any], img_type:str, export_quality_factor=2.0)

Converts a PDF file into a series of image files.

Each image file is labelled with its page number.
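
The exact labelling scheme is not spelled out here, so the following is only one plausible way to build a page-numbered file name. The helper `page_image_path`, the underscore separator, and the zero padding are all assumptions made for illustration.

```python
from pathlib import Path


def page_image_path(pdf_path, destination_path, page_number, img_type="png", width=3):
    """Hypothetical helper: build the output path for one page of a PDF."""
    stem = Path(pdf_path).stem
    # Zero-pad the page number so that files sort in page order (assumed convention).
    return Path(destination_path) / f"{stem}_{page_number:0{width}d}.{img_type}"


# Page 7 of "invoice.pdf" is labelled "invoice_007.png".
print(page_image_path("reports/invoice.pdf", "out", 7).name)
```

Zero padding keeps an alphabetical directory listing in page order, which can matter when the images are later fed to a data loader.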

extract_images_from_pdfs[source]

extract_images_from_pdfs(source_folder:PathLike[Any], destination_folder:PathLike[Any], img_type:str, export_quality_factor=2.0)

Converts all PDF files inside a particular source folder into individual image files.

Each PDF file exports a single image for each page. You can specify the type of image you want. See https://pymupdf.readthedocs.io/en/latest/faq.html#how-to-convert-images for a full list of supported export options.
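
As a rough sketch of the folder-level loop, the function below finds the PDFs in a folder and hands each one to a per-file converter. The names `find_pdfs`, `extract_images_sketch`, and the `convert_one` callback are hypothetical, and the sketch assumes that only files directly inside the source folder are processed (no recursion into subfolders), which the real function may or may not do.

```python
from pathlib import Path


def find_pdfs(source_folder):
    """Return the PDF files directly inside source_folder, sorted by name."""
    return sorted(
        p for p in Path(source_folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def extract_images_sketch(source_folder, destination_folder, img_type, convert_one):
    """Hypothetical sketch: call convert_one(pdf, destination, img_type) for each PDF."""
    destination = Path(destination_folder)
    destination.mkdir(parents=True, exist_ok=True)
    for pdf in find_pdfs(source_folder):
        convert_one(pdf, destination, img_type)
```

Passing the per-file converter in as a callback keeps the sketch independent of any PDF library; in practice `convert_one` would be a function with the same first three arguments as `pdf_to_img`.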

say_name[source]

say_name(name='alex')

source = Path("./tryout/")
destination = Path("./tryout/processed")
extract_images_from_pdfs(source, destination, "png")