Associate text to images from PDF

Hi,

Hope you are all well.

I have a PDF file from which I have to extract the images and the text, and associate each piece of text with its corresponding image. I tried the Tika parser, but I don't know how to link the images with the text (one column contains the images, the second column the text under each image). As I cannot upload the PDF file, I am sharing an image of how the PDF looks. The PDF already has a text layer.

The second approach I tried was to take a screenshot of each image together with its text and then run OCR on the screenshot. This works, but the original PDF has more than 2,000 images with text, so I would have to take 2,000 screenshots manually.

Thank you for your help

Hi @andrejz -

I suspect this task will be a tricky one. Let me tag @victor_palacios to see if he has any ideas he can share, or at least prior threads to refer to.


This seems to be an image segmentation problem. Here is an image processing webinar which includes segmentation.

Around 0:40 they begin constructing this: Solution – KNIME Community Hub

The key here may be the splitter node depending on your data.

Once you've segmented the images, you can apply OCR and have the rows align with the images. I don't recommend doing it manually; hopefully the videos will show you a more programmatic way to do so.
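If you prefer to script the OCR step instead of doing it in KNIME, a minimal pytesseract sketch might look like this (the 'segments' folder and the PNG pattern are assumptions, standing in for whatever your segmentation step produces):

import glob
from PIL import Image
import pytesseract

# Assumption: segmentation has already saved one cropped region
# (image plus its caption) per file into a 'segments' folder
rows = []
for path in sorted(glob.glob('segments/*.png')):
    text = pytesseract.image_to_string(Image.open(path)).strip()
    rows.append((path, text))  # the image file and its OCR text share one row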

If not, you may need a neural network that learns from your examples (covered again at 0:45, coincidentally) and then segments the images in order to do your matching. Hope that helps.


@andrejz you might want to add more details and provide a real sample of your data. I tried a few things and got some results, but how well this handles the file will very much depend on the real structure of your data / PDF.

This is the call; the Jupyter notebook assumes the PDF is in the same folder:

pdf_path = 'd93fbf13f4d429c8dcb75b37d7a87d098d2955d9.pdf'
extract_images_and_descriptions(pdf_path)

This is the code that worked on my Mac, to some extent:

def extract_images_and_descriptions(pdf_path):
    # Set up Tesseract configuration
    pytesseract.pytesseract.tesseract_cmd = 'tesseract'  # Or your Tesseract executable path

    # Create a new Excel workbook and add a worksheet
    wb = Workbook()
    ws = wb.active
    ws.append(['PDF Name', 'Page', 'Image Name', 'Description'])

    # Extract the PDF file name without extension
    pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]

    # Convert the PDF to images (one per page)
    images = convert_from_path(pdf_path)

    # Iterate over the images and extract the descriptions
    for i, img in enumerate(images, start=1):
        # Save the image as PNG
        img_name = f'{pdf_name}_image_{i}.png'
        img.save(img_name, 'PNG')

        # Extract text from the image
        text = pytesseract.image_to_string(img)

        # Extract the description: take the last non-empty line of the OCR output
        # (assuming the text below the image is the description)
        description = re.sub(r'\n+', '\n', text.strip()).split('\n')[-1]

        # Write the extracted data to the Excel sheet
        ws.append([pdf_name, i, img_name, description])

    # Save the Excel file with the PDF name
    wb.save(f'{pdf_name}_image_descriptions.xlsx')

These are the imports and the packages you might have to install (note that pdf2image also needs Poppler and pytesseract needs a Tesseract installation on your system):

import os
import re
from pdf2image import convert_from_path
import PyPDF2
import pytesseract
from openpyxl import Workbook

And since this is (still) a KNIME forum, I have put all of this into a KNIME workflow. This can then be used to build a loop or something.
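Outside of the workflow, the same loop idea in plain Python could be a minimal sketch like this (the assumption is that all PDFs sit next to the notebook):

import glob

# Assumption: all PDF files are in the current folder
for pdf_path in sorted(glob.glob('*.pdf')):
    extract_images_and_descriptions(pdf_path)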


@victor_palacios and @mlauber71, thank you for your advice; I will try your suggestions. In the meantime, here is the link to the PDF file on Google Drive

It is not the original, but the structure is the same.

Thank you

Regards
Andrej
