@andrejz you might want to add more details and share a real sample of your data. I tried a few things and got some results, but how well this handles a file will very much depend on the actual structure of your PDF.
This is the call where the Jupyter Notebook assumes you have the PDF in the same folder:
pdf_path = 'd93fbf13f4d429c8dcb75b37d7a87d098d2955d9.pdf'
extract_images_and_descriptions(pdf_path)
This is the code that worked on my Mac, at least to some extent:
def extract_images_and_descriptions(pdf_path):
    # Set up Tesseract configuration
    pytesseract.pytesseract.tesseract_cmd = 'tesseract'  # Or your Tesseract executable path

    # Create a new Excel workbook and add a worksheet
    wb = Workbook()
    ws = wb.active
    ws.append(['PDF Name', 'Page', 'Image Name', 'Description'])

    # Extract the PDF file name without extension
    pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]

    # Convert the PDF to images (one per page)
    images = convert_from_path(pdf_path)

    # Iterate over the images and extract the descriptions
    for i, img in enumerate(images, start=1):
        # Save the image as PNG
        img_name = f'{pdf_name}_image_{i}.png'
        img.save(img_name, 'PNG')

        # Extract text from the image
        text = pytesseract.image_to_string(img)

        # Extract the description (assuming the text below the image is the description)
        description = re.sub(r'\n+', '\n', text.strip()).split('\n')[-1]

        # Write the extracted data to the Excel sheet
        ws.append([pdf_name, i, img_name, description])

    # Save the Excel file with the PDF name
    wb.save(f'{pdf_name}_image_descriptions.xlsx')
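The description step is the part most likely to need tuning for your real PDFs, so it may help to pull it out and try it on its own. Here is a minimal sketch of that heuristic: take the last non-blank line of the OCR text. The helper name `last_nonempty_line` is made up for illustration and is not part of the function above. It also strips surrounding whitespace, which the one-liner in the function does not do for every case.

```python
def last_nonempty_line(text: str) -> str:
    """Return the last non-blank line of OCR text, or '' if there is none."""
    # Drop blank and whitespace-only lines, trim the rest
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""


# Made-up OCR output, just to show the behaviour
sample = "Figure heading\n\n\nSome body text\n  Caption under the image  \n\n"
print(last_nonempty_line(sample))
```

If the caption in your PDFs sits somewhere other than the last line of the page, this is where a different rule would go.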
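These are the imports (run them before calling the function) and the things you might have to install. Besides the Python packages, pdf2image needs Poppler and pytesseract needs the Tesseract program installed on your system: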
import os
import re
from pdf2image import convert_from_path
import PyPDF2  # not actually used by the function above, so it can be dropped
import pytesseract
from openpyxl import Workbook
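Two of these packages are only wrappers around external programs: pytesseract calls the `tesseract` executable, and pdf2image calls Poppler's `pdftoppm`. A quick way to see whether both are on your PATH might look like the sketch below. The helper `missing_tools` is a hypothetical name, and the default tool list is my assumption about what a typical setup needs.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of the given command-line tools not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Tesseract and Poppler look reachable.")
```

If something is reported missing, install it with your system's package manager, or point `pytesseract.pytesseract.tesseract_cmd` at the full path of the executable as hinted in the function.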
And since this is (still) a KNIME forum, I have put all of this into a KNIME workflow. It can then be used to build a loop, for example to process several files.
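If you want to try the loop idea directly in the notebook, a rough sketch could look like this. It assumes all PDFs sit in one folder. `process_folder` is a name I made up, and the function that handles a single PDF is passed in as an argument, so one broken file does not stop the whole run.

```python
from pathlib import Path


def process_folder(folder, process_pdf):
    """Call process_pdf(path) for every *.pdf in folder and record the outcome."""
    results = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            process_pdf(str(pdf))
            results[pdf.name] = "ok"
        except Exception as exc:  # keep going if a single file fails
            results[pdf.name] = f"failed: {exc}"
    return results
```

You would call it as `process_folder('.', extract_images_and_descriptions)`. Note that the function above writes its PNG and Excel files to the current working directory, not next to the PDF.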