How to manipulate and organize PDF files in KNIME

I have a PDF file that I load into KNIME using this script:

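import pandas as pd
from PyPDF2 import PdfReader

# Path to the PDF file (hard-coded here rather than read from the input data)
pdf_file = "C:/Users/aldem/Desktop/New folder/OC08320.pdf"

# Keep the file open while reading: the reader pulls page data lazily
with open(pdf_file, 'rb') as f:
    pdf_reader = PdfReader(f)

    # Concatenate the text of every page into one string
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text()

# Single-cell table handed back to KNIME
output_table = pd.DataFrame({"Text": [text]})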

But the extracted text came out like this:

[image]

I would like it to keep the layout of the original PDF. Is that possible?
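
One thing that may be worth trying for the layout question: PyPDF2 has been superseded by the `pypdf` package, and newer `pypdf` releases offer a layout-oriented extraction mode. The sketch below assumes `pypdf` 3.17 or later is installed. The helper names `extract_layout_text` and `pages_to_rows` are made up for illustration, and how closely the output matches the original depends on the PDF.

```python
def extract_layout_text(pdf_path):
    """Return one string per page, asking pypdf to keep the visual layout."""
    # Imported lazily so this file still loads where pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    pages = []
    for page in reader.pages:
        # Assumption: extraction_mode="layout" is available (pypdf >= 3.17).
        pages.append(page.extract_text(extraction_mode="layout") or "")
    return pages


def pages_to_rows(pages):
    """Split page texts into one row per non-empty line, keeping indentation."""
    rows = []
    for page_no, page_text in enumerate(pages, start=1):
        for line_no, line in enumerate(page_text.splitlines(), start=1):
            if line.strip():
                rows.append({"Page": page_no, "Line": line_no, "Text": line.rstrip()})
    return rows
```

In a KNIME Python Script node, the rows could then feed the output table, for example `output_table = pd.DataFrame(pages_to_rows(extract_layout_text(pdf_file)))`. One row per line tends to be easier to filter and split in later nodes than a single cell holding the whole document.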

@Aprins maybe you could provide us with a sample file. If you want to extract tables from a PDF, I have this article with code and examples.


Thank you very much for the reply.

But where I work, I don't have access to those Python libraries. :sleepy:

Here is a sample of the PDF file:
OC083201.pdf (33.4 KB)

@Aprins the result would look like this. Maybe you can find a way to install Python and the required packages.

The Tika Parser also found a logo inside the PDF …

PDF - Python package Camelot to extract Text and Tables - KNIME Forum (79188).knwf (241.7 KB)


I have two questions. To do all that, do I have to use that whole big workflow? And does it work for any PDF?

@Aprins the article should explain how this works. In principle it works on any PDF, but PDFs are notoriously complicated structures, so the results will depend on the specific format. The Camelot Python package can detect tables, and the KNIME setup can help with processing them.
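
As a rough idea of what the Camelot part might look like outside the full workflow, here is a minimal sketch. It assumes the `camelot-py` package and its dependencies can be installed. The function name `extract_tables` is hypothetical, and the choice to try the `lattice` flavor first and fall back to `stream` is just one possible strategy, not necessarily what the linked workflow does.

```python
def extract_tables(pdf_path, pages="all"):
    """Try Camelot's two parsing flavors and return the first that finds tables.

    Returns a (flavor, list_of_dataframes) pair, or (None, []) if nothing is found.
    """
    # Imported lazily so this file still loads where camelot-py is missing.
    import camelot

    # "lattice" relies on ruling lines between cells; "stream" relies on
    # whitespace between columns, so it is tried second as a fallback.
    for flavor in ("lattice", "stream"):
        tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
        if len(tables) > 0:
            # Each detected table exposes its content as a pandas DataFrame.
            return flavor, [table.df for table in tables]
    return None, []
```

Each returned DataFrame could be passed on as an output table of a Python Script node. Which flavor works better depends on whether the tables in the PDF are drawn with visible lines.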