How to manipulate and organize PDF files in KNIME

Aprins · May 15, 2024, 4:20pm

I have a PDF file that I load into KNIME using this script:

import pandas as pd
from PyPDF2 import PdfReader

# Path to the PDF file (hard-coded here rather than read from the input data)
pdf_file = "C:/Users/aldem/Desktop/New folder/OC08320.pdf"

with open(pdf_file, "rb") as f:
    pdf_reader = PdfReader(f)

    # Concatenate the text of every page while the file is still open
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text() or ""

# Single-row table with the full text of the PDF
output_table = pd.DataFrame({"Text": [text]})

But the extracted text came out like this:

[image]

I would like it to keep the layout of the original PDF. Is that possible?
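One possible way to keep the column positions is sketched below. It is not something confirmed in this thread: it assumes that a recent version of the `pypdf` package (the successor to PyPDF2) can be installed, and it relies on the `extraction_mode="layout"` option of `extract_text`. The function name `extract_text_with_layout` is made up for illustration.

```python
def extract_text_with_layout(pdf_path):
    """Return the text of all pages, keeping approximate horizontal positions."""
    # Assumption: the third-party package `pypdf` (not PyPDF2) is installed.
    # Imported inside the function so the dependency is only needed at call time.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    pages = []
    for page in reader.pages:
        # "layout" mode pads the text with spaces to mimic its position on the page;
        # fall back to an empty string if a page yields no text.
        pages.append(page.extract_text(extraction_mode="layout") or "")

    # One block of text per page, separated by a newline
    return "\n".join(pages)
```

In the script above, the result could replace `text` before the `output_table` line. How closely the output matches the original depends on how the PDF was produced, so the result should be checked against the actual file.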

mlauber71 · May 15, 2024, 5:09pm

@Aprins maybe you could provide us with a sample file. If you want to extract tables from a PDF, I have this article with code and examples.

Aprins · May 15, 2024, 7:11pm

Thank you very much for the reply.

But where I am working I don’t have access to those Python libraries.

Here is a sample of the PDF file:
OC083201.pdf (33.4 KB)

mlauber71 · May 15, 2024, 7:47pm

@Aprins the result would look like this. Maybe you can find a way to install some Python.

The Tika Parser also found a logo inside the PDF …

PDF - Python package Camelot to extract Text and Tables - KNIME Forum (79188).knwf (241.7 KB)

Aprins · May 15, 2024, 8:05pm

I have a question: to do all that, do I have to use that whole big workflow? And another question: does that work for any PDF?

mlauber71 · May 15, 2024, 10:10pm

@Aprins the article should explain how this works. In principle this would work on all PDFs, though they are notoriously complicated structures, so the results will depend on the specific format. The Camelot Python package can detect tables, and the KNIME setup can help with processing them.
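As a rough illustration of the Camelot step mentioned above, here is a minimal sketch. It is not the code from the article or the attached workflow. It assumes the `camelot-py` package is installed, and the helper name `extract_tables` and the default `flavor="stream"` are choices made for this example only.

```python
def extract_tables(pdf_path, pages="all", flavor="stream"):
    """Return one pandas DataFrame per table that Camelot detects in the PDF."""
    # Assumption: the third-party package `camelot-py` is installed.
    # Imported inside the function so the dependency is only needed at call time.
    import camelot

    # flavor="lattice" targets tables drawn with ruling lines,
    # flavor="stream" targets tables whose columns are separated by whitespace.
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)

    # Each detected table exposes its content as a DataFrame via `.df`
    return [table.df for table in tables]
```

Calling `extract_tables` with the path of a PDF returns a list of DataFrames, each of which could be passed on as an output table of a KNIME Python Script node. Which flavor works better depends on the file, so trying both on a sample is a reasonable first step.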

system · August 13, 2024, 10:11pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.