Workflow Help needed!

NabilEnn · April 21, 2024, 4:14am

Hello everyone, I’m new to KNIME and have been struggling to extract certain data from a batch of PDF files and export it to Excel. In the Excel export, the data extracted from each file has to be matched with the file it came from.
So far I’ve created a workflow using the Tika Parser, String Manipulation, Row Filter, and Excel Writer nodes, but it’s not giving me clean results. I have attached a picture of one of the pages in the PDF file. I want to extract the full name, which I marked in red, and the address. All the pages and files have the same format.

[Image removed per user request - ScottF]

mlauber71 · April 21, 2024, 8:59am

@NabilEnn you could try the Python package Camelot to extract information from PDF tables and documents.

NabilEnn · April 21, 2024, 2:02pm

Hello, thank you for the reply. So this invoice cannot be done without Python? All I need is the information I marked in red. It’s in the same location on every page and in every file.

mlauber71 · April 21, 2024, 2:11pm

@NabilEnn there can be several ways. One is using R.

And there was a challenge to extract information.

Is it possible to provide some examples maybe without sensitive data?

NabilEnn · April 21, 2024, 2:23pm

Okay, I see. I don’t know how to use the R node. And yes, this is just an example. I need to be able to extract names and addresses from any location on the page. Sometimes there’s a “name:” in front of the name, but sometimes there isn’t. I need to figure out which node will show me the rows of a page, so that I can choose which row of text to extract from. In this case I’d want whichever rows hold the name and address in this invoice.

NabilEnn · April 21, 2024, 3:12pm

Note: when I read these types of files through the Tika Parser, it puts all the data in one column, the content column. Could I use a column splitter with a delimiter to split the lines of text? Any insight on which delimiter I could use? And which other node could I then use to choose the row I want to extract?
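
One way to picture the splitting step, sketched in Python rather than with KNIME nodes: the natural delimiter is the line break. This is only an illustration under assumptions. The sample text, the file name, and the helper `content_to_rows` are made up, and it assumes the parser's content column holds the text of one file as a single string.

```python
def content_to_rows(filename, content):
    """Split one extracted text blob into (file, line_no, text) rows."""
    rows = []
    # splitlines() treats "\n", "\r\n" and "\r" as line breaks
    for line_no, line in enumerate(content.splitlines(), start=1):
        text = line.strip()
        if text:  # drop empty lines, but keep the original line numbers
            rows.append((filename, line_no, text))
    return rows


# Hypothetical content of one parsed PDF
sample = "INVOICE\r\n\r\nName: Jane Doe\r\n12 Example Street\r\nSpringfield\r\n"
rows = content_to_rows("invoice_001.pdf", sample)
for row in rows:
    print(row)
```

Keeping the file name and the line number in every row is what later allows each extracted value to be matched back to the file it came from.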

mlauber71 · April 21, 2024, 3:59pm

@NabilEnn you will most likely need some sort of structure to identify the content you want. This might be a position or a keyword. Another thing you could try is to feed the extracted text (or parts of it) into an LLM and instruct it to answer with JSON that contains the content.
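
As a rough sketch of the keyword-or-position idea, in Python: look for a label such as “Name:” first, and fall back to a fixed line position when the label is missing. The label pattern, the sample lines, and the helper `extract_name` are assumptions for illustration only; real invoices will likely need a different pattern.

```python
import re

# Optional label such as "Name:" or "Full name -" at the start of a line
LABEL = re.compile(r"^\s*(?:full\s+)?name\s*[:\-]\s*", re.IGNORECASE)


def extract_name(lines, fallback_index=None):
    """Return the name found by keyword, else by fixed line position."""
    for line in lines:
        if LABEL.match(line):
            # Strip the label and keep the rest of the line
            return LABEL.sub("", line).strip()
    if fallback_index is not None and 0 <= fallback_index < len(lines):
        return lines[fallback_index].strip()
    return None


# Hypothetical lines from two invoices, one with a label and one without
labelled = ["INVOICE", "Name: Jane Doe", "12 Example Street"]
unlabelled = ["INVOICE", "John Smith", "34 Sample Road"]

print(extract_name(labelled))
print(extract_name(unlabelled, fallback_index=1))
```

The position fallback only works if the layout really is identical across files, which is why a keyword is tried first.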

The question is whether you could provide a sample of PDFs that represents your challenges without spilling any secrets.

NabilEnn · April 21, 2024, 4:26pm

I can’t seem to find anything downloadable online, so I’d have to create a PDF using Docs or something. I’m also running into issues with the String Manipulation node when entering the regex or keywords for the column I want to extract.

mlauber71 · April 24, 2024, 9:20pm

@NabilEnn I have collected some examples that I have done in the past extracting data from a PDF file. Here is a sample workflow and an article detailing that.

It will automatically extract all tables that are in a list of PDFs, and you can also define areas from which to extract text, also on multiple pages. Maybe you can give this approach a try.
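
The idea of defining an area to extract text from can be sketched independently of any particular PDF library. The sketch below is not taken from the workflow mentioned above. It assumes some extraction step has already produced words with page numbers and bounding boxes, and that y coordinates grow downward; the `Word` tuple, the coordinates, and the helper `words_in_area` are hypothetical.

```python
from typing import NamedTuple


class Word(NamedTuple):
    page: int
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def words_in_area(words, area, pages=None):
    """Join, per page, the words whose box lies fully inside area."""
    ax0, ay0, ax1, ay1 = area
    hits = {}
    for w in words:
        if pages is not None and w.page not in pages:
            continue
        if w.x0 >= ax0 and w.y0 >= ay0 and w.x1 <= ax1 and w.y1 <= ay1:
            hits.setdefault(w.page, []).append(w)
    # Read top to bottom, then left to right (assumes y grows downward)
    return {
        page: " ".join(w.text for w in sorted(ws, key=lambda w: (w.y0, w.x0)))
        for page, ws in hits.items()
    }


# Hypothetical word boxes from a two-page document with the same layout
words = [
    Word(1, 50, 100, 80, 112, "Jane"),
    Word(1, 85, 100, 110, 112, "Doe"),
    Word(1, 50, 400, 120, 412, "Total"),
    Word(2, 50, 100, 85, 112, "John"),
    Word(2, 90, 100, 130, 112, "Smith"),
]
name_area = (40, 90, 200, 120)  # x0, y0, x1, y1 of the name region
print(words_in_area(words, name_area))
```

Because the same area is applied to every page, this only suits documents whose pages share one layout.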

mwiegand · April 25, 2024, 9:36am

Hi @mlauber71 and @NabilEnn,

extracting data from PDF files will almost never result in cleanly structured data. That holds even for well-organized PDFs containing well-structured tables.

Whilst PDF is a structured document format, each “PDF-Printer” generates slightly different results. Adobe still holds some sort of monopoly on it. Hence, assuming a PDF will parse perfectly just because it displays neatly will almost always fail miserably.

To get around most of these limitations, I developed some principles:

Do not let line breaks be parsed into multiple cells. Instead, read everything with explicit column and row separators.
Use String Manipulation to inject unique markers like “$PAGE_BREAK$”, e.g. via RegEx Replace.
Factor in that the String Replacer and String Manipulation nodes struggle to cope with line breaks within cells.
Sanitize your data by removing unwanted strings and (temporarily) replacing line breaks “\r\n” with unique markers like “$LINE_BREAK$”.
Re-establish the structure by splitting columns and cells based on the unique markers injected before.
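
Outside of KNIME, the marker approach can be sketched in a few lines of Python. This is an illustration under assumptions, not the exact workflow: it assumes pages are separated by form feed characters, and the sample text, the junk pattern, and the helper names are made up.

```python
import re

LINE_BREAK = "$LINE_BREAK$"
PAGE_BREAK = "$PAGE_BREAK$"


def inject_markers(raw):
    """Replace page and line breaks with unique, visible markers."""
    # Assumption: the extracted text separates pages with form feeds
    text = raw.replace("\f", PAGE_BREAK)
    return re.sub(r"\r\n|\r|\n", LINE_BREAK, text)


def sanitize(text, patterns):
    """Remove unwanted strings, given as regular expressions."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return text


def rebuild(text):
    """Split back into pages and lines based on the markers."""
    pages = []
    for page in text.split(PAGE_BREAK):
        lines = [ln.strip() for ln in page.split(LINE_BREAK) if ln.strip()]
        if lines:
            pages.append(lines)
    return pages


# Hypothetical raw text of a two-page PDF
raw = (
    "Name: Jane Doe\r\n12 Example Street\r\nPage 1 of 2\r\n\f"
    "Name: John Smith\r\n34 Sample Road\r\nPage 2 of 2\r\n"
)
marked = inject_markers(raw)
pages = rebuild(sanitize(marked, [r"Page \d+ of \d+"]))
print(pages)
```

While the markers are in place, the whole document is a single string without real line breaks, so cleanup steps cannot be tripped up by breaks inside a cell.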

Best
Mike
