PDF - Extract Values (GNRE)

Hi @lelloba
You are my go-to regex guy :slight_smile:

Could you help me to extract the values from the PDFs (files) from this folder?
The values will change.
I need the pattern.
All PDFs have the same layout.

I can’t get the right values.
Reading_GNRE_Workflow.knwf (3.1 MB)

@Felipereis50 you can take a look at this article and the example to define areas of a PDF to extract. To get the definitions you could employ an LLM.

Here is one example of how to extract information from a bank statement, which was more or less successful, with a local LLM and GPT4All:

A basic Python version might be faster.

Hi @mlauber71

You know you’re at a hard level, right? :grin:

But I’ll take on the challenge…

I’m following the steps of the workflow: “GPT4All - Extract Information from PDF Bank Statements into JSON – mlauber71”

Well, help me if you can:

Step 1) I accessed the website GPT4All

Step 2) I installed the application

Step 3) I’m stuck at the step for the NODE: “prepare GPT4all model path”

Where is the .bin path?

On my PC, I installed the app in this path:
C:\Users\felipesr\gpt4all\bin

But I don't know whether I need to download an external file from the app into this location and then adjust the path.

Plus: Do I need to pay anything to use GPT4All in this workflow?

@Felipereis50 you can read about the setup in this article

Hi @Felipereis50.
I suppose you need to download the model first.



The default is here

You don’t need to pay anything.
Br

Hi

@hmfa
I installed this one.

But I can't find the file.
Does it have to be in this folder?

Hi @Felipereis50.
It stays in the folder you configured.

Br

Thank you. It worked
@hmfa

@mlauber71
I ran some tests and couldn't get it to work as I expected.
I even tried using ChatGPT but didn’t get a good result.

Here’s what I’m thinking:

When I wrote the prompt, I did it by looking at the PDF.
Let’s analyze the screenshot of the PDF.


You can see that it has a clear visual structure:
There’s 1 header and 1 value.

When I wrote the prompt, I said something like this:
Extract the values from the PDF by considering the header I’ll write in quotation marks and the value right below it in square brackets.
Example: "UF Favorecida" [AL]

(I passed all the fields I needed.)

However, looking at Knime (and I imagine ChatGPT works the same way), the PDF data is converted into text, and that is where the order of the values gets mixed up.

Take a look at the screenshot of Knime’s output.


You can see that the values for each field are not displayed sequentially, likely because the PDF Parser or Tika Parser extracts text by line order. As a result, values from the second line end up distant from their respective headers.
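
To make the problem concrete, here is a minimal Python sketch of how such line-order output could be re-paired. It assumes the parsed text alternates one line of headers with one line of values, and that fields are separated by a tab or by two or more spaces; the real GNRE output may not be that regular. The sample text and the function name `pair_header_and_value_lines` are made up for illustration.

```python
import re

# Made-up sample of line-order output: a line of headers, then a line of values.
sample_text = (
    "UF Favorecida    Receita\n"
    "AL    100099\n"
)

def pair_header_and_value_lines(text):
    """Pair each header with the value at the same position on the next line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    fields = {}
    # Walk the lines two at a time: (headers, values), (headers, values), ...
    for header_line, value_line in zip(lines[0::2], lines[1::2]):
        # Assumption: fields are separated by a tab or by 2+ whitespace characters.
        headers = re.split(r"\s{2,}|\t", header_line)
        values = re.split(r"\s{2,}|\t", value_line)
        fields.update(zip(headers, values))
    return fields

print(pair_header_and_value_lines(sample_text))
```

If a value itself contains a double space, or a field is empty, the positions shift and the pairing breaks, so this only works on very regular output.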

This way, I can’t get a satisfactory result.

My expectation was a tabular output:

Column1 Column2
valuespdf1 valuespdf1
valuespdf2 valuespdf2
… …
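
As a rough sketch of that target shape in Python (outside of any specific Knime node), the snippet below turns one dictionary of extracted fields per PDF into CSV text with one row per file. The function name `fields_to_table`, the file names, and the field values are hypothetical.

```python
import csv
import io

def fields_to_table(per_file_fields, columns):
    """Return CSV text with one row per PDF and one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["file"] + columns)
    writer.writeheader()
    for file_name, fields in per_file_fields.items():
        row = {"file": file_name}
        # Fields that were not found in a PDF are left empty.
        row.update({col: fields.get(col, "") for col in columns})
        writer.writerow(row)
    return buffer.getvalue()

# Hypothetical extraction results for two PDFs.
results = {
    "gnre_1.pdf": {"UF Favorecida": "AL"},
    "gnre_2.pdf": {},
}
print(fields_to_table(results, ["UF Favorecida"]))
```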

I’ve seen a feature in Power Automate (Microsoft) where you map data based on the PDF layout. You click on the image and save the values into a variable.

My question is:

  1. Can I continue using Knime for this?
  2. Do I need to work on the prompt with even more detailed information?

@Felipereis50 I tried a few things. The LLMs were not very satisfactory. I think if you cannot come up with a regex, your best chance might be to find the right positions on the sheets and try one of the Python packages. I tried that with an LLM as well, but the results were not very encouraging.
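
A minimal sketch of the position-based idea, assuming you already have the words of a page with their coordinates. The dictionaries below are shaped like the word boxes a package such as pdfplumber returns from `extract_words()` (keys `text`, `x0`, `top`), but the coordinates, the field box, and the helper `text_in_box` are made up for illustration.

```python
def text_in_box(words, box):
    """Join the words whose top-left corner lies inside box = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = box
    inside = [w for w in words if x0 <= w["x0"] < x1 and top <= w["top"] < bottom]
    # Sort into reading order: top to bottom, then left to right.
    inside.sort(key=lambda w: (round(w["top"]), w["x0"]))
    return " ".join(w["text"] for w in inside)

# Made-up word boxes: a header on one line and its value on the line below.
words = [
    {"text": "UF", "x0": 30.0, "top": 100.0},
    {"text": "Favorecida", "x0": 45.0, "top": 100.0},
    {"text": "AL", "x0": 30.0, "top": 112.0},
]

# Made-up area for the value cell; it would have to be measured once on a real PDF.
field_boxes = {"UF Favorecida": (25, 110, 120, 125)}

extracted = {name: text_in_box(words, box) for name, box in field_boxes.items()}
print(extracted)
```

Since all PDFs share the same layout, the boxes would only need to be defined once and could then be reused for every file.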

I think Microsoft has some professional (paid) services to extract such data.

Thank you for the feedback. @mlauber71

Regex is definitely the best solution.
I would need to identify the “words” that come before each value I need.
For example: I need the value AL, which comes right after the word GNRE. So, I would need to identify GNRE and fetch the next two characters, and so on.
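
A minimal sketch of that one rule in Python, assuming the value is always two capital letters separated from GNRE by whitespace. The function name `extract_uf` and the sample string are made up; the real parsed text will look different.

```python
import re

# "GNRE", then whitespace, then exactly two capital letters captured as the value.
UF_PATTERN = re.compile(r"GNRE\s+([A-Z]{2})\b")

def extract_uf(parsed_text):
    """Return the two-letter code that follows 'GNRE', or None if it is not found."""
    match = UF_PATTERN.search(parsed_text)
    return match.group(1) if match else None

# Made-up fragment of parsed PDF text.
print(extract_uf("some header text GNRE AL more text"))
```

The other fields would each need their own anchor word and capture group, following the same pattern.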

Anyway, I’m not an expert in Regex. I find it quite complex.
I bought a course on Udemy today :grin: Let’s see if I can learn something.

And about Microsoft:
I don't remember very well, but I think I could use Power Automate Desktop.
But I would also need to learn how to loop through each file with it.

There’s so much to learn. It drives you crazy :scream:

Thank you very much.
The important thing is that I learned how to configure GPT4All.

I’ll mark this topic as solved.
