Could you help me extract the values from the PDF files in this folder?
The values will change.
I need the pattern. All PDFs have the same layout.
You can see that it has a clear visual structure:
Each field has one header and one value.
When I wrote the prompt, I said something like this: Extract the values from the PDF by considering the header I’ll write in quotation marks and the value right below it in square brackets.
Example: "UF Favorecida" [AL]
(I passed all the fields I needed.)
However, judging by what I see in Knime (and I imagine ChatGPT works the same way), the PDF data is first converted into plain text, and that is where the order of the values gets mixed up.
Take a look at the screenshot of Knime’s output.
You can see that the values for each field are not displayed sequentially, likely because the PDF Parser or Tika Parser extracts text by line order. As a result, values from the second line end up distant from their respective headers.
This way, I can’t get a satisfactory result.
My expectation was a tabular output:
Column1    | Column2
valuespdf1 | valuespdf1
valuespdf2 | valuespdf2
…          | …
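To make that expected output concrete, here is a minimal Python sketch of a one-row-per-PDF table. It assumes the fields have already been extracted from each file into a dictionary. The function name `rows_to_csv`, the file names, and every value except the "UF Favorecida" / AL pair from the example above are made up for illustration.

```python
import csv
import io


def rows_to_csv(records, columns):
    """Build CSV text with one row per PDF; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["file"] + columns, lineterminator="\n"
    )
    writer.writeheader()
    for file_name, fields in records.items():
        row = {"file": file_name}
        for column in columns:
            row[column] = fields.get(column, "")
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical extraction results, one dictionary per PDF.
records = {
    "pdf1.pdf": {"UF Favorecida": "AL", "Column2": "value-a"},
    "pdf2.pdf": {"UF Favorecida": "SP"},  # Column2 was not found in this file
}

table = rows_to_csv(records, ["UF Favorecida", "Column2"])
print(table)
```

A CSV like this can be read back into Knime with a CSV Reader node, so the extraction step and the rest of the workflow stay separate.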
I’ve seen a feature in Power Automate (Microsoft) where you map data based on the PDF layout. You click on the image and save the values into a variable.
My question is:
Can I continue using Knime for this?
Do I need to work on the prompt with even more detailed information?
@Felipereis50 I tried a few things. The LLMs were not very satisfactory. I think if you cannot come up with a regex, your best chance might be to find the right positions on the sheets and try one of the Python packages. I tried that with an LLM as well, but the results were not very encouraging.
I think Microsoft has some professional (paid) services to extract such data.
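The position-based idea could look roughly like the Python sketch below: instead of relying on line order, each header is paired with the nearest word that sits below it at about the same left edge. The word dictionaries follow the shape that pdfplumber's `extract_words()` returns (`text`, `x0`, `top`), but that choice of package is an assumption, the helper names `find_header` and `value_below` are hypothetical, and the sample coordinates and the "Valor" field are invented. It has not been tested against the actual PDFs.

```python
def find_header(words, header):
    """Return the first word dict of `header`, matched as consecutive words."""
    parts = header.split()
    for i in range(len(words) - len(parts) + 1):
        chunk = words[i:i + len(parts)]
        if [w["text"] for w in chunk] == parts:
            return chunk[0]
    return None


def value_below(words, header, x_tol=5.0):
    """Return the nearest word below `header` with a similar left edge."""
    anchor = find_header(words, header)
    if anchor is None:
        return None
    candidates = [
        w for w in words
        if w["top"] > anchor["top"] and abs(w["x0"] - anchor["x0"]) <= x_tol
    ]
    if not candidates:
        return None
    # The smallest `top` among the candidates is the closest line underneath.
    return min(candidates, key=lambda w: w["top"])["text"]


# Invented sample: two headers on one line, their values on the next line.
words = [
    {"text": "UF", "x0": 30.0, "top": 100.0},
    {"text": "Favorecida", "x0": 45.0, "top": 100.0},
    {"text": "Valor", "x0": 200.0, "top": 100.0},
    {"text": "AL", "x0": 30.0, "top": 115.0},
    {"text": "150,00", "x0": 200.0, "top": 115.0},
]

print(value_below(words, "UF Favorecida"))
print(value_below(words, "Valor"))
```

This only returns a single word per field, so values made of several words would need the candidates on the same line to be joined. The tolerance `x_tol` would also have to be tuned to the real layout.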
Regex is definitely the best solution.
I would need to identify the “words” that come before each value I need.
For example: I need the value AL, which comes right after the word GNRE. So, I would need to identify GNRE and fetch the next two characters, and so on.
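As a rough illustration of that GNRE example, a Python regex could look like the sketch below. It assumes the value is always two uppercase letters and that only whitespace (possibly a line break) separates it from GNRE. The function name `extract_uf` and the sample text are made up.

```python
import re

# "GNRE", optional whitespace (including line breaks), then two uppercase
# letters captured as the value.
UF_PATTERN = re.compile(r"GNRE\s*([A-Z]{2})")


def extract_uf(text):
    """Return the two letters that follow GNRE, or None if there is no match."""
    match = UF_PATTERN.search(text)
    return match.group(1) if match else None


sample = "some text GNRE AL 0001 more text"  # invented sample text
print(extract_uf(sample))
```

The same pattern string should also work in Knime's Regex Extractor or String Manipulation nodes, with one pattern per field, although the exact anchor word for each field has to be read off the parsed text.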
Anyway, I’m not an expert in Regex. I find it quite complex.
I bought a course on Udemy today. Let’s see if I can learn something.
And about Microsoft:
I don’t remember very well, but I think I could use POWER AUTOMATE DESKTOP.
But I also need to learn it to do a LOOP through each file.
There’s so much to learn. It drives you crazy.
Thank you very much.
The important thing is that I learned how to configure GPT4all.