Could you help me extract the values from the PDF files in this folder?
The values will change.
I need the pattern. All PDFs have the same layout.
You can see that it has a clear visual structure:
There’s 1 header and 1 value.
When I wrote the prompt, I said something like this: Extract the values from the PDF by considering the header I’ll write in quotation marks and the value right below it in square brackets.
Example: "UF Favorecida" [AL]
(I passed all the fields I needed.)
However, judging from Knime (and I imagine ChatGPT works the same way), the PDF data is first converted into text, and that is where the order of the values gets mixed up.
Take a look at the screenshot of Knime’s output.
You can see that the values for each field are not displayed sequentially, likely because the PDF Parser or Tika Parser extracts text by line order. As a result, values from the second line end up distant from their respective headers.
This way, I can’t get a satisfactory result.
My expectation was a tabular output:
Column1 | Column2
valuespdf1 | valuespdf1
valuespdf2 | valuespdf2
… | …
I’ve seen a feature in Power Automate (Microsoft) where you map data based on the PDF layout. You click on the image and save the values into a variable.
My question is:
Can I continue using Knime for this?
Do I need to work on the prompt with even more detailed information?
@Felipereis50 I tried a few things. The LLMs were not very satisfactory. I think if you cannot come up with a regex, your best chance might be to find the right positions on the sheets and try one of the Python packages. I tried that with an LLM as well, but the results were not very encouraging.
I think Microsoft has some professional (paid) services to extract such data.
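On the idea of using positions on the sheet: this is only a rough sketch of how it could work, not something tested on these PDFs. It assumes you already have each word together with its horizontal position (a package such as pdfplumber reports coordinates for each word), and it pairs every header with the value whose left edge is closest. The field name "Other Field" and all coordinates below are made up for illustration.

```python
def pair_headers_with_values(header_words, value_words):
    """Pair each header with the value whose left edge (x0) is closest.

    Both arguments are lists of (text, x0) tuples taken from the header
    row and from the value row printed right below it.
    """
    pairs = {}
    for header, header_x in header_words:
        closest = min(value_words, key=lambda word: abs(word[1] - header_x))
        pairs[header] = closest[0]
    return pairs


# Hypothetical coordinates; real ones would come from the PDF library.
headers = [("UF Favorecida", 30.0), ("Other Field", 200.0)]
values = [("100099", 201.5), ("AL", 30.2)]  # order in the text does not matter

print(pair_headers_with_values(headers, values))
# prints: {'UF Favorecida': 'AL', 'Other Field': '100099'}
```

Because the match is based on position and not on the order of the extracted text, it should not matter that the parser returns the values far away from their headers.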
Regex is definitely the best solution.
I would need to identify the “words” that come before each value I need.
For example: I need the value AL, which comes right after the word GNRE. So, I would need to identify GNRE and fetch the next two characters, and so on.
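Roughly, that idea could look like the sketch below in Python. It is only an illustration: the sample text is invented, and it assumes the value is always two capital letters separated from GNRE by spaces or a line break.

```python
import re

# Invented sample, standing in for the text a PDF parser might return.
sample_text = "GNRE AL 0001234"

# Literal word GNRE, then whitespace (spaces or a line break),
# then capture exactly two capital letters.
UF_PATTERN = re.compile(r"GNRE\s+([A-Z]{2})\b")


def extract_uf(text):
    """Return the two-letter code that follows GNRE, or None if not found."""
    match = UF_PATTERN.search(text)
    return match.group(1) if match else None


print(extract_uf(sample_text))  # prints: AL
```

Each of the other fields would need its own pattern built the same way: the word that comes before the value, followed by a capture group describing what the value looks like.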
Anyway, I’m not an expert in Regex. I find it quite complex.
I bought a course on Udemy today. Let’s see if I can learn something.
And about Microsoft:
I don’t remember very well, but I think I could use POWER AUTOMATE DESKTOP.
But I also need to learn it to do a LOOP through each file.
There’s so much to learn. It drives you crazy.
Thank you very much.
The important thing is that I learned how to configure GPT4all.
If there is no restriction on using software other than KNIME, I’d suggest using Tabula for this.
I’ve used it recently to extract data from the National Achievement Survey reports (https://nas.gov.in/download-data-district-wise-2017). Each of the reports has the same structure, so Tabula would be great for this kind of task.
Also, could you share the KNIME workflow that you posted earlier in this thread as a screenshot?
I used “Tabula,” and it is a good option; however, it does not perform a loop for each file.
You can upload multiple files, but you need to click one by one to download the data.
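If scripting is an option, the clicking could be replaced by a loop over the folder. The sketch below assumes Python; `extract_fields` is a hypothetical placeholder for whatever extraction ends up working on one file (a regex, the tabula-py wrapper, and so on), and the folder and file names in the usage comment are made up.

```python
import csv
from pathlib import Path


def build_table(folder, extract_fields, output_csv):
    """Run extract_fields on every PDF in folder and write one CSV row per file.

    extract_fields is a placeholder: it receives a file path and must
    return a dict mapping column names to values.
    Returns the number of files processed.
    """
    rows = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        fields = extract_fields(pdf_path)
        rows.append({"file": pdf_path.name, **fields})
    if not rows:
        return 0
    # All PDFs share one layout, so the first row defines the columns.
    columns = list(rows[0].keys())
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Usage (names are made up):
# build_table("pdf_folder", my_extract_function, "values.csv")
```

The result is one row per PDF and one column per field, which is the tabular output described at the start of the thread.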
@Felipereis50 you could also take a look at these examples, which use Tabula with multiple PDFs. Maybe you can share the approach that worked for you so that it can be adapted.