Extract Data from Invoices to XML or CSV?

mlauber71 · July 11, 2024, 3:34am

@jannikw99 8 GB of RAM is not much for running a local LLM (Ollama), Python and KNIME together. I would recommend 32 GB. The Streamlit app might use less.

Another option would be to use ChatGPT.

Or you could try a smaller model (maybe TinyLlama), but I don’t know whether it will be able to do the extraction.

mlauber71 · July 13, 2024, 8:07am

@jannikw99 we now also have a workflow that uses only KNIME 5.3 nodes together with GPT4All (3.0), Llama3 and the new Expression node for regex (so no Palladian dependencies).

There is no special encoding here; it just uses the text from the parser.

jannikw99 · July 13, 2024, 1:27pm

Hi @mlauber71

Looks neat, thank you. I’ve upgraded my RAM and now it seems to work, although my CPU is at 100% while running the local LLMs, which seems a bit weird. Still, it works (I tried the old workflow).

Did you see a difference in output quality or performance between GPT4All with just a text parser and Ollama with embeddings?

I am still trying to figure out the best prompt. My current version seems okay-ish but sometimes produces odd results, like turning GmbH into “Gmb H” or using two dots in a number (1.000.45), which breaks the String to JSON conversion. I am still experimenting to find out what causes this and a few other minor issues.

mlauber71 · July 13, 2024, 1:35pm

@jannikw99 I have not done that much testing. What I saw were differences between Llama3 and Mistral depending on which prompts you use but not in a systematic way.

From a few tries it seems that you will have to give more detailed instructions on how you want the formats to be handled (in the JSON results). But if the instructions are too detailed, the result seems to get worse than when you leave the model some flexibility to fulfil the task. So detailed, but not too detailed. I am not sure whether asking the model itself to write the prompt (based on a number of sample documents) might help.
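
As a rough illustration of "detailed but not too detailed", a prompt could name each field with a one-line format hint and leave the rest to the model, and the reply could be checked before it goes into the JSON conversion. This is only a sketch in Python: the field names, the hints and the helper functions `build_prompt` and `parse_reply` are made up for this example and are not taken from the workflow.

```python
import json

# Hypothetical field names and format hints; adapt them to your invoices.
FIELDS = {
    "invoice_number": "string, exactly as printed",
    "invoice_date": "ISO date, YYYY-MM-DD",
    "vendor_name": "string, keep legal forms such as GmbH unchanged",
    "total_amount": "string with digits and one decimal point, e.g. 1000.45",
}


def build_prompt(invoice_text: str) -> str:
    """Build an extraction prompt with one short format hint per field."""
    field_lines = "\n".join(f'- "{name}": {hint}' for name, hint in FIELDS.items())
    return (
        "Extract the following fields from the invoice text below.\n"
        "Answer with a single JSON object and nothing else.\n"
        "Use null for a field you cannot find.\n\n"
        f"Fields:\n{field_lines}\n\n"
        f"Invoice text:\n{invoice_text}\n"
    )


def parse_reply(reply: str) -> dict:
    """Parse the model reply and check that every expected field is present."""
    # Models often wrap the JSON in extra text, so cut from the first "{"
    # to the last "}" before parsing.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(reply[start:end + 1])
    missing = [name for name in FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data
```

A reply that fails `parse_reply` could be sent through the model again, which may also help with fields that are found in one run and missing in the next.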

The blanks might result from the parsing and embedding process. In the GPT4All version you can see that what you have is just text, and since PDFs are, let us say, not optimal for storing structured information, you will sometimes get blanks.

You might have to do some further processing, like removing blanks from numbers and then converting them into numeric values, while dealing with commas and dots as separators.

From what I hear there are commercial embedding solutions, such as Ada by OpenAI or offerings from Microsoft, that are superior to these free tools. But then you are in the commercial sphere.
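
A minimal sketch of such a cleanup in Python (for example in a Python Script node), assuming the amounts arrive as strings. The function name `parse_amount` and the rule that the last dot or comma is the decimal separator are assumptions for this example, not something the workflow does today.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(raw: str) -> Optional[Decimal]:
    """Convert an extracted amount string to a Decimal, or None if that fails."""
    # Remove blanks inside the number (this also covers non-breaking spaces).
    text = re.sub(r"\s+", "", raw)
    # Keep only digits, separators and a minus sign (drops currency symbols).
    text = re.sub(r"[^0-9.,\-]", "", text)
    if not re.search(r"\d", text):
        return None

    last = max(text.rfind("."), text.rfind(","))
    if last == -1:
        int_part, frac_part = text, ""
    else:
        digits_after = len(text) - last - 1
        separators = text.count(".") + text.count(",")
        if separators == 1 and digits_after == 3:
            # Ambiguous case such as "1.000": assumed to be a thousands separator.
            int_part, frac_part = text, ""
        else:
            # Assumption: the last separator is the decimal separator.
            int_part, frac_part = text[:last], text[last + 1:]

    # Every separator left in the integer part is a thousands separator.
    int_part = re.sub(r"[.,]", "", int_part)
    try:
        return Decimal(f"{int_part}.{frac_part}" if frac_part else int_part)
    except InvalidOperation:
        return None
```

With these rules `"1.000,45"`, `"1 000,45 €"` and the broken `"1.000.45"` all come out as the same value. A lone `"1.000"` is read as one thousand, which is a guess that may be wrong for invoices with three decimal places.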

jannikw99 · July 15, 2024, 1:13am

Hi @mlauber71,

I’ve switched to using KNIME on my host PC with a powerful GPU, and it’s significantly faster now, which is great. However, I’m having trouble finding the correct prompt to retrieve all the data I need. I can share my current prompt in either German or English if you’d like to try it out. There are a lot of inconsistencies, such as it not finding data that it had found in previous attempts. I’m not sure why this is happening. Perhaps the prompt is too detailed, although I don’t think it should be.

Even though I believe the first option might be the better one, since the PDF Parser won’t pay attention to the PDF structure, I also tried your other workflow. There I got an error from the PDF Parser: “Could not parse all files properly” (see attached image). I’m not sure what’s causing this. The Temp folder is created correctly and the PDF is copied into it, but the PDF Parser then fails to parse the file. Do you have any ideas on how to fix this?
