Extract Data from Invoices to XML or CSV?

Hi Community,

I’m trying to understand whether the basic, open-source version of KNIME meets my requirements and is the right tool for me. We receive digital PDF security invoices and want to extract specific information from them (account number, security number, price, price value, …) and output it as CSV or XML files.

Unfortunately, these are not proper tables; the values are more or less always presented as free text somewhere on the forms. Additionally, there are dozens of banks, each sending completely different invoices, which in my opinion makes it impractical to write individual regex patterns for hundreds of banks. A tool that can learn or guess parameters from new banks would be a significant relief.

Can KNIME do this in any way, or am I completely off track here? Thanks!

Hello @jannikw99,

Welcome to KNIME Community.

To import a PDF document into KNIME, there are the Tika Parser and PDF Parser nodes.

As you get invoices from different banks, you will need to apply different nodes and regular expressions to match the desired fields.

If each bank has a specific invoice design, you can create separate workflows per bank, and as a more advanced option you can also create data apps and services.
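To illustrate the per-bank approach, a regex setup might look roughly like the sketch below. The bank names, field labels, and patterns are made up for illustration; real invoices will need their own patterns:

```python
import re

# Hypothetical per-bank regex patterns -- field labels and layouts
# differ between banks, so each bank needs its own pattern set.
PATTERNS = {
    "bank_a": {
        "isin": re.compile(r"ISIN[:\s]+([A-Z]{2}[A-Z0-9]{9}\d)"),
        "price": re.compile(r"Kurs[:\s]+([\d.,]+)"),
    },
    "bank_b": {
        "isin": re.compile(r"Wertpapier-Kenn-Nr\.\s+([A-Z]{2}[A-Z0-9]{9}\d)"),
        "price": re.compile(r"Ausführungskurs\s+([\d.,]+)"),
    },
}

def extract_fields(text: str, bank: str) -> dict:
    """Apply the regex set for one bank to raw invoice text."""
    result = {}
    for field, pattern in PATTERNS[bank].items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result
```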

Using KNIME you can leverage its machine learning capabilities. For example, you could train a model to classify invoices based on their layout, and then apply the appropriate extraction rules based on the predicted layout.

The basic version of KNIME is open-source and should meet your requirements for extracting data from PDF invoices.

Regards,
Yogesh

2 Likes

Hello Yogesh,

Thank you for your reply.

Could I also use pdftotext or something similar in a Python node, since my PDFs are digital and not scanned - or is that what Tika does?
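Something like this minimal sketch is what I have in mind (assuming poppler-utils is installed; the file name is just an example):

```python
import subprocess

def pdf_to_text(pdf_path: str) -> str:
    """Extract the text layer of a digital (non-scanned) PDF with
    poppler's pdftotext; '-layout' keeps the rough page layout and
    '-' writes the output to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

text = pdf_to_text("layout-1-a.pdf")
print(text[:500])
```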

The problem with regex is that while it’s useful, managing multiple layouts (200+ from roughly 100 banks) will be challenging, especially if banks change their layouts. That’s why I feel regex might not be the best solution, right?

I’ve also tried the invoice2data open-source GitHub project, which uses regex as well (though not in a visual editor like here), but I’m not sure if that’s the way to success.

Jannik

1 Like

Hello @jannikw99 ,

Thanks for the reply. Maybe you can share some sample invoices so that we can see what solution can be offered.
I haven’t worked much with PDF documents, but maybe someone with more experience can solve your problem.

Thanks

@jannikw99 you could take a look at this discussion and try some of the examples mentioned there. One option could be to invoke LLMs with specific instructions to extract certain information. You might have to feed a single PDF into a vector store and give strong instructions to only use a specific JSON structure to output the information, and see which model follows best (maybe Mistral or Llama3 in the instruct versions).
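As a rough illustration of the idea (plain Python against the standard Ollama REST API, not a KNIME node; the instruction text and field list are made up):

```python
import json
import requests

# A minimal sketch: send extracted invoice text to a local Ollama
# server and demand a strict JSON answer. The "format": "json"
# option tells Ollama to force valid JSON output.
INSTRUCTION = (
    "Extract ISIN, Kurs and Menge from the following invoice text. "
    "Answer only with a JSON object using exactly these keys; "
    "leave a field empty if the information is missing.\n\n"
)

def extract_with_llm(invoice_text: str, model: str = "llama3:instruct") -> dict:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": INSTRUCTION + invoice_text,
            "format": "json",
            "stream": False,
        },
        timeout=300,
    )
    response.raise_for_status()
    return json.loads(response.json()["response"])
```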

And indeed maybe a set of (anonymised) sample PDFs that would show the range of your challenge might help.

1 Like

Hello @mlauber71 and @yogesh_nawale,

thanks for the replies. I am not an expert on editing PDFs, but these two examples illustrate how much the layouts can sometimes differ:

layout-2-a.pdf (64.7 KB)
layout-1-a.pdf (78.9 KB)

We need to extract a significant amount of information from these PDFs in XML or CSV format. Our preferred solution is a fully automated workflow, potentially utilizing 2 or 3 methods that cross-verify the results. If all methods produce the same output, it’s considered accurate and ready for use. If there are discrepancies, a warning should be issued indicating the need for manual review.
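To illustrate the cross-check idea, a minimal sketch in Python (the all-methods-must-agree rule is just one possible policy; relaxing it to a majority vote would be easy):

```python
def cross_verify(results: list[dict]) -> tuple[dict, list[str]]:
    """Compare field-by-field results from several extraction methods.
    Returns the agreed values plus a list of fields that need manual
    review because the methods disagree."""
    agreed, conflicts = {}, []
    for field in results[0]:
        values = {r.get(field) for r in results}
        if len(values) == 1:
            agreed[field] = values.pop()
        else:
            conflicts.append(field)
    return agreed, conflicts
```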

Hi,

After giving it some thought overnight, I believe using Llama-3 or a similar language model might be the best approach for this situation. What do you both think? (And anyone else is welcome to share their thoughts too.)

Do I need to convert the PDFs into a Vector Store for Llama to access the information, or is there another way to provide the data to the model? Also, is there a way to verify the accuracy of Llama’s output?

Or is there a better solution?

@jannikw99 you might have to do it PDF by PDF. You could try to bundle together the PDFs of one type, give each a unique ID, and append them to the vector store. Then you instruct llama3 to use the other PDFs as examples and specifically extract the one with your ID into a strict JSON file.
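Roughly, with the chromadb Python package (collection name, IDs, and metadata are placeholders; the default embedding function is used here, though mxbai-embed-large via Ollama would work too):

```python
import chromadb

# Put all PDFs of one layout type into a single collection, each with
# a unique ID, so a prompt can reference "the document with ID x" and
# treat the rest of the collection as examples.
client = chromadb.Client()
collection = client.get_or_create_collection("bank_type_1")

for doc_id, text in [("invoice_001", "..."), ("invoice_002", "...")]:
    collection.add(
        ids=[doc_id],
        documents=[text],
        metadatas=[{"bank": "bank_a"}],
    )
```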

The problem with your examples is that they may not provide genuine examples of addresses; that might be different with original data.

You will have to experiment with instructions/prompts to make sure only a strict JSON comes back.

To compare results you could try to do this with Llama and Mistral and see if they differ.

From your two examples, you might have to do this by type, since they differ significantly and often there is no clear indication of what is what. You will also have to come up with a definitive list of the fields you want.

In addition you could experiment with different vector store types like FAISS or Chroma.

Setting this up will involve some planning and consideration.

Any chance of more examples of the same pdf type?

Hello @mlauber71

I’ve uploaded ten improved anonymized invoice PDFs to my Nextcloud. Here is the link: Refugium Cloud

We usually require at least the following fields. Not all fields are always present:

  • Firma (Company)
  • Depot (Depot)
  • Bank (Bank)
  • Transaktionsart (Transaction Type - Purchase / Sale)
  • ISIN / Kenn-Nr. der Aktie (ISIN / Stock ID Number)
  • WKN (WKN)
  • Wertstellung (Valuta) (Value Date)
  • Belegdatum (Document Date)
  • WSL (Währung) (Currency)
  • FW-Kurs (Devisenkurs) (Foreign Exchange Rate)
  • Menge (Quantity)
  • Kurs (Price)
  • WSL Kurs (Währung) (Currency of Price)
  • Kurswert (Price Value)
  • WSL Kurswert (Währung) (Currency of Price Value)
  • […]

There will be more fields in the end, but for demonstration purposes, this should be enough.

I’ve read the Medium article (and others related to it).

Do you need any additional information?
Thank you!

Jannik

@jannikw99 the prompt with “mxbai-embed-large” embedding, Chroma Vector Store and “llama3:instruct” LLM would give you something like this:

The next step would be to bring this into a loop that iterates over all PDFs and then brings the results back to KNIME, extracting the meta-information and the JSON files from the data. I will see when I can continue to explore this.

conversation_history.json.zip (1.4 KB)

Extract exactly these information from this document into a JSON file.

Do not add any information that is not there. Do not change the structure of the JSON file. Do not alter the names of the fields or the order!

If information is missing just leave it empty! The JSON fields:

  • Firma (Company)
  • Depot (Depot)
  • Bank (Bank)
  • Transaktionsart (Transaction Type - Purchase / Sale)
  • ISIN / Kenn-Nr. der Aktie (ISIN / Stock ID Number)
  • WKN (WKN)
  • Wertstellung (Valuta) (Value Date)
  • Belegdatum (Document Date)
  • WSL (Währung) (Currency)
  • FW-Kurs (Devisenkurs) (Foreign Exchange Rate)
  • Menge (Quantity)
  • Kurs (Price)
  • WSL Kurs (Währung) (Currency of Price)
  • Kurswert (Price Value)
  • WSL Kurswert (Währung) (Currency of Price Value)
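For anyone who wants to reproduce this setup outside a KNIME or Streamlit environment, here is a rough Python sketch under the same assumptions (an Ollama server on localhost with mxbai-embed-large and llama3:instruct pulled, plus the chromadb package; file names and the query string are illustrative):

```python
import requests
import chromadb

OLLAMA = "http://localhost:11434"
PROMPT = "Extract exactly these information from this document into a JSON file. ..."  # the instruction quoted above
invoice_text = open("layout-1-a.txt", encoding="utf-8").read()  # text extracted from the PDF beforehand

def embed(text: str) -> list[float]:
    """Embed text with mxbai-embed-large via the Ollama REST API."""
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "mxbai-embed-large", "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

# Store the invoice in a Chroma collection with its own embedding.
client = chromadb.Client()
col = client.get_or_create_collection("invoices")
col.add(ids=["layout-1-a"], documents=[invoice_text],
        embeddings=[embed(invoice_text)])

# Retrieve the best-matching document and send it to llama3:instruct
# together with the strict-JSON instruction.
hits = col.query(query_embeddings=[embed("ISIN WKN Kurs Menge Kurswert")],
                 n_results=1)
context = hits["documents"][0][0]
answer = requests.post(f"{OLLAMA}/api/generate", json={
    "model": "llama3:instruct",
    "prompt": PROMPT + "\n\n" + context,
    "format": "json",
    "stream": False,
}).json()["response"]
print(answer)
```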

2 Likes

Hello @mlauber71,

This looks quite promising. Just to clarify, this is currently implemented using your Streamlit app and not within KNIME with these nodes available from the community?

The prompt appears useful but could probably be enhanced further to ensure it finds all the correct values (for instance, I noticed that “Kurs” is empty when it should be filled, same as “Menge”). Perhaps using ChatGPT as a prompt engineer to refine the prompt could be beneficial and interesting to experiment with later on.

Does the Streamlit app offer a way to automate this process without manually uploading the document files to the app (for example loading the files from a folder or similar) and subsequently processing the results with KNIME as you mentioned?

I plan to set up a KNIME instance over the weekend. Based on my experience with Ollama, the best operating systems seem to be macOS or Ubuntu, correct?

Jannik

@jannikw99 I set up a workflow to extract information from your PDFs with two different LLM models (Llama3, Mistral). From what I saw, it makes sense to work on the prompt, for example by adding a passage about making sure the JSON structure is valid and avoiding nested data.

Your PDFs are from different banks and contain different sets of data, like purchases of different products (stocks and bonds). I focussed on the first bank and on stocks, and I assume it will take some work on the prompt(s) to be ready for several different versions. It might be necessary to have different prompts for different banks. I think it is also possible to employ LLM models to construct the prompt to use. You could set up a two- or three-step approach where you define the fields you want, then send some documents per bank and ask the model to provide a prompt for this bank. An even more advanced setup would be to employ agent systems like crewai.com to ‘battle out’ the best prompt - though it might be difficult for the system to judge what is a good result.


I might write a whole article about this but let us dive into a few steps of the workflow.

Scan the PDFs in a folder, provide an instruction and a prompt, and set which model you want to use. I have two lines in the workflow: one using a Chroma and one a FAISS vector store. I have used Llama3 and Mistral - both give interesting results.

Each individual PDF (or other document) will run through the Python node, be embedded (with mxbai-embed-large), and then be put to the LLM model in one step.
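Not the workflow itself, but a rough Python approximation of this loop; it skips the vector-store step and sends the extracted text directly to each model (pdftotext, the folder name, and the model tags are assumptions):

```python
import json
import subprocess
from pathlib import Path

import requests

# The instruction from above; shortened here as a placeholder.
PROMPT = "Extract exactly these information from this document into a JSON file. ..."

def pdf_text(path: Path) -> str:
    """Extract the text layer with poppler's pdftotext."""
    return subprocess.run(["pdftotext", "-layout", str(path), "-"],
                          capture_output=True, text=True, check=True).stdout

def ask(model: str, text: str) -> dict:
    """Query a local Ollama model and parse its JSON answer."""
    r = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": PROMPT + "\n\n" + text,
        "format": "json",
        "stream": False,
    }, timeout=600)
    r.raise_for_status()
    return json.loads(r.json()["response"])

# One result row per PDF, with the answers of both models side by side.
rows = []
for pdf in sorted(Path("invoices").glob("*.pdf")):
    text = pdf_text(pdf)
    rows.append({"file": pdf.name,
                 "llama3": ask("llama3:instruct", text),
                 "mistral": ask("mistral:instruct", text)})
```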

KNIME will later collect the results, which are in JSON structures, and extract them into a simple table. It seems both LLMs are able to provide quite good results. If you are missing fields, you might have to provide a more detailed prompt:

Mistral seems to be good at providing just the JSON although sometimes some fields might be slightly different so you can use KNIME to make sure the structure is the same (and deal with missing fields later).

In the end the results will be in a table like this:

With regard to the process, you could now compare the results of the two models - you might have to do some tweaking of formats and column names - and when both models agree, you can use the results.
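As a sketch of that comparison step in Python/pandas (outside KNIME; the field list repeats the one above, and the example rows are dummies):

```python
import pandas as pd

EXPECTED = ["Firma", "Depot", "Bank", "Transaktionsart", "ISIN", "WKN",
            "Wertstellung", "Belegdatum", "WSL", "FW-Kurs", "Menge",
            "Kurs", "WSL Kurs", "Kurswert", "WSL Kurswert"]

def normalize(record: dict) -> dict:
    """Force a model answer into the same flat structure: keep only
    the expected fields and fill missing ones with None."""
    return {field: record.get(field) for field in EXPECTED}

# Dummy answers standing in for the parsed JSON of each model,
# one dict per PDF, in the same order.
llama_rows = [{"Firma": "Example AG", "ISIN": "DE0000000001"}]
mistral_rows = [{"Firma": "Example AG", "ISIN": "DE0000000001"}]

df_llama = pd.DataFrame([normalize(r) for r in llama_rows])
df_mistral = pd.DataFrame([normalize(r) for r in mistral_rows])

# True where both models returned the same value (missing values are
# treated as equal). A fully True row can be trusted; anything else
# goes to manual review.
matches = df_llama.fillna("").eq(df_mistral.fillna(""))
trusted = matches.all(axis=1)
```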

You could also think about a four-step approach where you use the compared results, provide them along with (again) the vector store, and ask the model to try to find the missing information.

I am not sure whether it would help to provide a sample of other PDFs as a pattern. I tried to experiment with that in the Streamlit app but was not satisfied with the results. It might be that a one-shot approach for extraction, relying on what the LLM ‘knows’ internally, is OK.

One idea could be to include a general description of what the files contain in order to smooth the process (“you are about to see bank statements in German with stock purchases and sales …”). This might be slightly hindered by the anonymised nature of your public bank files, since they might not contain the real structure of addresses, names and so on.

At this point (again) a reminder that all this is taking place on your local machine. I was quite satisfied with the speed of my Apple Silicon M1.

This example runs on KNIME 4.7 for the sake of compatibility within my organisation. I might provide a similar example using more recent KNIME nodes, GPT4All and KNIME 5.3, so as to have this in a more KNIME-like way.

(maybe best to download the whole workflow group)

1 Like

Hi @mlauber71,

Thank you very much for your generous help. I really appreciate it! I’ll try to get your workflow running within the next week and experiment with some examples. Since I’ve never used KNIME before, this might be a typical question: does your example, which runs on KNIME 4.7, also work on newer KNIME versions? Or should I consider installing KNIME 4.7 instead? The newer KNIME nodes seem quite promising for streamlining the overall structure; I’m looking forward to your 5.x example.

In ChatGPT, there’s an effective way to improve results by using it as a Prompt Engineer. I am sure this should work quite similarly with Llama. You can see a German and English version of the starting prompt on this pastebin: ChatGPT Prompt Engineer - Pastebin.com. I’ll gladly share the prompt I end up using once I’ve experimented with it extensively.

Regarding your point about needing different prompts for different banks, you might be right. It could heavily depend on the specific prompt we use. Ideally, one prompt would suit all, but I understand there might be differences based on layouts or whether the content is about stocks or bonds. Would you recommend asking Llama to determine if the PDF is about stocks or bonds and then running Prompt A for stocks and Prompt B for bonds?

It could be beneficial to compare both tables at the end to see whether some values differ significantly. Using better prompts, such as providing the correct JSON column names directly within the prompt, might help keep the values consistent. Would it be a good idea to move the local files based on whether both models gave the same results or differed? For example, if both Llama and Mistral provide the same result for bank1.pdf, move the PDF to folder X; if they provide different results, move the PDF to folder Z for manual review, since we usually only want the results if we are quite certain they are correct - the same goes for extracting these values for further use in our process.
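For illustration, a sketch of such a routing step (the folder names are placeholders):

```python
import shutil
from pathlib import Path

def route_invoice(pdf: Path, results_match: bool) -> None:
    """Move a processed PDF to 'verified' when both models agree,
    otherwise to 'review' for a manual check."""
    target = Path("verified" if results_match else "review")
    target.mkdir(exist_ok=True)
    shutil.move(str(pdf), target / pdf.name)

# Example: both models agreed on bank1.pdf, so it goes to 'verified'.
route_invoice(Path("invoices/bank1.pdf"), results_match=True)
```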

I found a thread about system requirements from 2022, but it didn’t consider using LLM locally. Are these CPU-intensive or GPU-intensive? I’d like to try it on an AMD Ryzen 9 3900X with a low-end graphics card (just to get a hang of KNIME and see if it works) and maybe later upgrade to an Apple M2 or similar if needed.

Thank you!

1 Like

I have a collection of how to start learning KNIME. You will benefit not only with regard to LLMs:

All older KNIME workflows will also work on newer versions of KNIME. Version 5.3 is due around the 10th of July (2024). I did it in 4.7 for these reasons:

  • some companies are still on older versions due to internal processes. So they can also benefit from using LLMs with KNIME
  • and then I sometimes found that, although the (new) KNIME Gen AI nodes are great, they sometimes struggle to keep up with the pace of developments, namely regarding GPT4All (currently I am struggling to combine the PDF Parser and the Chroma Vector Store node with your example …)

Then: letting an LLM write the prompt is maybe a good idea. The interesting thing will be to bring this into a system where you can process several files in a row. Also, I have found that adding some instructions can help (if it does not bring too much overhead). For example, I instructed the LLM to check the JSON structure and not to use nested fields. Too many instructions might confuse the model, and you might have to keep track of which changes helped and which did not. In the example, the question is always stored together with the response in a KNIME table.
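For those JSON checks, something along these lines could run in a Python node (a sketch, not taken from the workflow):

```python
import json

def is_valid_flat_json(raw: str) -> bool:
    """Check the two conditions mentioned in the prompt: the answer
    must parse as JSON and must not contain nested objects/arrays."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return not any(isinstance(v, (dict, list)) for v in data.values())
```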

This should also be possible if you force the LLM to answer with a yes or no (in a JSON again). It will add one more layer to the construct, so maybe keep it simple at first. The model might also have enough knowledge stored to make the distinction itself. And since your files seem to have a German background, you could experiment with a dedicated German LLM. Ollama also seems to have such models, but I have not yet tried them.

Hyperscalers (think Azure) also provide advanced methods to process all sorts of data (PDF, images, voice) and bring them into vector stores. But of course they will charge you for that, and then your data is on someone else’s computer…

The only way to know is to give it a try. Ollama and GPT4All seem to support all kinds of GPUs, but you might have to try a few things. Support for Apple Silicon seems to be quite decent (50+ tokens/second on my M1). This field is always evolving; some people use external GPUs. It will depend on your business case. Running a few hundred cases overnight might be OK; if you want thousands, you might have to set up something else. Also think about error handling and restore loops if something goes wrong - KNIME can help you with that too.

Thank you. I’ll check it out within the next week if my workload allows.

In other words, your solution for KNIME 4.7 might be more stable and less prone to breaking due to the rapid pace of development? It seems better to have a stable version than one that occasionally breaks due to changes in GPT4All or other updates, right?

That’s my hope too. Having one prompt that works for all would be ideal for simplicity, but it will depend heavily on the layouts I encounter. Since not every bank provides digital invoices as of 2024, this solution won’t work for all invoices, but it should handle the digital ones, significantly reducing the workload.

For now, but I’m not sure if that will always be the case. I’m not very familiar with the stock market, but customers could receive invoices in English from banks outside of Germany. Therefore, having an LLM that handles both languages, like Llama, would be ideal. AI is advancing quickly, and newer models like this one seem quite promising for the future: Meta Unveils Chameleon: Early Fusion AI Outperforms GPT-4 in Text and Image Tasks.

You’re right. I’ll need to review that article to build a more robust system that can handle errors effectively.

I’ll try it out and post my results here. Can I see Token Speed within KNIME? I’m fairly certain we’ll upgrade to an M2 in the future and integrate it into our data center.

1 Like

I hope so, since it uses more standard nodes (one example uses just the REST API without fancy Python). Though you might want to run it under the new 5.2 or 5.3. The two dependencies would then be either GPT4All or (mostly) Ollama.

In GPT4All you see the token speed in the bottom-right corner. I use that as an indicator; it also matches my impression when running the thing in KNIME, which itself does not have such an indicator as far as I know.

In general: there are fancy developments and hypes all around. The challenge is to bring this into an (initial) system that produces results. Maybe reduce manual effort by some 20% or so and then take it from there. Projects that I see failing think they can just throw the data infrastructure they have neglected for years at an LLM and be done with it.

And always ask yourself: could this have been a RegEx or a parser? Or maybe force someone in the process chain to provide the data as a regular dataset (maybe CSV). Parsing and extracting data from PDFs is … well … maybe not the optimal thing to do …

I’ll probably wait for the release of 5.3, which is just days away, and then install it to try your provided workflow. Although using the newer nodes you mentioned seems attractive, a more robust solution might be preferable. However, if one doesn’t update GPT4All or similar on a production system but instead uses a second development system to test new things and updates, working with the newer nodes should be quite feasible.

Understandable. We have had a working and established infrastructure for years and are now looking to reduce the manual workload of entering these values into our program, which is quite time-consuming for us and other users.

I’ve tried using this and other tools set up in Ubuntu before asking here whether KNIME could offer a better solution. Writing RegEx patterns and outputting the results worked quite well for one or two banks, but as the number of different banks increases, a more flexible solution seems necessary.

Hi @mlauber71,

I’ve tried importing your workflow into my local KNIME, but it seems KNIME can’t find the “Regex Extractor.” Could you guide me on where to find the correct node for the current KNIME version, or if there’s a need to replace it?

Thank you!

@jannikw99 you need to install Palladian for version 5.2 separately. I think they will bring a version for 5.3 in the future.

Best to load the whole workflow group.

Edit: I will try and replace the node with the new Expression node from KNIME

1 Like

Thank you @mlauber71
Did you ever encounter this problem?

[screenshot of the error message]

I thought it might be about file size, but I guess it is about RAM, right? My virtual machine has 8 GB of RAM; I had hoped this would be enough for testing.