Local LLM Vector Store from PDFs with KNIME and GPT4All

Dear @mlauber71,

After reading your article on “Creating a Local LLM Vector Store from PDFs with KNIME and GPT4All,” I am fascinated and curious to know if it’s feasible to apply the same approach to analyze data from, let’s say, 100 different PDF files with varying page counts to derive quantitative insights and visualize the results.
Or is it possible to ask the same question of 100 PDF files, and then quantify and visualize the answers together?

Could you suggest any modifications to the workflow outlined in your article to accommodate this scenario?

Thanks in advance!

@reza_nadaf I think this is possible. The path could involve this:

If you want to run this locally, a problem with the GPT4All nodes might have to be fixed first:

https://forum.knime.com/search?q=AP-22323

You will have to prepare and process the text from the PDFs. If you need pictures or tables, there are packages that can help with that, but that would involve some work and Python coding.

You will then have to find the right LLM and find prompts that can extract results while providing content from a RAG store. You will have to calibrate how much a local model can handle in a reasonable amount of time. Check the settings of the nodes: there are some predefined limits, temperature, number of lines, basic prompt patterns, etc.

One idea could be to let an LLM do the prompting: ask the model what would be a good prompt, or use a two-step approach such as feeding a condensed and cleaned version of the text to another prompt.
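A minimal sketch of the two-step idea, in plain Python rather than KNIME nodes. The function name `two_step_answer` and the `ask_model` callable are hypothetical placeholders for whatever model call you end up using (a GPT4All node, a Python script, an API); the prompt wording is only an example and would need tuning.

```python
from typing import Callable


def two_step_answer(raw_text: str, question: str,
                    ask_model: Callable[[str], str]) -> str:
    """Condense the text first, then answer the question from the condensed version.

    `ask_model` is a hypothetical stand-in: any function that takes a prompt
    string and returns the model's reply as a string.
    """
    # Step 1: ask the model for a shorter, cleaned version of the text.
    condense_prompt = (
        "Summarize the following text in a few sentences, "
        "keeping all numbers and names:\n\n" + raw_text
    )
    condensed = ask_model(condense_prompt)

    # Step 2: ask the actual question against the condensed text only.
    answer_prompt = f"Context:\n{condensed}\n\nQuestion: {question}"
    return ask_model(answer_prompt)
```

Whether the extra round trip pays off probably depends on how slow the local model is and how noisy the extracted PDF text turns out to be.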

If you want results you can reuse, you might have to ‘teach’ the model to answer in a defined JSON format, so that you will be able to collect the information.
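As a rough illustration of the JSON idea: the prompt spells out the keys you expect, and a small parser pulls the JSON object out of the reply even if the model wraps it in extra chatter. The key names `answer` and `value` and both function names are assumptions made up for this sketch, not something the KNIME nodes prescribe.

```python
import json
import re

# Hypothetical prompt template; the keys "answer" and "value" are examples only.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n"
    "Reply with a single JSON object with the keys "
    '"answer" (string) and "value" (number or null).\n\n'
    "Context:\n{context}\n\nQuestion: {question}\n"
)


def build_prompt(context: str, question: str) -> str:
    """Fill the template with the retrieved context and the question."""
    return PROMPT_TEMPLATE.format(context=context, question=question)


def parse_json_answer(raw: str):
    """Return the JSON object found in a model reply as a dict, or None.

    The search spans from the first '{' to the last '}' in the reply, so
    text before or after the object is ignored.
    """
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Replies that cannot be parsed come back as `None`, so they can be counted or retried instead of breaking the collection step.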

You could try asking for Python code for some graphics. I am not sure a model other than ChatGPT 4 (or similar, like Gemini) can handle that.

You would have to bring all of that together in a process that can run in a loop and collect the results.
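The loop could look roughly like this if it were written in Python (in KNIME it would be a loop over the files instead). Everything here is a sketch under assumptions: `ask_model` is a hypothetical callable standing in for the real model call, the file names and texts are invented, and `fake_model` exists only so the example runs without an LLM.

```python
from collections import Counter
from typing import Callable, Dict, List


def ask_all_documents(documents: Dict[str, str], question: str,
                      ask_model: Callable[[str], str]) -> List[dict]:
    """Ask the same question for each document and collect one row per file."""
    rows = []
    for name, text in documents.items():
        prompt = f"Context:\n{text}\n\nQuestion: {question}\nAnswer briefly."
        try:
            answer = ask_model(prompt).strip()
            error = ""
        except Exception as exc:  # keep the loop running if one file fails
            answer, error = "", str(exc)
        rows.append({"file": name, "answer": answer, "error": error})
    return rows


def count_answers(rows: List[dict]) -> Counter:
    """Tally identical answers (case-insensitive), skipping failed files."""
    return Counter(r["answer"].lower() for r in rows if not r["error"])


# Stand-in for a real model so the sketch is runnable (hypothetical logic).
def fake_model(prompt: str) -> str:
    return "Yes" if "solar" in prompt else "No"


docs = {  # invented example texts, one entry per PDF
    "a.pdf": "Report on solar power.",
    "b.pdf": "Report on coal.",
    "c.pdf": "Notes on a solar farm.",
}
rows = ask_all_documents(docs, "Does the document mention renewable energy?", fake_model)
counts = count_answers(rows)
```

The resulting `rows` table has one line per file, which is the shape you would want for a bar chart or any other downstream aggregation.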

You could discuss this plan with an LLM and see if there are other ideas. This all may also depend on the topic of your PDFs.

If the data is there in a (semi-)structured way, using regex and some text cleaning might be better. Or do that first, maybe extract tables with data (think of the Camelot Python package), and possibly feed them as information to the model.
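To give an idea of the regex route: the sketch below cleans up whitespace and then pulls labelled amounts out of the text. The pattern and the sample line format ("Total revenue: 1,234.50 EUR") are purely hypothetical; your PDFs will need their own pattern.

```python
import re

# Hypothetical pattern for lines such as "Total revenue: 1,234.50 EUR".
AMOUNT_PATTERN = re.compile(
    r"(?P<label>[A-Za-z ]+):\s*(?P<amount>\d[\d,]*(?:\.\d+)?)\s*(?P<unit>[A-Z]{3})"
)


def clean_text(text: str) -> str:
    """Collapse runs of whitespace, including line breaks, into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def extract_amounts(text: str):
    """Return a list of dicts with label, numeric amount and unit."""
    results = []
    for m in AMOUNT_PATTERN.finditer(clean_text(text)):
        results.append({
            "label": m.group("label").strip(),
            # Drop thousands separators before converting to a number.
            "amount": float(m.group("amount").replace(",", "")),
            "unit": m.group("unit"),
        })
    return results
```

If a pattern like this already catches most of what you need, the numbers can go straight into a table, and the LLM is only needed for the parts that are not structured.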
