I’m playing around with the AI nodes and vector store and have already worked through some documentation, articles and examples. My goal is to talk with my data a little bit. As far as I can see, all the examples follow this logic:
Define question
Ask the vector store
Create an augmented prompt
LLM Prompter
As long as the question is very specific and the data is accurate, this works in most cases. For very generic questions on a dataset with ~10k entries, however, this approach struggles with the results from the vector store, so I want to structure the process differently. Why does it struggle? Because the data in the VS doesn’t match the questions directly, even though the relevant information is in there.
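For reference, the logic above in rough code form (just a sketch; `embed`, `vector_store.search` and `llm.complete` are placeholders for whatever embedding model, vector store and LLM Prompter are actually used):

```python
# Rough sketch of the "retrieve first, then prompt" flow described above.
# embed(), vector_store.search() and llm.complete() are placeholders for
# whatever embedding model, vector store and LLM Prompter are actually used.

def answer_with_rag(question: str, vector_store, llm, k: int = 5) -> str:
    query_vector = embed(question)                 # 1. define + embed the question
    hits = vector_store.search(query_vector, k=k)  # 2. ask the vector store
    context = "\n".join(hit.text for hit in hits)  # 3. create an augmented prompt
    prompt = (
        "Answer the question using the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm.complete(prompt)                    # 4. LLM Prompter
```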
My ideal process would look like:
Define question
LLM Prompter
LLM looks up the VS for additional data
Answer to the question including the enriched data
Example to make it clearer: the data contains information about, let’s say, 10k different hammers. The question is “How do I drive a nail into a wall?”. The response without the data would be “Take a hammer”; the result with the data should be “Take hammer AB5000, it fits best for this job”.
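Very roughly, in code-like terms, I imagine something like this (just a sketch; `llm.chat_with_tools`, `vector_store.search` and `embed` are placeholders, not real APIs):

```python
# Rough sketch of the flow I have in mind: the LLM decides when to look
# things up. llm.chat_with_tools(), vector_store.search() and embed() are
# placeholders for whatever agent framework and vector store are used.

def lookup_products(query: str, k: int = 5) -> list[str]:
    """Tool the LLM can call to fetch matching entries from the vector store."""
    return [hit.text for hit in vector_store.search(embed(query), k=k)]

question = "How do I drive a nail into a wall?"
answer = llm.chat_with_tools(
    system=(
        "When the user describes a task, look up which product fits best "
        "and name that product in your answer."
    ),
    user=question,
    tools=[lookup_products],
)
# Desired style of answer: "Take hammer AB5000, it fits best for this job."
```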
Did I make a mistake in my way of thinking, or do I have to look into another process or logic, e.g. the new “Agentic AI” tools?
I understand that you are trying to make your RAG give more accurate answers and are struggling because the LLM is not smart enough to give you the answer you are expecting, since the information (i.e. the added value) is “hidden” in the data (i.e. the raw table).
Since LLMs are definitely not smart entities, maybe you can think about providing your LLM with some tools that will at least make it look intelligent.
Back to your example: what makes the AB5000 the best hammer? Maybe it has the best value for money? Expecting the LLM to calculate the value for money of all the hammers in the table and then pick the best one is… just unrealistic, but you know that already.
So the first thing that you could do is add some more data to the VS. For example, compute the price-quality ratio and save it in the table. Then it’s just a matter of retrieving the entry with the highest ratio, adding it to the prompt and letting the LLM formulate the appropriate answer. And that’s more or less what you were asking for.
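As a minimal sketch of that idea (plain pandas, with made-up column names like `price` and `quality_score`, purely for illustration):

```python
import pandas as pd

# Minimal sketch: enrich the product table with a derived value-for-money
# column before indexing, so the "best" hammer can be retrieved directly.
# Column names (price, quality_score) are made up for illustration.
hammers = pd.DataFrame({
    "model": ["AB5000", "XH200", "T90"],
    "price": [39.0, 25.0, 55.0],
    "quality_score": [9.2, 5.5, 8.1],
})

hammers["value_for_money"] = hammers["quality_score"] / hammers["price"]
best = hammers.sort_values("value_for_money", ascending=False).iloc[0]

# This retrieved entry is then added to the prompt for the LLM to verbalise.
prompt_context = (
    f"Best value-for-money hammer: {best['model']} "
    f"(score {best['value_for_money']:.3f})"
)
print(prompt_context)
```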
But I’d like to take you one step further. Let’s provide the LLM with all the means to:
Retrieve a table of products of the same category (hammers, knives, screwdrivers…)
Apply custom row-by-row expressions to a given data table (sum column values, calculate a ratio, etc.)
Sort the table (by date of issue, by customer review, by value for money…)
Select the first n entries of a table
In this way, once you have explained to the LLM how to define the best hammer, it will be able to do the job of:
getting the list of hammers
calculating the value for money for each of them
sorting by value for money
selecting the first hammer in the list
formulating an answer
And this can of course be repeated and adapted as soon as new questions come in.
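In plain Python terms, those tools could be as thin as the functions below. This is only a sketch: in KNIME they would of course be nodes or workflow segments exposed to the agent, and the column names are made up.

```python
import pandas as pd

# Plain-Python sketch of the four tools; in KNIME these would be nodes or
# workflow segments exposed to the agent. Column names (category, price,
# quality_score) are made up for illustration.

def retrieve_category(products: pd.DataFrame, category: str) -> pd.DataFrame:
    """Retrieve a table of products of the same category."""
    return products[products["category"] == category]

def add_expression(table: pd.DataFrame, name: str, expr: str) -> pd.DataFrame:
    """Apply a row-by-row expression, e.g. 'quality_score / price'."""
    return table.assign(**{name: table.eval(expr)})

def sort_table(table: pd.DataFrame, by: str, ascending: bool = False) -> pd.DataFrame:
    """Sort the table by a given column."""
    return table.sort_values(by, ascending=ascending)

def select_first(table: pd.DataFrame, n: int = 1) -> pd.DataFrame:
    """Select the first n entries of the table."""
    return table.head(n)

# The agent's plan for "which hammer fits best?" then chains these tools:
# hammers = retrieve_category(products, "hammer")
# scored  = add_expression(hammers, "value_for_money", "quality_score / price")
# best    = select_first(sort_table(scored, by="value_for_money"), 1)
```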
Basically, you have provided your LLM with tools that can perform transformations on the data and extract the information that is needed at that specific point. The cool thing about doing it with the new KNIME Agentic Framework is that all the deterministic stuff (i.e. selecting all the hammers, calculating the ratio between the price and the quality of each hammer and selecting the one with the highest value) is done by KNIME nodes, which are reliable and auditable. This can all be done without exposing data to the LLM and without opening up the stage for the hallucinations or misinterpretations that are typical of LLMs.
I hope that answers your question! Agents are indeed cool once you understand where they shine, and… I think your example makes a very nice use case!
Hi @emilio_s
Thank you for your very detailed answer. I see I have to get a lot deeper into the details here at my desk, especially with the agents. Not sure if I will get the desired result, but I’m pretty sure I will learn a lot on the way there.
So thank you once again and have a nice weekend.
If you have 10k hammers in your data set, then finding the right hammer “semantically” (that is what embeddings do) will be quite challenging, I think. In the end you retrieve, let’s say, the 10 most relevant sets of data based on the embedded vector, and they will all be hammers.
I think if, for each hammer type, you had a description that specifies in which scenarios the hammer is most useful, and you then sent in a question like “I need to drive a 10 inch iron nail through a coated concrete wall that may have some bricks 5 cm in”, the embedding pathway might work better.
If you have 10k hammers and there is a lot of overlap between what different models can achieve, I’d probably try to define, say, 10 properties, use an LLM to analyse the descriptions and fill in these properties (using structured outputs), and then build a data app on top of that data where the user can select nail length, wall type and whatever else, and you list the hammers that fit those criteria ordered by price ascending.
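Roughly like this (a sketch only; `llm_extract` is a placeholder for whatever structured-output call you use, and the property names are just examples):

```python
import pandas as pd

# Sketch of the "structured outputs" idea: an LLM turns each free-text
# description into a fixed set of properties, and a data app filters on them.
# llm_extract() is a placeholder, and the property names are just examples.

PROPERTIES = ["max_nail_length_mm", "suitable_wall_types", "weight_g", "price_eur"]

def extract_properties(description: str) -> dict:
    # In practice: one LLM call per description, constrained to a JSON schema
    # containing PROPERTIES (structured outputs).
    return llm_extract(description, schema=PROPERTIES)

# hammers = pd.DataFrame(extract_properties(d) | {"model": m}
#                        for m, d in descriptions.items())

def matching_hammers(hammers: pd.DataFrame, nail_length_mm: int, wall_type: str) -> pd.DataFrame:
    """Data-app style filter: hammers that fit the criteria, price ascending."""
    fits = (
        (hammers["max_nail_length_mm"] >= nail_length_mm)
        & hammers["suitable_wall_types"].apply(lambda types: wall_type in types)
    )
    return hammers[fits].sort_values("price_eur")
```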
Thank you @MartinDDDD. No need to worry, you are not too late. Every contribution is welcome. I’m pretty sure I have to go back to the desk and rethink my problem first and the possible solution in a second step. That’s what I read and take away between your lines, too.