Important Questions about LLM integrations and nodes

@paolotamag Thanks for this amazing presentation in Hamburg. These are the questions that I did not get an answer to in the chat. I'm leaving them here so that the whole KNIME community can benefit:

  1. Training Data Format: Could you elaborate on the required format for training data in KNIME? Specifically, is there a straightforward method to convert a collection of PDFs into a format suitable for model fine-tuning, or does it have to be transformed to JSONL before it can be used for fine-tuning? Or is what happens in the workflow “KNIME AI Learnathon - Build a Chat Bot on a Knowledge Base - Exercise” not fine-tuning at all?
  2. Advanced Features in Image Recognition: Regarding OpenAI’s image recognition capabilities, does KNIME offer a feature for directly importing and fine-tuning image data from a folder for custom model development?
  3. Data Privacy and Security: Lastly, a question on data confidentiality: when we upload training data (in the form of PDFs to create our knowledge base), does KNIME or OpenAI retain a copy or any embeddings of our data?

Thanks.

AG.


Hi there,
Thanks for attending yesterday.
Here are a few answers.

Tagging here Adrian Nembach (@nemad) from our dev team, in case he wants to add anything else.

1) Training Data Format and Fine-tuning

KNIME does not yet offer fine-tuning via these nodes. These nodes offer the opportunity to customize the AI behaviour without any partial re-training of the LLM. The results are quite promising though: essentially you design an automated similarity search over your knowledge base (text data in no particular format), and the retrieved text gets added to the prompt of the model. In the case of the Agent node, LangChain runs inside the node and exposes the vector store as a tool that the AI can decide to use, or not, depending on the user input.
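Just to make the retrieval idea concrete, here is a rough Python sketch of what happens conceptually. The embed() and complete() helpers are placeholders for whatever embedding and chat models you connect in the workflow; this is not the internal code of the KNIME nodes.

```python
# Minimal sketch of the RAG idea:
# 1) embed the knowledge-base chunks, 2) find the chunks most similar
# to the user question, 3) prepend them to the prompt sent to the LLM.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return an embedding vector for `text`."""
    raise NotImplementedError

def complete(prompt: str) -> str:
    """Placeholder: send `prompt` to a chat model and return its answer."""
    raise NotImplementedError

def answer_with_rag(question: str, chunks: list[str], top_k: int = 3) -> str:
    # "Vector store": one embedding per knowledge-base chunk.
    store = np.stack([embed(c) for c in chunks])
    q = embed(question)
    # Cosine similarity between the question and every chunk.
    sims = store @ q / (np.linalg.norm(store, axis=1) * np.linalg.norm(q))
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[::-1][:top_k])
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return complete(prompt)
```

The key point is that the LLM itself is never re-trained; only its prompt is enriched with the retrieved context.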

To dive deeper into this topic of Agents and Retrieval Augmented Generation (RAG), I recommend this webinar:

In Adrian’s words: “RAG does not fine-tune a model but enhances its knowledge by providing additional information that relates to the user’s prompt.”

Also check this blog:

2) Advanced Features in Image Recognition

The AI Extension does not support image recognition yet. However, we have examples for VGG that you can build or load in KNIME via the Keras and TensorFlow deep learning integrations, showcasing how one could fine-tune such an old-school neural network with KNIME.
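For reference, here is a rough Keras sketch of that kind of fine-tuning on images read from a folder. The folder path, image size, and number of classes are only example assumptions, not something shipped with KNIME.

```python
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16

num_classes = 2  # assumption: adjust to your own dataset

# Read images from a folder whose sub-folders name the classes.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(224, 224), batch_size=32
)

# Pre-trained VGG16 backbone, frozen so only the new head is trained.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.applications.vgg16.preprocess_input(inputs)
x = base(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=3)
```

In KNIME the same steps are expressed with the deep learning integration nodes instead of Python code.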

3) Data Privacy and Security

The KNIME AI Extension does not share any data with KNIME (not to be confused with the KNIME AI Assistant, the panel that helps you build workflows).

When you use an OpenAI model (or Azure OpenAI model) with our nodes, you share the data with OpenAI (or Azure). If and how OpenAI stores your data depends on your agreement with them and is completely unrelated to KNIME. KNIME does not store anything as you use the nodes: the data flows directly from your laptop to these online models, outside of KNIME.

With KNIME AP 5.1 we offered GPT4All, an open source project sharing a common format to run a number of open source models locally. You can download the models at “https://gpt4all.io/” and then load them via the KNIME node.

With KNIME AP 5.2 we enhanced this further by including a GPT4All Chat Model node and an Embeddings4All Connector node that run models on the local machine.
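Conceptually, these nodes do something similar to the gpt4all Python package; here is a rough sketch. The model file name is only an example, and any model downloaded from https://gpt4all.io/ works the same way.

```python
from gpt4all import GPT4All

# Example model file; GPT4All downloads it locally if it is not present.
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")

# Run the chat model entirely on the local machine, no data leaves it.
with model.chat_session():
    reply = model.generate("Explain what a vector store is.", max_tokens=200)
    print(reply)
```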

Performance when running these large models locally can be an issue. The Embeddings4All model is actually fairly performant. Unfortunately, the same cannot be said for any remotely useful chat model, because those were not designed for consumer hardware.

KNIME is working towards solutions that can easily integrate with current and future open source models that can be deployed locally in your organization's network. Please hang tight in the meantime, and feel free to test those open source GPT4All models on your laptop for now.

