Issue getting local Llama model to work with workflow.

I downloaded the "Use LLama3 as Local Model to perform sentiment analysis on Customer Review" workflow to test it out.

To check that the workflow itself works, I first replaced the nodes before the LLM Prompter node with OpenAI nodes, and I received a response after running the workflow. Refer to video recording: KNIME Analytics Platform - 12 August 2024 | Loom

However, after I replaced the OpenAI nodes with the LocalGPT4All Chat model connector and loaded my local Llama3 model, I noticed that the output is "" instead of the response I received when my OpenAI nodes were connected. Refer to video recording: KNIME Analytics Platform - 12 August 2024 | Loom

I would like to ask whether the "" output is due to (1) the wrong Llama3 model being downloaded, (2) wrong configuration settings in my LocalGPT4All Chat model connector, or (3) the Llama3 8B model simply not being good enough to handle this workflow?

My “LocalGPT4All Chat model connector” can be seen in the video recording.

I downloaded my Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf using LM Studio.

Additionally, I would like to ask whether there is any KNIME connector that allows me to call a local LLM, such as Meta-Llama-3.1-405B stored on a centralised server, via an API within my network without going out to the internet. Or can I store it on a file share so everyone on my team can access the model from a shared drive (not sure if that is efficient given how big the model is)?

I would like to explore using the most advanced local LLM, but stored centrally so every team member running KNIME can call it. Looking for the best solution to this.

That is possible. You can change the base URL in the OpenAI Authenticator node (1) to any endpoint that is compatible with the OpenAI API - e.g. instead of using GPT4All you can use LM Studio to serve your Llama3.1 locally and expose it through the OpenAI API structure. After providing the base URL in the authenticator node, you then need to provide the model name in the OpenAI Chat Model Connector (2):

(1)


(2)
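
For illustration, this is roughly what the two settings correspond to behind the scenes. A minimal sketch of an OpenAI-compatible request, assuming LM Studio's local server is running on its default port 1234 and a hypothetical model name - the actual base URL and model name depend on your setup:

```python
# Sketch of an OpenAI-compatible request against a locally served model.
# Assumes LM Studio's local server on its default port (1234); adjust to your setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # corresponds to field (1): base URL in the OpenAI Authenticator
    api_key="not-needed-locally",         # local servers typically accept any placeholder key
)

response = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",   # corresponds to field (2): model name in the OpenAI Chat Model Connector
    messages=[{"role": "user", "content": "Classify the sentiment of: 'Great product, fast delivery!'"}],
)
print(response.choices[0].message.content)
```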

I run my models via Ollama, so I'm afraid I have never used GPT4All and cannot help with that.


Cool, I just downloaded Ollama and the llama3.1 8b model. It's running very slowly - a simple prompt takes around 10 minutes, where ChatGPT-4 takes only 1-3 seconds.

Which model are you using?

I’m also running the 8bn parameter model.

May I ask what hardware you are using and what OS you are on (Windows / macOS)? E.g. I'm running a laptop RTX 4060 with 8 GB of VRAM on Windows, and any model up to 8bn parameters typically has decent response times…

A model that I personally like and that produces good responses despite being "smaller" is phi3 (4b).

May I ask if the speed at which the LLM is processed in this case is based on the hardware specifications of the server hosting Ollama? For example, if Ollama is hosted locally on my own PC versus being accessed via API from a server with better specifications, will the processing speed of the LLM change?

@bluecar007 the performance of a local LLM will very much depend on the performance of your hardware.

Concerning Ollama, I have several examples of how to use it with KNIME. A basic version without a vector store will just use a REST API query to connect to the service running the local LLM.
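
As a rough sketch of such a basic REST query (assuming Ollama is listening on its default port 11434 and llama3.1 has already been pulled; the prompt is just an example):

```python
# Minimal REST query against a locally running Ollama instance.
# Assumes Ollama's default port 11434 and that "llama3.1" has been pulled.
import requests

payload = {
    "model": "llama3.1",
    "prompt": "Classify the sentiment of this review: 'The packaging was damaged.'",
    "stream": False,  # return a single JSON object instead of a token stream
}
resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["response"])
```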


Thanks for sharing.

Basically what I did was download the workflow 1.1 - Simple OpenAI Prompt.

Then I changed the settings of the first three connector nodes to connect to Ollama (model llama3.1:latest) via its API, hosted on my laptop. I ran the workflow and it took close to 17 minutes.

After that I changed the AI connector to use ChatGPT 3.5 Turbo, and the same workflow took less than 30 seconds.

I have recorded it all on video. The Ollama (model llama3.1:latest) run lasts until around 18-18.30 minutes of the video; the ChatGPT 3.5 Turbo run is the last 30 seconds.

I need some advice on whether I am using the wrong nodes or my settings are wrong. It's strange that the llama3.1:latest run took so long - it just does not make any sense.
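
For reference, one way to check whether the time is spent in the model itself rather than in the KNIME nodes would be to time the same prompt directly against Ollama's OpenAI-compatible endpoint outside of KNIME. A rough sketch, assuming the default port 11434 and the llama3.1:latest tag:

```python
# Rough timing check: send a prompt straight to Ollama's OpenAI-compatible
# endpoint, bypassing KNIME, to see where the time goes.
# Assumes Ollama's default port 11434 and that llama3.1:latest is pulled.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # any key works locally

start = time.time()
response = client.chat.completions.create(
    model="llama3.1:latest",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response.choices[0].message.content)
print(f"Elapsed: {time.time() - start:.1f} s")
```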

Anyway, I just uploaded my computer specs. I am using a ThinkPad X1.

I don't have any dedicated GPU, just a standard business laptop with 32 GB of RAM.

Thanks for sharing more information!

I was asking this question about hardware for two reasons:

  1. to see if you have a GPU strong enough to run an 8b model with decent inference times
  2. if 1) is a yes, to identify potential problems Ollama may have in using that GPU

I'm afraid to say that without an NVIDIA GPU with at least 8 GB of VRAM, it is not unusual for a model running only on the CPU to take ages… without a GPU I don't think you will get acceptable response speeds when hosting locally.

Maybe check out huggingface.com to see if you can set up a pay-as-you-go endpoint to run open-source models instead…


@bluecar007 you will have to be clear about what is running on your machine and what is running on OpenAI/ChatGPT's infrastructure - there will be massive differences. The same goes for the question of privacy: whether the data leaves your environment or not.

Also local models come in different sizes.

I get quite good local results with Apple Silicon machines - the M1-M3 processors integrate GPU and CPU, and GPT4All can use them.

