Poor LLM performance

This HuggingFace Hub workflow looked great for educational purposes.

However, the performance of the model (HuggingFaceH4/zephyr-7b-alpha), when used in the KNIME workflow, is really poor compared to using it directly through the Hugging Face Inference API (here).

This probably has to do with the configuration of the HF Hub LLM Connector. It would be really great to get this working with performance comparable to the Hugging Face portal.

Thanks!
llm problem.pdf (500.3 KB)

Hey @peleitor, sorry for the late response, but here is my take on the topic.

Short answer: You are using the same model in both instances, but the interface on Hugging Face doesn’t show what’s really going on behind the scenes with prompt engineering, message templates, configuration, and stop words. You could try to get better answers by using the HF Hub Chat Model Connector, changing the model parameters, and setting prompt template placeholders.

Longer answer: Both the Hugging Face chat interface and the KNIME workflow use the very same LLM, zephyr-7b-alpha. The problem is that we don’t know what kind of prompt engineering the chat interface on Hugging Face applies behind the scenes to generate a better answer, which makes a direct comparison hard. We do know that they do something, though, because the model gives different answers to the same prompt.

You could try the HF Hub Chat Model Connector to get better results: within the Chat Model Connector node, you can specify a system message for the model to use, as well as message templates, as stated on the model card for zephyr-7b-alpha. The template for zephyr is:

<|system|>
You are a friendly chatbot who always responds in the style of a pirate.</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>
Ah, me hearty matey! But yer question be a puzzler! A human cannot eat a helicopter in one sitting, as helicopters are not edible. They be made of metal, plastic, and other materials, not food!
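
To make that concrete, here is a minimal Python sketch (plain Python, not the KNIME node) of what this template amounts to: the chat interface effectively wraps your messages in these special tokens before the raw string ever reaches the model. The messages below are just the example from the model card.

```python
# Minimal sketch of how zephyr's chat template turns a system and user
# message into the raw prompt string the model actually sees.
system_message = (
    "You are a friendly chatbot who always responds in the style of a pirate."
)
user_message = "How many helicopters can a human eat in one sitting?"

# The special tokens (<|system|>, <|user|>, <|assistant|>, </s>) come
# straight from the zephyr-7b-alpha model card.
prompt = (
    f"<|system|>\n{system_message}</s>\n"
    f"<|user|>\n{user_message}</s>\n"
    f"<|assistant|>\n"
)
print(prompt)
```

If you use the transformers library, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` produces the same formatting from a list of role/content messages, so you don’t have to assemble the string by hand.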

You could also try changing the model configuration, for example the temperature or the maximum token length. The model tends to keep generating up to the maximum token length you specify, whereas on the Hugging Face side they likely use a stop word to end text generation earlier.
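
To illustrate what those knobs do, here is a hedged sketch of calling the same model through the Inference API with the huggingface_hub Python client. The parameter values are my own illustrative choices, not the portal’s actual configuration, and you may need a Hugging Face access token.

```python
from huggingface_hub import InferenceClient

# Illustrative sketch: same model, but with explicit generation parameters
# and a stop sequence, roughly what the hosted chat interface likely does.
client = InferenceClient(model="HuggingFaceH4/zephyr-7b-alpha")

prompt = (
    "<|system|>\nYou are a helpful assistant.</s>\n"
    "<|user|>\nExplain what a stop word does during text generation.</s>\n"
    "<|assistant|>\n"
)

response = client.text_generation(
    prompt,
    max_new_tokens=256,        # hard cap; generation ends here at the latest
    temperature=0.7,           # lower values give more focused answers
    stop_sequences=["</s>"],   # end the turn at zephyr's end-of-turn marker
)
print(response)
```

Without the stop sequence, the model will happily fill the whole token budget, which is one reason the KNIME output can look worse than the portal’s for the same prompt.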

And as a last note: a translation task might not be the best example for a completion model. I will change that part in the example workflow.

Best, Alex
