Topic modeling, sentiment and correlation calculation (word2vec alternative)

Hello, I am still quite new to the Knime context and would appreciate any help.

I am currently analyzing customer reviews and want to extract the respective service category (one or more) being reviewed and the associated sentiment.
Example review:
“The reception at the hotel unfortunately didn’t help us with our problem and persuaded us to take a day trip that was just planned as a sales event.
The organization with the bus transfer to the hotel worked out perfectly.”

My goal is to get a JSON output with the service categories (Topics) and their respective sentiment:

Transfer: positive  
Reception: negative  
Excursion: negative

A total of 50,000 reviews need to be analyzed.

Next, I want to perform a correlation calculation to determine which negative experiences are most strongly associated with low star ratings.

I initially wanted to work with word2vec, but it seems to have become legacy. Is there a newer method to replace it? I haven’t been able to find anything.
In general, when something becomes legacy, is there a place where one can find guidance or suggestions for current alternatives?

I also discovered a workflow for my use case that uses GPT4All. However, my company has not approved the use of LLMs because the reviews contain personal data. Our data protection team is currently evaluating whether local LLMs can be approved, and I need to wait for their decision.
Can you recommend a workflow for my goals?

Hey there and welcome to the forum,

based on what you describe I see a couple of options:

To start with, I’d see if either of these two methods works for you - they are part of the examples on the KNIME Hub:

If you want to go down the Transformer pathway, I’d look into using BERT first - although this already requires that you have a GPU available, and depending on your use case you may have to fine-tune…

This is documented here:

The last option you have mentioned already is LLMs. The best way - if your organisation allows it - would be to use OpenAI with Structured Outputs.
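To make that concrete, here is a minimal sketch of the kind of JSON Schema you would supply for structured outputs, together with a response that matches it. The field names (`topics`, `category`, `sentiment`) are my own illustration for your use case, not an official KNIME or OpenAI artifact:

```python
import json

# Illustrative JSON Schema for the desired review-analysis output.
# Field and category names are assumptions for this example.
review_schema = {
    "type": "object",
    "properties": {
        "topics": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "category": {"type": "string"},
                    "sentiment": {
                        "type": "string",
                        "enum": ["positive", "negative", "neutral"],
                    },
                },
                "required": ["category", "sentiment"],
            },
        }
    },
    "required": ["topics"],
}

# A response conforming to the schema, for the example review:
example_output = {
    "topics": [
        {"category": "Transfer", "sentiment": "positive"},
        {"category": "Reception", "sentiment": "negative"},
        {"category": "Excursion", "sentiment": "negative"},
    ]
}

print(json.dumps(example_output, indent=2))
```

With structured outputs the model is constrained to this shape, so parsing the 50,000 results downstream in KNIME becomes a straightforward JSON-to-table step.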

To try and convince your organisation on the data privacy side, it might be worthwhile to bring the Presidio Extension to their attention - it helps to anonymise the data that is sent to the LLM:

Here is an example workflow:

https://hub.knime.com/knime/spaces/AI%20Extension%20Example%20Workflows/5)%20Use%20Cases/Anonymize%20Sensitive%20Data%20for%20Bank%20Assistant%20Chatbots~33_t9sI8t70UmMhC/current-state
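To illustrate the general idea behind that anonymisation step - note this is plain regex masking for illustration only, not Presidio itself, which does proper NER-based detection and is far more capable:

```python
import re

def mask_pii(text: str) -> str:
    """Very rough PII-masking sketch: replaces e-mail addresses and
    phone-number-like digit runs with placeholders before the text
    is sent to an LLM. Real anonymisation should use Presidio."""
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<EMAIL>", text)
    text = re.sub(r"\+?\d[\d /-]{7,}\d", "<PHONE>", text)
    return text

masked = mask_pii("Contact Max at max.muster@example.com or +49 170 1234567.")
print(masked)
```

The point to make to the data protection team is that only the masked text ever leaves your environment.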

I have written an article on OpenAI Structured Outputs in KNIME:

And made a video about it:

and based on the results I have created an extension that makes using structured outputs a little easier:

If you want to go down the local model route: I think to run a model capable of reliably producing structured outputs, you need a GPU with at least 8 GB of VRAM. Personally I prefer Ollama over GPT4All, and as the Ollama API is OpenAI-compatible you can use the standard KNIME nodes to work with it.

There’s a blog post here on how to use Ollama with KNIME:
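As a rough sketch of why the OpenAI compatibility helps: the request you would send to a local Ollama server is the same chat-completions format an OpenAI client produces, so the standard nodes (or any OpenAI client pointed at localhost) can talk to it. The endpoint and model name below are assumptions for illustration:

```python
import json

# Ollama typically exposes an OpenAI-compatible chat endpoint here
# (default port; adjust to your setup):
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

# The same chat-completions payload an OpenAI client would send:
payload = {
    "model": "llama3",  # example name; use whichever model you have pulled
    "messages": [
        {"role": "system",
         "content": "Extract service categories and their sentiment as JSON."},
        {"role": "user",
         "content": "The reception unfortunately didn't help us. "
                    "The bus transfer to the hotel worked out perfectly."},
    ],
    "temperature": 0,
}

# Sending it would then be an ordinary HTTP POST, e.g. with urllib:
# req = urllib.request.Request(OLLAMA_URL, data=json.dumps(payload).encode(),
#                              headers={"Content-Type": "application/json"})
# response = urllib.request.urlopen(req)

print(json.dumps(payload)[:60])
```

Because the payload is identical, switching between OpenAI and a local Ollama model is mostly a matter of changing the base URL in the KNIME connector node.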

So as you can see plenty of options to explore for you :slight_smile:


@Mogedin I would add these approaches to automatic topic detection, derived from an example by KNIME and using some R packages. Note that currently this will only work under Windows.
