I need an aid for a problem of clustering and data mining

Sorry, I am new in this matter. I have taken a brief online course and I am absolutely ignorant of big data science, but I would like to improve my knowledge. I am a philosopher of science and an Italian enterprise of cruises asked me to solve a problem. I wanted to know whether I could take their FAQ and analyze them. I want to know which is the most asked question: so how can I develop a workflow starting from natural language and ending with a categorization and their ordinal order? I want to develop a workflow without knowing a priori which classes can be found.
After that I want to know about the issues “ online check in and boarding pass” which is the sentiment of the costumers. Can you give me advices? I am very new… but I really would like to learn.

There is a bunch of nodes that can help you with Text analysis

Also there is an example for a sentiment analysis

Then of course you can check out the KNIME courses about
“Text Mining Course with KNIME Analytics Platform”

And also you can check out the “Web, Text and Network Analysis” section of the Learning Hub:


I have developed a workflow. But the solution is always “undefined class” for category to class pit.
Can I load my workflow here? In order to check if I made some mistakes.
If it was correct what would I understand from that workflow?

My reviews are written in Italian. Is it a problem? The tokenizer is not in Italian. I chose the Spanish one.

Another question… searching tutorials I found a lexicon based approach . What does it mean? If I wanted to try that method, where could I find positive and negative tags for Italian?

Thanks for the support.

You could load a workflow to this forum best to do it with the data (if that is possible due to privacy rules). Then the community could have a look. Some time back I did the text analysis course, I think the slides were similar to these with (slides, data and workflows).

From what I remember language is interesting because part of typical text analysis with KNIME is to get rid of words that would not carry specific meaning like (in english) the, a … - and to reduce the other words to their basic form so you might classify each occurrence in every form. Also for sentiment detection (not your topic I understand) obviously the language does play a significant role.

1 Like

Thanks for the slides, and thanks for the support

I translated the reviews with the function of google sheets. My result is still undefined class for category to class pit.

KNIME_project5.knwf (33.1 KB)

ah yes, here the data
Foglio di lavoro senza nome.xlsx (13.3 KB)

I adapted a workflow from this webinar by KNIME and tried a few things

Please do note: I do not claim any special knowledge of Text Mining; there might lurk further quirks and (as mentioned later) more data cleaning will be needed- especially if you take into account language (English/Italian).

There remain some tasks and questions. First some more data cleaning seems to be necessary there are still names present (like Costa) where you might want to think if you want or do not want them to be present.

A cloud of the top 100 two word terms might look something like this (not sure what it is with the ‘yellow’):

A ‘pure’ cloud of just words looks like this:

You might take further inspirations from the collection of examples on the KNIME server:

kn_example_italy_text_clustering.knwf (598.5 KB)

1 Like