How can I assign pre-defined topics to verbatims?

Hello everyone,

I am quite new to KNIME and have a challenge which I hope someone can help with :slight_smile: Though I have already browsed and checked some workflows (e.g. "Topic detection based on movie reviews"), I have not found a solution yet. Here is the task:

I have a lot of verbatims from social media reviews, already in an Excel list. Moreover, I have predefined clusters such as “friendliness of staff” or “quality of repair”, also in an Excel list. Each cluster is represented by tag words or by a combination of tag words. Now I need to assign 1-3 clusters to each verbatim, depending on what the text is about, i.e. which tag words or combinations of tag words appear in the text, and hence which cluster applies. The results should be written next to the verbatim column of the original document (e.g. by using one column per topic next to the verbatims and writing “1” in the column if the topic applies).

I hope I have explained my question clearly, and I am thankful for your kind feedback.

Best regards

Sebastian

Can nobody help here?

Hi @fossil78,

you might want to give the Dictionary Tagger node a try. There is a workflow on our example server which you can use as a template: https://www.knime.com/nodeguide/other-analytics-types/text-processing/sentiment-analysis-lexicon-based-approach

You read the three dictionaries (Excel files with tags representing a cluster/class) with an Excel Reader node and apply each of them with a Dictionary Tagger node. This way all the terms in your corpus will be tagged based on your custom dictionaries. Afterwards you can perform aggregations based on the tags and assign classes to the Twitter documents based on the number of tags from the different dictionaries.
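Outside of KNIME, the core idea of the dictionary approach can be sketched in a few lines of plain Python. All cluster names and tag words below are invented for illustration; each verbatim gets a “1” for every cluster whose tag words appear in its text, matching the one-column-per-topic output Sebastian described:

```python
# Sketch of dictionary-based cluster assignment (illustrative only).
# Each cluster is a set of tag words; a verbatim is flagged "1" for a
# cluster if any of its tag words occur in the text.

clusters = {
    "friendliness of staff": {"friendly", "polite", "rude", "helpful"},
    "quality of repair": {"repair", "fixed", "broken", "workmanship"},
}

def assign_clusters(verbatim):
    """Return a {cluster name: 0 or 1} indicator row for one verbatim."""
    words = set(verbatim.lower().split())
    return {name: int(bool(tags & words)) for name, tags in clusters.items()}

verbatims = [
    "The staff was very friendly and helpful",
    "My phone came back still broken after the repair",
]

for v in verbatims:
    print(v, "->", assign_clusters(v))
```

A real setup would also need tokenization that handles punctuation and multi-word tag combinations, which is exactly what the Dictionary Tagger workflow takes care of.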

Does that help?

Cheers,
Marten

1 Like

Hi Marten,

Thank you very much for your reply, first of all! This sounds as if it is going in the right direction. I will have the chance to test it next week and will report back.

Cheers, Sebastian

Hi Marten

I have checked what you suggested and I believe I need to explain myself better: I have a given list of verbatims in an Excel table. I have already categorized a sample of 500 verbatims manually into 6 different categories. Is it possible to train a self-learning model on these assignments and then apply it to new verbatims? These new ones should be categorized based on the model, whereby one verbatim can theoretically belong to a maximum of 6 categories.

Thank you very much in advance for your thoughts.

Kind regards,

Sebastian

Hey Sebastian,

This sounds like a classic document classification problem.
There is a workflow on the example server (https://www.knime.com/nodeguide/other-analytics-types/text-processing/document-classification).

You can use your 500 samples as a training data set. Train a model and apply it to your remaining data.
The example workflow trains a Support Vector Machine (SVM) model, but you can also try out other machine learning methods.
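To illustrate the train-then-apply pattern outside of KNIME, here is a deliberately tiny bag-of-words classifier in plain Python. It is not an SVM, and all texts and labels are invented; it only shows the split between labeled training samples (the manually categorized 500) and new, unlabeled verbatims:

```python
from collections import Counter

# Toy illustration of "train on labeled samples, apply to the rest".
# NOT an SVM -- a minimal bag-of-words profile classifier; all example
# texts and labels are invented.

def tokenize(text):
    return text.lower().split()

def train(samples):
    """samples: list of (text, label) pairs. Returns word counts per label."""
    model = {}
    for text, label in samples:
        model.setdefault(label, Counter()).update(tokenize(text))
    return model

def predict(model, text):
    """Pick the label whose word profile overlaps the text the most."""
    words = tokenize(text)
    return max(model, key=lambda label: sum(model[label][w] for w in words))

training = [
    ("the staff was friendly and polite", "staff"),
    ("repair quality was poor, still broken", "repair"),
]
model = train(training)
print(predict(model, "very friendly staff"))  # -> staff
```

The KNIME workflow does the same thing with far better features (a document vector) and a proper learner; the point is only that the already-categorized rows train the model, and the Predictor node then labels the rest.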

If there are any questions, feel free to ask.

Cheers,

Julian

2 Likes

Hi Julian,

Thank you so much, this is what I was looking for. I have tested the workflow with my verbatims and it delivered a prediction accuracy of about 75%, which I am happy with. However, I tried the following amendment because I did not understand what it is for: in the metanode “Term filtering”, I replaced the Row Filter with a new Row Filter that does not get the variable input from the Java Edit Variable node (what is this node actually for?). It worked with the same result…

But I have problems applying the written PMML model to new verbatims (xls list). Am I right that I have to use the model deployment section? I would like to connect an xls table with new verbatims and an empty categories column in order to run it through the workflow. The goal would then be that the workflow writes the predicted categories next to the verbatims, or even better, assigns several categories to a verbatim if applicable (or if the predicted likelihood that a category applies is higher than x%). Could you maybe give me a hint how I could proceed?

Also, I am wondering what the Table Reader node in the Transformation section is for.

Thank you again for your help.

Kind regards & have a nice weekend.

Sebastian

Hey Sebastian,

The Java Edit Variable node is used to calculate the minimum document frequency. The flow variable which is connected to the Row Filter can then be set for one of the settings in the node dialog via the Flow Variables tab. However, you are right: you can simply discard the Java Edit Variable node and set the lower bound in the Row Filter manually.
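As a plain-Python illustration of what such a lower bound does (the corpus and threshold below are invented), a minimum-document-frequency filter keeps only terms that occur in at least `min_df` documents, which is what the Row Filter enforces:

```python
# Illustrative minimum-document-frequency filter: drop rare terms that
# occur in fewer than `min_df` documents. All example data is invented.

documents = [
    {"friendly", "staff"},
    {"friendly", "repair"},
    {"broken", "repair"},
]
min_df = 2  # e.g. computed from the corpus size, as the Java Edit Variable does

all_terms = set().union(*documents)
kept = sorted(t for t in all_terms
              if sum(t in doc for doc in documents) >= min_df)
print(kept)  # -> ['friendly', 'repair']
```

Whether the threshold comes from a flow variable or is typed in manually, the filtering result is the same, which is why Sebastian saw no change.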

You don’t need to use the deployment section at all. It is just an example to show how to use an exported model from the upper part of the workflow. You could also just extend the upper part. However, you still have to transform the data from your xls file into the same structure as the one you used to train the model. So you need the same document vector to use the Predictor node. That’s also why there is the Table Reader node in the Transformation section: it is used to extend the document vector by all vectors used for training.
After transforming your data to the required document vector table, you can predict the category. You don’t need an empty categories column, since a prediction column will be appended by the Predictor node. You can also check the option for class distributions in the node’s dialog to get columns containing the confidences for each class. After prediction you can safely append the prediction column to the table created by reading the xls file, since both should have the same number of rows and no sorting has happened.
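The “same document vector” requirement can be illustrated with a hypothetical helper in plain Python: new documents are projected onto the vocabulary used at training time, with zeros for absent terms and previously unseen terms dropped (vocabulary and texts are invented):

```python
from collections import Counter

# Sketch of the "same document vector" requirement: new documents must
# be expressed over the *training* vocabulary, with zeros for absent
# terms and unseen terms discarded. All names here are invented.

training_vocabulary = ["friendly", "staff", "repair", "broken"]

def to_training_vector(text):
    """Project a new document onto the fixed training vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in training_vocabulary]

print(to_training_vector("friendly staff friendly people"))  # -> [2, 1, 0, 0]
```

This is, in spirit, what the Table Reader plus the Transformation section accomplish: they carry the training vocabulary over so the Predictor node sees columns it recognizes.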

I hope I could help you a bit.

Cheers,

Julian

1 Like

Hi Julian,

Thank you very much for your comprehensive answer and for taking the time to deal with my problem! The real issue is my lack of knowledge - I am simply too new to KNIME to understand what you wrote, and I do not want to push it further before having learned more about KNIME. By the way, in the meantime I have made an attempt to solve the issue using the Palladian nodes, and I must say: it worked. Text classification delivered quite a high accuracy and I think, for now, that is OK for me!

Thanks again and have a nice day.

Sebastian

2 Likes