Machine Learning for Topic Classification in KNIME

Hello readers.

I am a new user of the KNIME platform and I am using it for text categorization. My data set includes a large number of item summaries, and I have a list of categories with their respective keywords. So far I have categorized more than one third of the item summaries using my keywords. Using the Reference Row Splitter, I can create two tables: one with all the rows I have categorized, and one with all the uncategorized rows.
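For readers who think more easily in code, the keyword step and the split into two tables could be sketched roughly like this in plain Python. This is only an illustration of the idea, not the KNIME workflow itself: the categories, keywords, summaries, and the `categorize` helper are all made up for the example.

```python
# Hypothetical categories and keywords; the real lists are not shown in this post.
KEYWORDS = {
    "hardware": ["printer", "monitor"],
    "software": ["license", "install"],
}


def categorize(summary, keywords=KEYWORDS):
    """Return the first category with a keyword in the summary, else None."""
    words = set(summary.lower().split())
    for category, terms in keywords.items():
        if words & set(terms):
            return category
    return None


# Hypothetical item summaries.
summaries = [
    "Printer out of toner",
    "Need a license for the editor",
    "Cannot log in after password reset",
]

labelled = [(s, categorize(s)) for s in summaries]

# Same idea as splitting the table in two: matched rows and unmatched rows.
categorized = [(s, c) for s, c in labelled if c is not None]
uncategorized = [s for s, c in labelled if c is None]
```

The `categorized` list plays the role of the training table, and `uncategorized` holds the rows a model would later need to label.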

My next step is to create a machine learning model, for example a Decision Tree or Naïve Bayes model, to apply my current categorization to the uncategorized items, but this is where the problem occurs. Every output from the predictors is the category with the most occurrences. My data only has two columns: one is the summary and the other is the category. When I look into the model, it seems that the only numerical value it uses is the number of occurrences of each category, which would explain why the predictor only outputs the most frequent category.

I am currently stuck and I have no idea how to proceed. Can someone please help me? Thank you!

Hello @WWang

This sounds like a nice text processing project!

One of the steps in a text processing project is the transformation of the text into a numerical representation that an algorithm can handle. Otherwise each string / item summary just looks like a different value to the algorithm.
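To make the idea concrete outside of KNIME, here is a minimal bag-of-words sketch in plain Python. It is an illustration under assumptions: the example summaries and the helper names (`tokenize`, `build_vocabulary`, `vectorize`) are hypothetical, and the KNIME text processing nodes do more than this (for example filtering and stemming).

```python
from collections import Counter

PUNCTUATION = ".,!?;:"


def tokenize(text):
    """Lowercase the text and split it into words, dropping edge punctuation."""
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [tok for tok in tokens if tok]


def build_vocabulary(documents):
    """Map every distinct term to a fixed column index."""
    terms = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def vectorize(text, vocabulary):
    """Turn one text into a list of term counts, one entry per vocabulary term."""
    vector = [0] * len(vocabulary)
    for term, count in Counter(tokenize(text)).items():
        if term in vocabulary:
            vector[vocabulary[term]] = count
    return vector


# Hypothetical item summaries.
docs = ["Printer is broken", "Broken monitor", "Install the printer driver"]

vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]
```

As raw strings, the three summaries are simply three different values. As vectors, the first and second share the column for "broken" and the first and third share the column for "printer", which is the kind of overlap a Decision Tree or Naïve Bayes learner can actually use.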

This workflow shows you all the steps involved in text classification.

And in this webinar we go through the different steps using the example of sentiment analysis.



Hi @Kathrin, thank you so much for the reply! I will take a deeper look at the resources you posted.

Actually, I successfully tackled the problem last night. I first realized that the categories I had assigned were simply strings, so I assigned categorical values first. Then I vectorized the keywords, applied a model to them, and successfully generated prediction outputs :)
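In case it helps someone who finds this later, the approach looks roughly like the sketch below when written out in plain Python. It is not my actual workflow: the training rows are invented, the function names are hypothetical, and it is a small hand-written multinomial Naïve Bayes with add-one smoothing rather than the KNIME learner and predictor nodes.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train_naive_bayes(rows):
    """Count documents per category and word occurrences per category."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in rows:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocabulary.add(tok)
    return class_counts, word_counts, vocabulary


def predict(text, model):
    """Return the category with the highest log-probability for the text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # prior from category frequency
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][tok] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training rows: (summary, category).
train = [
    ("printer out of toner", "hardware"),
    ("monitor flickers", "hardware"),
    ("printer jam again", "hardware"),
    ("license expired", "software"),
    ("install fails with error", "software"),
]
model = train_naive_bayes(train)
```

With word counts as features, `predict("license install error", model)` picks "software" even though "hardware" has more training rows. A summary made only of unseen words, such as `predict("something unrelated", model)`, falls back to the most frequent category, which is the same behaviour I was seeing before vectorizing.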

The only remaining problem is to further improve the algorithm and maybe reduce the number of words, since it is running extremely slowly.
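One way I am considering to reduce the number of words is to drop stop words and terms that appear in very few summaries before vectorizing. A rough plain-Python sketch of that idea follows. The stop word list, the `min_doc_freq` threshold, and the example summaries are assumptions for illustration, and I have not measured how much this helps on my data.

```python
from collections import Counter

# Small illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "is", "for", "and", "to", "in"}


def reduce_vocabulary(documents, min_doc_freq=2, stop_words=STOP_WORDS):
    """Keep terms that occur in at least min_doc_freq documents and are not stop words."""
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        term
        for term, freq in doc_freq.items()
        if freq >= min_doc_freq and term not in stop_words
    )


# Hypothetical item summaries.
docs = [
    "the printer is out of toner",
    "the printer driver is missing",
    "license for the editor expired",
    "license renewal for the team",
]

full_vocabulary = {word for doc in docs for word in doc.split()}
kept = reduce_vocabulary(docs)
```

Here the 14 distinct words shrink to the two that both recur and carry meaning. Fewer columns in the document vectors should mean less work for the learner, at the risk of throwing away rare but informative words if the threshold is set too high.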
