Text classification


I have a question about text classification. I have transaction data with a free-text field in which users ask for some kind of help. I want to assign each transaction to one of a set of predefined categories (more than 70) based on that text, i.e. by inferring what the user is asking for from the combination and sequence of words.

I don’t have a labeled data set, just a large uncategorized list of texts (around 2M records).

I want to know how to change flat tire -> car maintenance
I have lost my key in the shop -> Lost & found
I forgot my book on table -> Lost & found
I can’t find my car -> Security
My car was stolen -> Security
I’m not feeling well -> health
I’m tired -> health
I’m facing problem with my van , it isn’t working -> car maintenance
I’m facing problem with my van ,I can’t find it -> Security
I can’t find my book -> Lost & found

How can this be done in KNIME?


Three options:

  1. hand-label some of your data and train a classifier
  2. hand-craft some rules or dictionary, e.g. “text contains ‘feeling’ -> class is ‘health’”
  3. have a look at the “active learning” nodes to ease (1). I have never tried this myself and would be interested in hearing about any experience with this approach

Whichever option you choose, keep in mind that 70 categories is quite a lot, which makes the classification task difficult.
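To make option 2 concrete, here is a minimal sketch of a keyword dictionary in plain Python rather than KNIME nodes. The rule list and the function name `classify_by_rules` are made up for illustration, and the keywords are only guesses based on the example sentences in the question.

```python
import re

# Hypothetical keyword rules; the first rule whose keyword matches wins.
RULES = [
    ("stolen", "Security"),
    ("flat tire", "car maintenance"),
    ("feeling", "health"),
    ("tired", "health"),
    ("lost", "Lost & found"),
    ("forgot", "Lost & found"),
]


def classify_by_rules(text, rules=RULES, default=None):
    """Return the category of the first matching rule, or `default`."""
    normalized = text.lower()
    for keyword, category in rules:
        # \b keeps "tired" from matching inside e.g. "retired".
        if re.search(r"\b" + re.escape(keyword) + r"\b", normalized):
            return category
    return default


print(classify_by_rules("I want to know how to change flat tire"))
print(classify_by_rules("I can't find my car"))
```

The second call falls through to the default because no keyword matches. Sentences like “I can’t find my car” versus “I can’t find my book” show where simple keyword rules run out: the category depends on the combination of words, not on a single one.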

– Philipp

Thanks Philipp,

Can Word2Vec be used in a case like the one above?
That is, is it possible to use Word2Vec to map the words in a text to a certain category?
From the examples above: if the text contains “flat tire”, can Word2Vec predict the category “car maintenance”, and so on?

  1. Word2Vec will not give you a category, but rather a vector. To map from that vector to a category, you’d still need to perform some sort of classification.

  2. You’ll need a word vector model. There’s the Word2Vec Learner to build such a model, or alternatively you could probably use an existing model and read it with the Word Vector Model Reader. However, I’m skeptical that the latter option will give you any good results, as the existing models are obviously trained on an entirely different domain.

Basically, this approach corresponds to #1 in my above post.
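As a rough illustration of why a word vector alone is not a category, the sketch below averages word vectors into a document vector and then still needs a (very simple) classifier on top, here a nearest-centroid rule trained on a few hand-labeled texts. The 3-dimensional vectors in `WORD_VECTORS` are invented numbers, not output of a real Word2Vec model, and all function names are hypothetical.

```python
import math

# Toy 3-dimensional "word vectors", invented for illustration only.
# A real workflow would take them from a trained word vector model.
WORD_VECTORS = {
    "tire": (0.9, 0.1, 0.0),
    "van": (0.8, 0.2, 0.1),
    "engine": (0.9, 0.0, 0.1),
    "key": (0.1, 0.9, 0.0),
    "book": (0.0, 0.9, 0.1),
    "lost": (0.1, 0.8, 0.1),
    "tired": (0.0, 0.1, 0.9),
    "unwell": (0.1, 0.0, 0.9),
}


def document_vector(text):
    """Average the vectors of all known words; None if no word is known."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train_centroids(labeled_texts):
    """Compute one mean document vector per category from hand-labeled texts."""
    grouped = {}
    for text, category in labeled_texts:
        vec = document_vector(text)
        if vec is not None:
            grouped.setdefault(category, []).append(vec)
    return {
        category: tuple(sum(dim) / len(vecs) for dim in zip(*vecs))
        for category, vecs in grouped.items()
    }


def predict(text, centroids):
    """Return the category whose centroid is most similar to the text."""
    vec = document_vector(text)
    if vec is None:
        return None
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))


# The labels below are the hand-labeling step that cannot be skipped.
labeled = [
    ("flat tire", "car maintenance"),
    ("lost my key", "Lost & found"),
    ("i am tired", "health"),
]
centroids = train_centroids(labeled)
print(predict("my van engine", centroids))
```

Even in this toy setup, the step from vector to category depends on the labeled examples, which is why the approach still corresponds to option 1.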

– Philipp

Hmm. I found out that we have more than 7k predefined categories. It will be a difficult task to label a dataset with such a huge number of categories.

This is tough. Not to sound pessimistic, but this is probably not the right task for an automated classification. Two ideas:

Would it be feasible to considerably reduce the number of categories?

Maybe an unsupervised (i.e. clustering) approach would be a better strategy?

The dataset contains transaction data for around 125 companies. Each company has its own categories: some companies have 350 categories, others 200, 50, 10, 1, etc.

Unfortunately, all categories are needed, so I don’t think an unsupervised (clustering) approach can help here, since we already have a predefined list of categories.

I could partition the transactions by company name, which would restrict the number of categories per partition, but I still don’t know how to classify such data.

Should I follow the active learning approach you suggested in your post above? If yes, what type of learner/predictor should be used?

Here’s an entry point regarding the active learning nodes in KNIME: https://www.knime.com/book/active-learning
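For readers unfamiliar with the idea, one common active learning strategy is uncertainty sampling: the current model picks the records it is least sure about, and a human labels only those. The sketch below is generic Python and not a description of what the KNIME nodes do internally; the probabilities are invented and the function names are hypothetical.

```python
def margin(probabilities):
    """Difference between the two highest class probabilities."""
    top_two = sorted(probabilities, reverse=True)[:2]
    if len(top_two) < 2:
        return top_two[0] if top_two else 0.0
    return top_two[0] - top_two[1]


def select_for_labeling(predictions, k):
    """Return the k texts whose predictions have the smallest margin."""
    ranked = sorted(predictions, key=lambda text: margin(predictions[text]))
    return ranked[:k]


# Hypothetical class probabilities from some already-trained classifier,
# in the order (car maintenance, Lost & found, Security).
predictions = {
    "I want to know how to change flat tire": [0.90, 0.05, 0.05],
    "I can't find my car": [0.10, 0.42, 0.48],
    "I forgot my book on table": [0.05, 0.80, 0.15],
}

print(select_for_labeling(predictions, 1))
```

The record with the two closest top probabilities is selected first, so labeling effort goes to the ambiguous cases instead of the ones the model already handles.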

However, as stated above, this is surely not the right way to tackle your specific problem, given the extremely high number of potential categories.

Rephrasing your problem: you’ll need a two-step classification, first into 125 companies, then into company-specific categories, right? Then I’d consider splitting the classification task into two steps (i.e. 1 + n_companies classifiers). Still, 125 companies and up to 350 categories per company is huge, and a supervised approach will require a significant amount of training data and time.
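The “1 + n_companies classifiers” layout could be sketched as follows. `KeywordClassifier` is a deliberately naive stand-in (it scores labels by word overlap with the training texts) so that the example runs without external libraries; any real learner could take its place. The company names and records are invented.

```python
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class KeywordClassifier:
    """Stand-in classifier: picks the label whose training words overlap most."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)

    def fit(self, examples):
        for text, label in examples:
            self.word_counts[label].update(tokenize(text))
        return self

    def predict(self, text):
        tokens = tokenize(text)
        scores = {
            label: sum(counts[t] for t in tokens)
            for label, counts in self.word_counts.items()
        }
        best = max(scores, key=scores.get, default=None)
        return best if best is not None and scores[best] > 0 else None


def train_two_step(records):
    """records: (company, text, category) triples with hand-assigned labels."""
    # Step 1: one classifier that predicts the company.
    company_model = KeywordClassifier().fit(
        (text, company) for company, text, _ in records
    )
    # Step 2: one classifier per company that predicts its own categories.
    by_company = defaultdict(list)
    for company, text, category in records:
        by_company[company].append((text, category))
    category_models = {c: KeywordClassifier().fit(ex) for c, ex in by_company.items()}
    return company_model, category_models


def predict_two_step(text, company_model, category_models):
    company = company_model.predict(text)
    if company is None:
        return None, None
    return company, category_models[company].predict(text)


# Invented training records with hypothetical company names.
records = [
    ("Garage Co", "flat tire on my van", "car maintenance"),
    ("Garage Co", "my van was stolen", "Security"),
    ("Mall Co", "lost my key in the shop", "Lost & found"),
    ("Mall Co", "forgot my book in the shop", "Lost & found"),
    ("Mall Co", "feeling unwell in the shop", "health"),
]
company_model, category_models = train_two_step(records)
print(predict_two_step("my van has a flat tire", company_model, category_models))
```

If the company is already known for a record, the first step can be skipped and `category_models[company]` used directly, which leaves only the company-specific classifiers to train.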

Are there any further features which you could exploit besides the textual information?

– Philipp

For now, I don’t think so, but there is a relationship: company -> inquiry type -> category.
I have 2.5 million records with the columns [transaction id, company name, inquiry type, inquiry (the text to be used for classification), category (new, empty column to hold the result of the classification)].

Not sure I understood correctly – the ‘company name’ and ‘inquiry type’ are all missing in your dataset?

The only field I want to fill is “category”, based on classifying the text in the “inquiry” field.
My dataset contains the following columns:
1- transaction id (number, has data)
2- company name (string, has data)
3- inquiry type (string, has data)
4- inquiry (string, has data; contains the text that I’m going to use to classify and find the category)
5- category (string, NO data; I want to fill this column with the result of the text classification)