Predict Label from Description

I have several million records in this format:

|Description|Status|Mode|Label|
|Supply of 6KG ABC Fire Extinguishers 80 nos|C|V|00002112|
|SALARY OF B.U. 0907263 FOR JAN-2023|C|V|04022113|
|SALARY OF B.U. 0907060 FOR JAN-2023|D|V|03083113|
|SALARY OF B.U. 0907518 FOR JAN-2023|C|V|93021000|
|SMR IMP BILL BNL 01.11.2022 TO 30.11.2022|D|V|09029518|
|Supply Crimping of Hydraulic hoses|C|V|00002102|
|SALARY OF B.U. 0905330 FOR FEB-2023|C|V|93065200|
|SALARY OF B.U. 0909270 FOR FEB-2023|C|V|11058230|
|Leave salary bill for SK LALSAHEB PF NO 24601975353|C|V|00867002|
|FCC Bill|C|V|00100109|
|Overhauling kit for Governor booster pump motor consisting of the following 5 items 1 Oil seal CR 5068|D|V|05031228|
|local purchase of medicines|C|V|00867002|
|Bill submission|C|V|00867002|
|DIET IMPREST FOR THE PERIOD 04.02.2022 to 22.02.2022|C|V|00867002|
|SALARY OF B.U. 0901833 FOR MAR-2022|D|V|00870921|
|SALARY OF B.U. 0905189 FOR MAR-2022|C|V|07043201|

The Description is free text. Status and Mode are binary categorical fields: Status takes the values C or D, and Mode takes V or C. Label is an 8-character code that is applied to each entry and is based mostly on the Description.

Every month I receive around 50K records in this format with the Label attached, and I want to use a model (which I am trying to build) to verify that the Labels are more or less correctly attached.

What I have done so far:
Description - I converted it into a Document, then applied the POS Tagger, NE Tagger and Wildcard Tagger, and filtered stop words. I could also add Punctuation Erasure, a Lemmatizer and stemming, but have not done so yet.

From here I can create a Bag of Words and apply Term to String, but I do not know how to build a model that predicts the Label from the Description's bag of words, using a node such as Decision Tree Learner or something similar.

Thanks in advance


Can you offer an example of how you define an “incorrectly attached label”, or what steps someone might take to review whether it was correctly attached or not? Do you have secondary lists / tables that can be used to reference definitions in a test?

  • All your (example) data has mode V - what does the mode even mean?

  • What’s the difference between C and D? There are “Salary of X” entries labeled with both C and D. How should the classifier differentiate them?

  • I highly doubt that POS tagging, NE tagging or stemming will give you any benefit, because the descriptions are not real sentences.

Based on the information presented here, it’s difficult to give any further advice, and throwing NLP and decision trees at the problem will likely not get you any further. If you have the necessary domain knowledge, I’d probably start with a hand-crafted, rule-based approach, evaluate, and iterate from there.
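
To make the rule-based suggestion concrete, here is a minimal sketch in Python. The single rule uses the vehicle label from the master-table example given later in this thread; the RULES list and the function names are hypothetical, and real rules would have to come from domain knowledge.

```python
import re

# Hypothetical rule table: (regex pattern, label). Only the vehicle rule is
# taken from the thread; any further rules would need domain knowledge.
RULES = [
    (re.compile(r"\bvehicles?\b", re.IGNORECASE), "00001280"),
]


def rule_label(description):
    """Return the label of the first matching rule, or None if no rule matches."""
    for pattern, label in RULES:
        if pattern.search(description):
            return label
    return None


def check(description, attached_label):
    """Classify a record as 'ok', 'mismatch', or 'no_rule'."""
    expected = rule_label(description)
    if expected is None:
        return "no_rule"
    return "ok" if expected == attached_label else "mismatch"
```

Records that come back as "no_rule" show where the rule set still has gaps, which gives a natural way to evaluate and iterate.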


Yes, I am able to do this with PostgreSQL full-text search and NLTK in Python, but I am wondering how KNIME is different. Thanks for the inputs.

Yes, I do have a small master table of Labels that contains the general descriptions allowed, but it is vaguely defined. Example:

Description | Label
Vehicles | 00001280

But in the real data, people make entries like this:

Description | Label
hiring of vehicle | 00001280
hired vehicles | 00001280
charges for vehicle | 00001280

So what I am trying to achieve is to train a model on past user data that identifies the common keywords appearing in the Description for each Label. In most cases users attach the Label correctly, but since the Description is free text, they phrase it in natural language.

This would be something like stemming and grouping the terms, if any, and extracting the keywords required for classification. Maybe.
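
A rough sketch of that idea in plain Python, using the vehicle entries above as input. The crude_stem function is a deliberately simplistic stand-in for a real stemmer (such as the ones in NLTK), and the stop-word list and function names are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumed, minimal stop-word list; a real one would be much longer.
STOP_WORDS = {"of", "for", "the", "to", "and"}


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokens(description):
    """Lowercase, keep alphabetic words, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", description.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def keywords_by_label(records, top_n=3):
    """For each label, list the stems that occur in the most descriptions."""
    counts = defaultdict(Counter)
    for description, label in records:
        # set() so each stem counts once per description
        counts[label].update(set(tokens(description)))
    return {
        label: [stem for stem, _ in counter.most_common(top_n)]
        for label, counter in counts.items()
    }


records = [
    ("hiring of vehicle", "00001280"),
    ("hired vehicles", "00001280"),
    ("charges for vehicle", "00001280"),
]
print(keywords_by_label(records))
```

With these three entries, "hiring" and "hired" collapse to one stem, and the stem for "vehicle" comes out as the most frequent keyword for the label.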

@vkgautham I can offer a collection of examples about string and address deduplication and matching, as well as fingerprinting, since this sounds like a text similarity task. From my perspective, a solution will need some more planning and domain knowledge from your side.

  • Do you have a ground truth to match against?
  • Do you have historic examples where such a match was done successfully that you could use?
  • How many categories are there?
  • How will you handle the date and time variables?

Maybe you can provide a full set of examples representing your task without revealing any secrets.

Another approach could be to have a model learn the system of categories. There should be examples on the hub about topic detection and learning.
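
As a rough illustration of the fingerprinting and text-similarity idea, the sketch below normalises each description into a key and compares it with the master-table descriptions using Python's standard difflib module. The "Medicines" entry and its label are hypothetical placeholders added only so that there is more than one candidate to choose from.

```python
import re
from difflib import SequenceMatcher


def fingerprint(text):
    """Lowercase, drop punctuation, then sort and de-duplicate the tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(sorted(set(words)))


def best_match(description, master):
    """Return (label, score) for the master entry most similar to description."""
    fp = fingerprint(description)
    scored = [
        (SequenceMatcher(None, fp, fingerprint(name)).ratio(), label)
        for name, label in master.items()
    ]
    score, label = max(scored)
    return label, score


master = {
    "Vehicles": "00001280",   # from the master-table example in the thread
    "Medicines": "XXXXXXXX",  # hypothetical entry with a placeholder label
}

label, score = best_match("hired vehicles", master)
print(label, round(score, 2))
```

A low best score could be treated as "no confident match", but the threshold would have to be tuned on real data.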


As far as I remember, once you have the bag of words you can use the Document Vector node to get your features and then use one of the ML algorithms in KNIME for prediction. Your labels seem to be numbers, so I would first convert them to strings (this is a classification problem, not a regression problem).
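
Outside KNIME, the same pipeline can be sketched in a few lines of plain Python: word counts as features, then a classifier. This is not the KNIME implementation. Naive Bayes is used here only because it is short to write out, and the class and function names are made up for the example. Labels are kept as strings, so leading zeros survive and the task stays a classification problem.

```python
import math
import re
from collections import Counter, defaultdict


def bag_of_words(description):
    """Lowercased alphabetic tokens with their counts."""
    return Counter(re.findall(r"[a-z]+", description.lower()))


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative choice)."""

    def fit(self, descriptions, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(descriptions, labels):
            self.word_counts[label].update(bag_of_words(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, description):
        words = bag_of_words(description)
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word, freq in words.items():
                if word in self.vocab:  # ignore words never seen in training
                    score += freq * math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def flag_suspicious(model, records):
    """Return (description, attached, predicted) where the two labels differ."""
    flagged = []
    for description, attached in records:
        predicted = model.predict(description)
        if predicted != attached:
            flagged.append((description, attached, predicted))
    return flagged


# Tiny training set built from entries quoted in this thread.
train = [
    ("hiring of vehicle", "00001280"),
    ("hired vehicles", "00001280"),
    ("charges for vehicle", "00001280"),
    ("local purchase of medicines", "00867002"),
    ("Bill submission", "00867002"),
]
model = NaiveBayes().fit([d for d, _ in train], [l for _, l in train])
print(flag_suspicious(model, [("purchase of medicines", "00001280")]))
```

Records returned by flag_suspicious are candidates for manual review, not proven errors, since the model itself can be wrong.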

I am working in the directions suggested by @mlauber71 and @Daniel_Weikert and will post what I have achieved, if anything.