Rule-based categorization of strings

Hello,

I am new to KNIME and have only a little experience with it.

I want to create a new column called “main target”.
The column’s content should depend on whether the string in another column called “contract” matches a keyword.
I would like to use a dictionary that lists all potential matches and the value to assign when one of them matches.
I have many different keywords to check, and the number of keywords will keep growing.

I saw the Rule-based nodes where PMML can be used. I have zero experience with PMML.

How can I build this with KNIME?

There are dictionary nodes in KNIME. Maybe you could check those out?
br

Perhaps this node can do the job: https://hub.knime.com/knime/extensions/org.knime.features.base/latest/org.knime.base.node.preproc.binnerdictionary.BinByDictionaryNodeFactory
Use a “Create Table” node to create the dictionary and fill it by hand for testing. Or use any other data source as you wish (e.g. Excel Reader, CSV, DB, etc.).
If the node above doesn’t help, one of the nodes listed under “dictionary” on the KNIME Community Hub should :wink:

@wantti if it is an exact match, could you just use a left join?

Thank you all for the quick help and suggestions.
I went through all possibilities.
It turned out that the “String Replace (Dictionary)” node fits best, but it works only for exact matches. How can I use the dictionary with regex instead of an exact match? That would be the solution.

Sounds like a multi-step solution to me, meaning you have to break it into smaller steps, use additional columns to store (temporary) results, and work toward a solution that way. Perhaps first “normalize” your values so that the dictionary gets a hit in a second step?
This one could also be worth a try: https://hub.knime.com/knime/extensions/org.knime.features.ext.textprocessing/latest/org.knime.ext.textprocessing.nodes.tagging.dict.wildcard.WildcardTaggerNodeFactory2
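
To illustrate the two-step idea, here is a minimal sketch in plain Python (it could be adapted to a scripting node, but it is not tied to any KNIME API). The dictionary entries, the `normalize` rules, and the function names are made up for the example; adjust them to your data.

```python
import re

# Hypothetical dictionary: normalized keyword -> category.
DICTIONARY = {
    "tire service": "tires",
    "body repair": "body",
}

def normalize(text):
    """Step 1: lowercase, turn punctuation into spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def lookup(text, default="unknown"):
    """Step 2: exact dictionary lookup on the normalized value."""
    return DICTIONARY.get(normalize(text), default)
```

With this, `lookup("  Tire-Service ")` and `lookup("tire service")` hit the same dictionary entry, even though the raw strings differ.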

Hello,

the Wildcard Tagger sounds promising, but I have no clue how to use it.
I found no documentation on how to connect it. I tried CSV readers and tables, but I cannot get into the configuration dialog to start understanding how it works.

The two step approach you mentioned is a good idea, but my case is the following:

It is about categorizing companies. I extract thousands of companies without knowing exactly what they do. They should all be involved in car repair in some way, but some specialize in tires, others in body repairs, etc.
Therefore, I am trying to categorize these companies to get a better picture of their business type. I have a list of URLs whose business type I know. Whenever an “unspecified” company matches one of the “specified” URLs, it should be assigned that URL’s category name.
Indeed, I could normalize the URLs to the bare domain name and compare those. Still, regex would be helpful for the cases where part of the domain is enough to categorize.

I still wonder why there is no possibility to use the string manipulation dictionary in combination with regex. That would be very powerful.
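
For reference, this is roughly the behavior I am after, sketched in plain Python rather than with KNIME nodes. The patterns, category names, and the helpers `domain_of` and `categorize` are only illustrative assumptions; in practice the rules would come from the dictionary table.

```python
import re
from urllib.parse import urlparse

# Hypothetical rules: (regex pattern, category). The first match wins.
RULES = [
    (r"tire|tyre", "tires"),
    (r"body|paint", "body repair"),
]
COMPILED = [(re.compile(p, re.IGNORECASE), c) for p, c in RULES]

def domain_of(url):
    """Return the lowercased host name without a leading 'www.'."""
    if "//" not in url:
        url = "//" + url  # lets urlparse treat a bare domain as the host
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def categorize(url, default="unspecified"):
    """Return the category of the first rule whose pattern occurs in the domain."""
    domain = domain_of(url)
    for pattern, category in COMPILED:
        if pattern.search(domain):
            return category
    return default
```

The rules are kept in a list so that their order decides which one wins when several patterns match, and new keywords can be appended without touching the matching logic.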

The Wildcard Tagger needs a document as input, so you first need to convert your CSV string values to documents (Strings To Document node).
br
