Rule-based categorization of strings

wantti · January 26, 2023, 3:51pm

Hello,

I am new to KNIME and have only a little experience with it.

I want to create a new column called “main target”.
The column’s content should depend on whether a string in another column called “contract” matches a keyword.
I would like to use a dictionary that lists all potential matches together with the value to assign when a match occurs.
I have many different keywords to check, and the number of keywords will keep growing.

I saw the Rule-based nodes where PMML can be used. I have zero experience with PMML.

How can I build this with KNIME?

Daniel_Weikert · January 26, 2023, 5:27pm

There are dictionary nodes in KNIME. Maybe you could check those out?
br

McReady · January 26, 2023, 5:55pm

Perhaps this node can do the job: https://hub.knime.com/knime/extensions/org.knime.features.base/latest/org.knime.base.node.preproc.binnerdictionary.BinByDictionaryNodeFactory
Use a “Create Table” node to create the dictionary and fill it by hand for testing. Or use any other data source as you wish (e.g. Excel Reader, CSV, DB, etc.).
If the node above doesn’t help, one of these should: “Nodes: dictionary – KNIME Community Hub”

mlauber71 · January 26, 2023, 11:29pm

@wantti if it is an exact match, can you just use a left join?
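
To illustrate what the left join would do for exact matches, here is a minimal Python sketch. It is not a KNIME workflow; the dictionary entries, the sample rows, and the helper name `left_join` are all made up for illustration. Only the column names “contract” and “main target” come from the question.

```python
# Hypothetical dictionary table: contract value -> main target.
dictionary = {
    "tire-service": "tires",
    "body-shop": "body repairs",
}

# Hypothetical input rows; only the "contract" column matters here.
rows = [
    {"contract": "tire-service"},
    {"contract": "body-shop"},
    {"contract": "unknown-garage"},
]


def left_join(rows, dictionary, key="contract", target="main target"):
    """Return copies of the rows with the target column filled by exact lookup.

    Rows without a dictionary entry get None, similar to the missing
    values a left join produces for unmatched rows.
    """
    return [{**row, target: dictionary.get(row[key])} for row in rows]


joined = left_join(rows, dictionary)
```

Every input row is kept, and only rows whose “contract” value appears verbatim in the dictionary receive a category. That is why this approach stops working once partial or pattern matches are needed.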

wantti · January 30, 2023, 2:48pm

Thank you all for the quick help and suggestions.
I went through all the possibilities.
It turned out that the “String Replace (Dictionary)” node suits me best, but it only works for exact matches. How can I use the dictionary with regex instead of an exact match? That would be the solution.
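
As a rough sketch of what a “dictionary with regex” could look like, the following Python snippet keeps an ordered list of (pattern, category) pairs and returns the category of the first pattern that matches. The patterns, categories, and the function name `categorize` are hypothetical. This is not how any KNIME node is implemented, although similar logic could probably be adapted to a scripting node.

```python
import re

# Hypothetical rules: (regular expression, category).
# Order matters: the first pattern that matches wins.
RULES = [
    (r"tire|tyre|wheel", "tires"),
    (r"body|paint|dent", "body repairs"),
]

# Compile once so growing rule lists stay cheap to apply per row.
COMPILED = [(re.compile(pattern, re.IGNORECASE), category)
            for pattern, category in RULES]


def categorize(text, default="unspecified"):
    """Return the category of the first rule whose pattern occurs in text."""
    for pattern, category in COMPILED:
        if pattern.search(text):
            return category
    return default
```

Because the rules live in a plain list, new keywords can be appended without touching the matching logic, which fits the requirement that the number of keywords will keep growing.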

McReady · January 30, 2023, 4:06pm

Sounds like a multi-step solution to me, meaning you have to split it into smaller steps, use additional columns to store (temporary) results, and work towards a solution that way. Perhaps first “normalize” your values so that the dictionary hits in a second step?
This one could also be worth a try: https://hub.knime.com/knime/extensions/org.knime.features.ext.textprocessing/latest/org.knime.ext.textprocessing.nodes.tagging.dict.wildcard.WildcardTaggerNodeFactory2

wantti · February 20, 2023, 9:48am

Hello,

the Wildcard Tagger sounds promising, but I have no clue how to use it.
There is no documentation on how to connect it. I tried it with CSV readers and tables, but I cannot get into the configuration to start understanding how it should work.

The two-step approach you mentioned is a good idea, but my case is the following:

It is about categorizing companies. I extract thousands of companies without knowing exactly what they do. They should all be engaged in car repairs in some way, but some of them specialize in tires, others in body repairs, etc.
Therefore, I am trying to categorize these companies to get a better picture of their business type. I have a list of URLs whose business type I know. Whenever an “unspecified” company matches one of the “specified” URLs, it should be assigned that URL’s category name.
Indeed, I could normalize the URLs to the bare domain name and compare those. Still, regex would be helpful for the cases where a part of the domain is enough to categorize.
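
The combination described here, an exact comparison on the normalized domain with a regex fallback on parts of the domain, might be sketched in Python as follows. All domains, patterns, and function names are hypothetical placeholders, and the normalization (lower-casing the host and dropping a leading “www.”) is only one possible definition of “bare domain”.

```python
import re
from urllib.parse import urlsplit

# Hypothetical list of known ("specified") domains and their categories.
KNOWN_DOMAINS = {
    "mueller-garage.example": "body repairs",
}

# Hypothetical fallback rules applied to parts of the domain.
PARTIAL_RULES = [
    (re.compile(r"tire|tyre"), "tires"),
    (re.compile(r"body|paint"), "body repairs"),
]


def bare_domain(url):
    """Reduce a URL to its lower-case host name without a leading 'www.'."""
    # urlsplit only recognizes the host when the URL has a scheme or '//'.
    if not re.match(r"^(?:[a-zA-Z][a-zA-Z0-9+.-]*:)?//", url):
        url = "//" + url
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def categorize_url(url, default="unspecified"):
    """Try an exact domain lookup first, then the partial regex rules."""
    domain = bare_domain(url)
    if domain in KNOWN_DOMAINS:
        return KNOWN_DOMAINS[domain]
    for pattern, category in PARTIAL_RULES:
        if pattern.search(domain):
            return category
    return default
```

Doing the exact lookup first keeps the known URLs authoritative, while the regex rules only decide the cases that the list does not cover.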

I still wonder why it is not possible to use the string manipulation dictionary in combination with regex. That would be very powerful.

Daniel_Weikert · February 20, 2023, 5:17pm

The Wildcard Tagger needs a document, so you first need to convert your CSV string values to documents (Strings To Document node).
br

system · May 21, 2023, 5:17pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.