Replacing a phrase or term using a part of that term

I want to replace several phrases with different words, but I don’t want to enter each phrase in full; I want to match them by a part of the phrase. For example, I want to replace “reading book”, “reading box”, and “read a book” with “entertainment”, and I want to match all of these phrases using only “read”. How can I do that?
Thank you

Hi there,

This sounds like an interesting problem. Attached is a small example workflow that might solve it. The main idea is to first create a Bag of Words and then perform an inner join between the bag of words and a lookup table (word <-> category, e.g. read and entertainment). In addition, I do a bit of text pre-processing, e.g. stemming and converting everything to lower case.
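For readers who want to see the idea outside KNIME, here is a minimal Python sketch of the same steps: lower-casing, a very crude stemming step, a bag of words per phrase, and an inner join against a lookup table. It is not what the attached workflow does internally; the names `LOOKUP`, `crude_stem`, `bag_of_words`, and `categorize` are made up for this illustration, and the suffix stripping is only a stand-in for a real stemmer.

```python
import re

# Hypothetical lookup table: stemmed keyword -> category.
LOOKUP = {"read": "entertainment"}


def crude_stem(token):
    """Rough stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def bag_of_words(phrase):
    """Lower-case the phrase, split it into words, and stem each word."""
    tokens = re.findall(r"[a-z]+", phrase.lower())
    return {crude_stem(t) for t in tokens}


def categorize(phrases):
    """Inner join: keep only (phrase, term, category) rows whose term is in LOOKUP."""
    rows = []
    for phrase in phrases:
        for term in sorted(bag_of_words(phrase)):
            if term in LOOKUP:
                rows.append((phrase, term, LOOKUP[term]))
    return rows


result = categorize(["reading book", "reading box", "read a book", "cooking"])
```

Because the join is an inner join, phrases with no matching keyword (such as "cooking" here) simply drop out of the result.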

If you share a bit more information about your dataset, we can try to adapt this idea to your problem.


ExampleForum.knwf (10.7 KB)


Thank you very much for your help, I will try this on my data set.
My data set is a list of different user interests that I want to assign to interest categories, based on the interest categories I found in Yahoo and Google AdWords. I will attach both my data set and the interest categories, as well as the workflow I created to extract the unique interests.
interest to categories.knwf (51.3 KB)
Data-1K-interests.xlsx (45.9 KB)
Yahoo interest category.xlsx (22.3 KB)


Hi Narges,

After reading your question and further explanations, and checking the workflow @Kathrin provided and the datasets you uploaded, I think the hard part (now that @Kathrin has provided the workflow) is creating the dictionary. You want to map many different sub categories of interests to a few super categories. Kathrin’s workflow finds the terms in each sub category, so what you still need is a dictionary, which I think you can create from a dataset you already have, using Kathrin’s workflow plus a few extra nodes.

You already have a dataset containing super categories and sub categories (not the terms, but whole phrases). Use a “Column Filter” node to feed the sub categories to the “String to Document” node in Kathrin’s workflow. Then, after the “Term to String” node, add a “String Manipulation” node with this function: regexMatcher($Term as String$,"[a-zA-Z]{2}.*" ), which labels the terms starting with at least 2 letters as true and the rest as false. After that, use a “Row Filter” node to remove the rows with the value false. Then join the output of that Row Filter node with the initial dataset containing both super and sub categories, using the sub category column as the joining column, and keep the term and super category columns. Finally, use a “GroupBy” node to group the data by term and aggregate the super categories by mode. (I guess you will also have to exclude some general terms to create an appropriate dictionary.)

Now you have your dictionary and you can use it to feed the joiner node in Kathrin’s workflow.
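The dictionary-building steps described above can also be sketched in plain Python, in case it helps to see the logic in one place. This is only an illustration under some assumptions: the rows in `CATEGORIES` are invented sample data (not taken from the uploaded spreadsheets), `build_dictionary` is a hypothetical helper, terms are split on whitespace instead of going through the KNIME text-processing nodes, and the regex is applied to the whole term.

```python
import re
from collections import Counter, defaultdict

# Invented sample rows of (super category, sub category).
CATEGORIES = [
    ("Entertainment", "Reading books"),
    ("Entertainment", "Books & magazines"),
    ("Shopping", "Used books"),
    ("Entertainment", "3D movies"),
]

# Same pattern as in the String Manipulation step: at least two leading letters.
STARTS_WITH_TWO_LETTERS = re.compile(r"[a-zA-Z]{2}.*")


def build_dictionary(rows):
    """Map each term to the super category it occurs with most often (the mode)."""
    votes = defaultdict(Counter)
    for super_cat, sub_cat in rows:
        for term in sub_cat.lower().split():
            # Drop terms such as "&" or "3d" that do not start with two letters.
            if STARTS_WITH_TWO_LETTERS.fullmatch(term):
                votes[term][super_cat] += 1
    # Group by term and aggregate the super categories by mode.
    return {term: counts.most_common(1)[0][0] for term, counts in votes.items()}


dictionary = build_dictionary(CATEGORIES)
```

In this sample, "books" appears twice under Entertainment and once under Shopping, so the mode assigns it to Entertainment. General terms that occur across many categories would still need to be excluded by hand.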


