How identify similar words based on length

Hello all,

I would like to determine related values based on a list of values.

Example:

  • Jacket
  • men’s jacket
  • Women’s jacket
  • Driver jacket
  • Additional jacket
  • Jacket add-on
  • Second jacket additional
    => All these matches should be grouped

I imagine that a length of characters is defined which must be identical and follow each other (example: 3 characters). If the hits are too imprecise, I would increase the number of identical characters.

Output of the identical values then in a separate column. Here I could then filter the identical values.

Does anyone here have an idea how best to implement?

Many thanks already.

Hello @Spacesurfer,

and welcome to KNIME Community!

There are couple of ways to address this in KNIME. Do you know your categories/grouping values? And do you have possible list of values associated with them? If you are able to share some data (don’t share confidential data please; dummy works just fine) that would be great as we all love data here :wink:

Br,
Ivan

1 Like

Ideally I don’t want to use categories/grouping values as a basis, so that I don’t have to work them out manually.

Attached is an example of data that I would like to summarize.

Thanks a lot

Processing: Example.csv…
Example.xls (23 KB)

There is a string similarity node. Maybe you could use this to calculate the distance aka similiarty of your strings
br

2 Likes

Hello @Spacesurfer,

I can’t read data into KNIME. Looks like upload wasn’t finished and you already replied. Can you give it another try?

Br,
Ivan

Hi @ipazin , it’s not that the file is not finished, it’s just an old Excel format. If you try to open in Excel, it will open. But Knime can’t read this file. As an alternative, after opening it in Excel, you can do a Save As .xlsx, and then have Knime read that new .xlsx file. This will work.

In terms of solution to the challenge here, @Spacesurfer in theory, you could just extract terms or words, and define common words such as “women’s” or “men’s” or “additional” etc, and once you extract the words of each line, you can then ignore/remove the common words that were extracted. You will be left with the essential values. You can then do a count and see what’s most popular.

1 Like

I see. Tnx @bruno29a. I could do that but I don’t have Office. So whenever someone shares Excel file I use KNIME to check its content :sweat_smile:
Ivan

Hi @ipazin , no problem. Here’s the converted file: Example.xlsx (9.9 KB)

1 Like

Hello Daniel,

unfortunately, the string similarity node only compares the content of two columns and shows the distance.

I am looking for a solution that shows me which similarity individual cells in the column have and groups them via a characteristic.

Any other ideas?

Thank you and best regards
Christian

Hello bruno29a,

which node would you use to do this? I think this could bring me closer to the solution.

Thank you and best regards
Christian

Hi @Spacesurfer , there are a few ways to do this - that’s how it usually is with Knime.

I think the main challenge here is to determine/build the list of values/groups. Once this is done, the rest is pretty much easy.

Also, an additional challenge is that your data is not as clean. For example:

Row0  | 2.  Fahrerjacke (Herren/Damen)
Row23 | 1x zusätzliche  Fahrerjacke (Herren/D
Row37 | 2.  Fahrerjacke zusätzlich (Damen/Her

It looks like some data got cut off. Or it also looks like there might be some cases of Herren/Damen vs Damen/Herren. It’s up to you how you want to handle these types of “dirty” data - either fix them manually or add them to the ignore list (see further down for ignore list).

And an another additional challenge for me in particular is that I’m not in that field (clothing/jacket) and also I don’t read German, so I can’t really determine what’s common name vs the values that we want.

Nevertheless, I’ve put something together that can give you an idea of which direction to go, and you can tweak it based on your knowledge of German and clothing/jacket.

Let’s start building the values (I call them “terms”):

First step is to try to clean the data as much as possible:

strip(replace(replaceChars(lowerCase(strip($A$))
  , ".,;&", " ")
  , "  ", " ")
)

Results:

I extract each word (a word being something separated by a space):

And I merge all of the extracted words into 1 column:

I then apply some base rules to filter out some values:


MISSING $Terms$ OR $term length$ < 3 OR $Terms$ IN ("", "-") OR $Terms$ MATCHES "^[0-9].*" => TRUE

So, I’m filtering out any missing values, values whose length is less than 3, removing empty values, removing values that are equal to “-”, and finally removing values that start with a number - Note: You can tweak this part. I basically checked what the results of the previous node was, and tried to filter out what I think do not make a proper term.

After that, I count the number of occurences of each term - this will also show me what the unique terms are:

I choose to assume that anything that occurred only once does not qualify as a group - you might try to ignore and not apply this rule first and see what results that you have in case that this is not a correct assumption; this will also depend on what are the keywords you want:


Results:

I highlighted some terms together as they might actually mean the same thing - again, it’ll be up to you how you want to handle these (you could create a mapping table to have them renamed to the same name)

At this point, I can create my ignore list for additional filtering. There are 2 ways I would consider here. Either you can create a table manually via a Table Creator, especially if you are going to re-use the workflow and you can add new words to the table - you can tweak this, and you probably will as you most likely might not get it perfect the first time:
image

Or you can do this interactively via a Table View where you can select what to keep or remove:

Pros and cons:
Table list is tedious to manually add the terms, but you will only to do this once as it is permanent, meaning if you re-run the workflow, you don’t need to re-add the terms, while the interactive option, easier to click to select than manually entering the terms, but will reset if you re-run the workflow, meaning you will have to re-select the terms.

I’ll include both options. Then it’s just a matter of matching the terms.

The final workflow looks like this:

So, the Column Filter (Node 15) has the same results as the 2nd output port of the Joiner (Node 16). They are the 2 options of either creating a list via a Table Creator or selecting via the Table View (obviously they will be the same of the same terms are selected).

Here are the results after matching:

You will get duplicates in my case, as some of the items match more than 1 group. For example, “jackets” will match both “jacket” and “jackets”. Or “mats and jacket” will match both “mats” and “jacket”. You can tweak it based on your needs. For example, you may just combine both “jacket” and “jackets” under the same term, in which case you can add “jackets” as an ignore word, or you can simply apply a duplicate row filter and choose only 1 match per item. I’ll leave that part to you.

Here’s the workflow (Excel file included): Identify groups based on similar words.knwf (159.2 KB)

2 Likes

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.