Fuzzy match rows in one column

Hi,

I have attached some example data and the expected outcome (second worksheet).

I want to compare the string values in one column to see whether there are similarities among them. Please note that my original file has thousands of lines and a multitude of descriptions that I’m not really aware of.

My goal is a first attempt to categorize/group similar lines together for me to further analyze the content of the file.

What would be the best way to handle this?

I already tried the String Matcher and Similarity Search nodes, using the same column as both the source and the comparison column. But they only pick up the exact same values. I want to find similar values, not just exact matches.
Recharge examples.xlsx (10.6 KB)

You can try

node.

I tried that one, but I don’t see how it would help. I only have one column of data and I want to have it grouped somehow. The String Similarity node compares two columns, which I don’t have :(

Hi @robvp
You could take @izaychik63’s idea, send the same data into both inputs of the String Similarity node, and increase the neighbor count.
Then you get more than just the identical value as a match.

br

I can’t get it to work. I always get 1.0, as you can see in the screenshot :(

If you feed the same data in twice, you will always get 100% similarity for the same record. If you take more neighbors into account, you can filter out the 100% match and take the second one.
br
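
For anyone who wants to see the "skip the self-match, take the next neighbor" idea outside of KNIME, here is a minimal Python sketch. It is only an illustration under some assumptions: it uses `difflib.SequenceMatcher` from the standard library as the similarity measure (not the measure the KNIME nodes use), the helper names `similarity` and `best_other_match` are made up for this example, and the sample rows are invented, not taken from the attached file.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_other_match(values):
    """For each row, return (value, closest other row, score).

    The row itself is skipped by position, so the trivial 1.0 self-match
    never shows up as the best neighbor.
    """
    results = []
    for i, value in enumerate(values):
        best_score, best_value = 0.0, None
        for j, other in enumerate(values):
            if i == j:
                continue  # same row: this is the 100% match we want to ignore
            score = similarity(value, other)
            if score > best_score:
                best_score, best_value = score, other
        results.append((value, best_value, best_score))
    return results


# Invented sample rows, for illustration only
rows = ["Recharge office rent", "Recharge office rent Q1", "Travel costs"]
for value, match, score in best_other_match(rows):
    print(f"{value!r} -> {match!r} ({score:.2f})")
```

Note that rows are skipped by position, not by value, so a true duplicate elsewhere in the column would still be reported with a score of 1.0. The loop compares every row with every other row, which may get slow on thousands of lines.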

@robvp I once created this workflow that would group addresses without a ground truth against which to match. Maybe you can adapt that.

If you apply this it looks something like this:

You can edit the threshold that determines what counts as similar (see Similarity Search – KNIME Community Hub) and maybe also configure the method.
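
To make the effect of a threshold more concrete, here is a rough Python sketch of threshold-based grouping. It is not what the workflow does internally. It assumes the threshold is a maximum distance (1 minus the `difflib.SequenceMatcher` ratio), which may differ from how the Similarity Search node defines it, and it links rows transitively, so two rows end up in the same group if a chain of close-enough rows connects them. The names `distance` and `group_similar` and the sample addresses are made up for this example.

```python
from difflib import SequenceMatcher


def distance(a: str, b: str) -> float:
    """Case-insensitive distance in [0, 1]; 0.0 means identical strings."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_similar(values, max_distance):
    """Group rows whose pairwise distance is at most max_distance.

    Uses a small union-find structure so that links are transitive.
    """
    parent = list(range(len(values)))

    def find(i):
        # Follow parent links to the group representative (with path halving)
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if distance(values[i], values[j]) <= max_distance:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i, value in enumerate(values):
        groups.setdefault(find(i), []).append(value)
    return list(groups.values())


# Invented sample addresses, for illustration only
addresses = ["Main Street 12", "Main Str. 12", "Park Avenue 5", "Park Ave 5"]
for group in group_similar(addresses, max_distance=0.25):
    print(group)
```

A larger `max_distance` accepts looser matches and merges more rows into the same group. With `max_distance=0.0`, only identical strings (ignoring case) are grouped.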

If you set the threshold to 0.33 (instead of 0.25) the result would be this:

What you could also try is changing the order of the words, so that similar words end up in different positions.
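
One possible reading of the word-order idea, sketched in Python: sort the words of each string alphabetically before comparing, so that the same words in a different order compare as equal. This is my interpretation and not part of the workflow. `normalize` and `similarity` are hypothetical helper names, `difflib.SequenceMatcher` stands in for whatever distance measure you configure in KNIME, and the two strings are invented.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase the text and sort its words alphabetically."""
    return " ".join(sorted(text.lower().split()))


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


# Invented sample strings, for illustration only
a = "Office rent recharge"
b = "Recharge office rent"

raw_score = similarity(a.lower(), b.lower())             # word order matters
sorted_score = similarity(normalize(a), normalize(b))    # word order ignored

print(f"raw: {raw_score:.2f}, words sorted: {sorted_score:.2f}")
```

After sorting, both strings become "office recharge rent", so they compare as identical, while the raw comparison scores lower because the words sit in different positions.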

String Deduplication without Ground Truth - KNIME Forum (75366).knwf (192.9 KB)