Fuzzy match rows in one column

Hi,

I have attached some example data and the expected outcome (second worksheet).

I want to compare the string values in one column to see whether there are similarities among them. Please note that my original file has thousands of lines and a multitude of descriptions that I’m not really aware of.

My goal is a first attempt to categorize/group similar lines together for me to further analyze the content of the file.

What would be the best way to handle this?

I already tried the String Matcher and Similarity Search nodes, with the same column as both the source and the comparison column. But they only pick up values that are exactly the same. I want to find similar values, not exact matches.
Recharge examples.xlsx (10.6 KB)

You can try the String Similarity node.

I tried that one, but I don’t know how this would help. I only have one column with data and I want to have that grouped somehow. The String similarity is comparing two columns, which I don’t have :frowning:

Hi @robvp
You could take @izaychik63’s idea, send the same data into string similarity, and increase the neighbor count.
Then you get more than just the identical row as a match.

br

I can’t get it to work. I always get 1.0 as you can see in the screenshot :frowning:

If you feed the same data in twice, you will always get 100% similarity for the same record. If you take more neighbors into account, you can filter out the 100% match and take the second one.
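
A minimal Python sketch of this idea, outside KNIME: compare the column against itself, skip each row's own position, and keep the best remaining match. `difflib.SequenceMatcher` is only a stand-in for whatever similarity measure the node uses, and the sample rows and the `nearest_other` helper are hypothetical.

```python
from difflib import SequenceMatcher


def nearest_other(values):
    """For each value, return (value, most similar OTHER value, score)."""
    result = []
    for i, v in enumerate(values):
        best, best_score = None, -1.0
        for j, w in enumerate(values):
            if i == j:
                continue  # skip the row itself, which would always score 1.0
            score = SequenceMatcher(None, v.lower(), w.lower()).ratio()
            if score > best_score:
                best, best_score = w, score
        result.append((v, best, best_score))
    return result


# Hypothetical sample data, not taken from the attached file.
rows = ["Recharge phone", "Recharge phones", "Office rent"]
pairs = nearest_other(rows)
for value, match, score in pairs:
    print(f"{value!r} -> {match!r} ({score:.2f})")
```

Rows are skipped by position rather than by value, so true duplicates elsewhere in the column would still show up with a score of 1.0.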
br

@robvp I once created this workflow that would group addresses without a ground truth against which to match. Maybe you can adapt that.

If you apply this it looks something like this:

You can edit the threshold which would constitute a similarity (Similarity Search – KNIME Community Hub) and maybe also configure the method.

If you set the threshold to 0.33 (instead of 0.25) the result would be this:

You could also try changing the order of the words, so that similar words end up in other positions, and see how that affects the matches.
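
One possible way to experiment with word order, sketched in Python rather than in KNIME: sort the words of each string before comparing, so that two descriptions with the same words in a different order become identical. This is an assumption about how the suggestion could be applied, not what the attached workflow does; `token_key` is a hypothetical helper and `difflib` is a stand-in for the node's distance measure.

```python
from difflib import SequenceMatcher


def token_key(text):
    """Lowercase the text, split it on whitespace, and sort the words."""
    return " ".join(sorted(text.lower().split()))


def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical sample descriptions.
a, b = "Recharge phone", "phone recharge"

raw = similarity(a.lower(), b.lower())
normalized = similarity(token_key(a), token_key(b))
print(raw, normalized)
```

The trade-off is that word order no longer carries any information, which may or may not suit the data.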

String Deduplication without Ground Truth - KNIME Forum (75366).knwf (192.9 KB)

@robvp any thoughts on my suggestion/workflow?

Honestly I was overwhelmed by your workflow. I didn’t know where to start…My skills are not that advanced :frowning:

@robvp I had hoped that the descriptions given in the workflow might have helped. If you want you can ask further questions or I can try to add more information. My impression was that this approach might be able to help you - if the task still is relevant.

The examples could also be adapted for future tasks involving the matching of similar strings.

I would also recommend looking at the conversations about topic extraction.

The task has become obsolete in the meantime, but I still want to understand this because it can benefit me in future situations.

I’m always getting an empty table at the Rule-based Row Filter in the workflow. The logic in there is TRUE => FALSE, so it makes sense. Am I missing something?
This is deliberate: it creates an empty table before the loop, which then stores the intermediate results.

Ah, clear!
I’m running it now, but in total it’s over 200,000 lines. Would there be a possibility to speed things up? Judging by the current speed, it might take more than a day…

@robvp the problem could be that, since this approach works without a ground truth, it has to search all possible matches (the ones not yet taken) in each iteration, so it always needs the whole pool of remaining options.
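
To illustrate why this gets slow, here is a rough Python sketch of a greedy grouping loop of that kind. It is not the logic of the attached workflow, only an assumed simplification: each iteration takes one seed and scans every remaining row, so the work grows roughly quadratically with the number of rows. The threshold is treated here as a maximum distance, `difflib` stands in for the KNIME distance measure, and the sample rows are made up.

```python
from difflib import SequenceMatcher


def distance(a, b):
    """0.0 means identical, 1.0 means nothing in common (difflib stand-in)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def greedy_groups(values, threshold):
    """Group rows around successive seeds; no ground truth is needed."""
    remaining = list(values)
    groups = []
    while remaining:
        seed = remaining.pop(0)
        group, rest = [seed], []
        for v in remaining:  # every iteration scans the whole remaining pool
            if distance(seed, v) <= threshold:
                group.append(v)
            else:
                rest.append(v)
        groups.append(group)
        remaining = rest
    return groups


# Hypothetical sample data.
rows = ["Recharge", "Recharges", "Recharge cards", "Office rent"]
strict = greedy_groups(rows, 0.25)
loose = greedy_groups(rows, 0.33)
print(strict)
print(loose)
```

With these rows, the looser 0.33 threshold pulls "Recharge cards" into the first group, while 0.25 leaves it on its own.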

If you had a list of ground truths against which to match, you could try to parallelise it.
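
A small Python sketch of why a ground truth helps: each row is matched against a fixed reference list independently of all other rows, so the rows can be handed out to workers in any order. The reference list, the cutoff of 0.6, and the helper names are hypothetical. A thread pool is used only to keep the example simple; for CPU-bound matching in Python, `ProcessPoolExecutor` offers the same `map` interface.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import get_close_matches

# Hypothetical ground-truth list; in practice this would be a curated table.
GROUND_TRUTH = ["Recharge phone", "Office rent"]


def match_one(value, cutoff=0.6):
    """Return (value, best ground-truth match or None if below the cutoff)."""
    hits = get_close_matches(value, GROUND_TRUTH, n=1, cutoff=cutoff)
    return value, (hits[0] if hits else None)


def match_all(values, workers=4):
    """Match every row independently; map() keeps the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(match_one, values))


matched = match_all(["Recharge phones", "Ofice rent", "Taxi"])
print(matched)
```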

You might also want to check the general options for speeding up your KNIME experience: