Similarity Search / String Similarity

Is it possible to add Jaccard similarity for string comparison? I experimented with n-gram, Jaro and Levenshtein; they work worse than Jaccard for name comparison. The results are about 10% worse. Maybe KNIME can provide a similar component.

Thank you

Not sure if you’re talking about this one here?

If yes, afair the “n-gram overlap” is actually a Jaccard similarity (been a while, but I can double check the code). If it still doesn’t behave as you expect, can you post an example?
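For reference, here is a minimal Python sketch of what a Jaccard similarity over character n-grams looks like. It only illustrates the general technique; whether the node computes its "n-gram overlap" exactly this way (lowercasing, no padding, set rather than multiset) is an assumption, and the function names are made up for this example.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    if len(text) < n:
        # Strings shorter than n are treated as a single gram (assumption).
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def ngram_jaccard(s1, s2, n=2):
    """Jaccard similarity of the character n-gram sets of two strings."""
    return jaccard(char_ngrams(s1, n), char_ngrams(s2, n))


if __name__ == "__main__":
    print(ngram_jaccard("night", "nacht"))        # one shared bigram out of seven
    print(ngram_jaccard("Philipp", "philipp"))    # identical after lowercasing
```

If the node's output differs from this for the same input, that would hint at a different tokenization or normalization.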

If no and you mean a different node, my apologies :slight_smile:

Thanks,
Philipp

Yes, @qqilihq, I mean this node. The lowest value is about .69 with 2-grams and about .78 with 3-grams. At the same time, Microsoft Fuzzy Lookup, which uses the Jaccard algorithm, has a lowest value of .92, and it also matches reversed first and last names.
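To illustrate the reversed-name point, here is a small Python sketch comparing Jaccard over whitespace-separated tokens with Jaccard over character 2-grams. This is not the Microsoft Fuzzy Lookup algorithm, just a plain set-based comparison under the assumption that names are split on whitespace; the helper names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def token_jaccard(s1, s2):
    """Jaccard over whitespace-separated, lowercased tokens (word order is ignored)."""
    return jaccard(set(s1.lower().split()), set(s2.lower().split()))


def bigram_jaccard(s1, s2):
    """Jaccard over character 2-grams of the lowercased strings."""
    def grams(s):
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}
    return jaccard(grams(s1), grams(s2))


if __name__ == "__main__":
    a, b = "John Smith", "Smith John"
    print(token_jaccard(a, b))   # token sets are equal, so the score is 1.0
    print(bigram_jaccard(a, b))  # bigrams spanning the space differ, so the score is lower
```

For this pair the token-based score is 1.0, while the 2-gram score is 7/11 (about 0.64), because the grams that span the space change when the words are swapped.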

Could you post some example data for me?

Thx.

Here’s a comparison of Excel Fuzzy Search with 2/3-gram similarities.

Cool, thanks. I’ll have a look!

Hi both, @qqilihq

Have you come up with a solution that compares to Microsoft Fuzzy Lookup? I need to perform fuzzy grouping of names, but I am finding it difficult with the workflows that are available online…

I remember doing some research on how the MS fuzzy algorithm works, but there’s not much documentation to be found. I’d guess they probably use some word corpus for weighting the terms. In the end I did not have time to further evaluate or reverse-engineer this.

If you can describe your task in more detail, some suggestions will come up.
The simplest approach is to make a cross join and then filter on the similarity value from the String Similarity node output.
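Outside of KNIME, the same cross-join-then-filter idea can be sketched in a few lines of Python. The similarity function here is a character 2-gram Jaccard chosen purely for illustration, and the threshold of 0.5 and the function names are assumptions, not values taken from the node.

```python
from itertools import product


def bigrams(text):
    """Set of character 2-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def similarity(a, b):
    """Jaccard similarity of the 2-gram sets of two strings."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def similar_pairs(left, right, threshold=0.5):
    """Cross join both lists, keep pairs scoring at or above the threshold."""
    pairs = []
    for a, b in product(left, right):  # every combination: the cross join
        score = similarity(a, b)
        if score >= threshold:         # the filter step
            pairs.append((a, b, score))
    # Best matches first
    return sorted(pairs, key=lambda p: p[2], reverse=True)


if __name__ == "__main__":
    left = ["Jon Smith", "Mary Jones"]
    right = ["John Smith", "Maria Jones"]
    for a, b, score in similar_pairs(left, right):
        print(f"{a!r} ~ {b!r}: {score:.2f}")
```

Note that the cross join compares every row with every other row, so the number of comparisons grows with the product of the two table sizes; for large name lists some kind of pre-grouping (for example by first letter) is usually needed.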

I’ve recently found this: https://www.microsoft.com/en-us/research/wp-content/uploads/2003/01/bm_sigmod03.pdf


Thanks, this looks like a good starting point. I’ll put it in our Palladian backlog – maybe something to implement and offer in the future.
