Similarity Search / String Similarity

izaychik63 · March 12, 2021, 7:17pm

Is it possible to add Jaccard similarity for string comparison? I experimented with n-gram, Jaro and Levenshtein; they work worse than Jaccard for name comparison. The results are about 10% worse. Maybe KNIME can provide a similar component.

Thank you

qqilihq · March 12, 2021, 8:17pm

Not sure if you’re talking about this one here?

If yes, afair the “n-gram overlap” is actually a Jaccard similarity (been a while, but I can double check the code). If it still doesn’t behave as you expect, can you post an example?

If no and you mean a different node, my apologies
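For reference, here is a minimal Python sketch of what a Jaccard similarity over character n-grams looks like in general. It is an illustration only, not the node's actual code; the function names `char_ngrams` and `jaccard_ngram` are made up for this example, and lowercasing the input is an assumption.

```python
def char_ngrams(text: str, n: int = 2) -> set:
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    if len(text) < n:
        # Strings shorter than n are treated as a single gram (assumption).
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_ngram(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity |A & B| / |A | B| over character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are considered identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    # "night" and "nacht" share only the bigram "ht" out of 7 distinct bigrams.
    print(jaccard_ngram("night", "nacht"))
```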

Thanks,
Philipp

izaychik63 · March 12, 2021, 8:35pm

Yes, @qqilihq, I mean this node. The lowest distance is about .69 with 2-grams and about .78 with 3-grams. By contrast, Microsoft Fuzzy Lookup, which uses a Jaccard algorithm, has a lowest value of .92, and it also matches reversed first and last names.
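To illustrate why a Jaccard measure can match reversed names: if the sets are built from whole words rather than character n-grams, word order no longer matters. The sketch below shows only that general idea in Python; it is not how Fuzzy Lookup works internally (that is not known here), and `token_jaccard` is a hypothetical helper name.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased, whitespace-separated word tokens."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are considered identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    # Same words in a different order produce identical token sets.
    print(token_jaccard("John Smith", "Smith John"))
    # A one-letter typo changes a whole token, so the score drops sharply.
    print(token_jaccard("John Smith", "John Smyth"))
```

The trade-off is that a purely word-based measure is harsh on typos, so in practice one might combine it with a character-level measure.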

qqilihq · March 13, 2021, 7:49am

Could you post some example data for me?

Thx.

izaychik63 · March 17, 2021, 6:11pm

Here’s a comparison of Excel Fuzzy Search with 2/3-gram. Similarities

qqilihq · March 17, 2021, 7:00pm

Cool, thanks. I’ll have a look!

B074534 · August 12, 2021, 5:03pm

Hi both, @qqilihq

Have you come up with a solution that compares to Microsoft Fuzzy Lookup? I need to perform fuzzy grouping of names, but I am finding it difficult with the workflows that are available online…

qqilihq · August 12, 2021, 5:48pm

I remember doing some research on how the MS fuzzy algorithm works, but there’s not much documentation to be found. I’d guess they probably use some word corpus for weighting the terms. In the end I did not have time to further evaluate or reverse-engineer this.

izaychik63 · August 12, 2021, 6:00pm

If you can describe your task in more detail, some suggestions will come up.
The simplest approach is to do a cross join and then filter on the similarity value from the String Similarity node output.
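Outside of KNIME, the same cross-join-then-filter idea can be sketched in plain Python as below. This is a rough illustration under assumptions: `bigram_jaccard` stands in for whatever similarity the String Similarity node outputs, `fuzzy_join` is a hypothetical name, and the 0.5 threshold is arbitrary.

```python
from itertools import product


def bigram_jaccard(a: str, b: str) -> float:
    """Stand-in similarity: Jaccard over character bigram sets."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def fuzzy_join(left, right, threshold=0.5):
    """Cross join two name lists and keep pairs scoring at or above threshold."""
    matches = []
    for a, b in product(left, right):  # every left name against every right name
        score = bigram_jaccard(a.lower(), b.lower())
        if score >= threshold:
            matches.append((a, b, score))
    return matches


if __name__ == "__main__":
    for row in fuzzy_join(["Anna Lee"], ["Anna Lee", "Bob Ray"]):
        print(row)
```

Note that a cross join compares every pair, so the number of comparisons grows with the product of the two table sizes; for large tables some pre-filtering would likely be needed.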

B074534 · August 16, 2021, 12:18pm

I’ve recently found this: https://www.microsoft.com/en-us/research/wp-content/uploads/2003/01/bm_sigmod03.pdf

qqilihq · August 17, 2021, 9:51am

Thanks, this looks like a good starting point. I’ll put it in our Palladian backlog – maybe something to implement and offer in the future.

system · June 2, 2023, 9:39pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.