Fuzzy matching of sequential characters

supersharp · October 31, 2019, 2:28am

I am performing a similar exercise to the ‘address de-duplication’ example with invoice numbers, a bit more challenging than addresses due to the simple fact that invoices can be different by one character, and that makes the fuzzy matching pick up totally unrelated entities if you are a bit lax with the parameters.

Is there a way to ensure that fuzzy matching can be conducted while maintaining the order sequence of the words? e.g. Invoice# 123456 and Invoice# 654321 would be picked up with a high similarity in a regular fuzzy match, but due to the order being different, they really are not similar - I’d like to avoid such matches. However, Invoice# 0123456 should be very similar (maybe the same), differing by the starting ‘0’. This could be a ‘0’, an ‘A’, a ‘B’ at the end, etc. I want to identify such cases. Any tips are appreciated!

PS. I posted this question to the new Hub but it doesn’t look like the hub gets the same attention as the forum yet. Address Deduplication

mwiegand · October 31, 2019, 7:55pm

Hi @supersharp,

with sequence of words do you mean the sequence in the cell or sequence of rows? If the change in the order number always occurs in the same location you might, for string containing seven characters after "# ", remove the first one and then search for duplicates via duplicate row filer.

I read that you sometimes have near identical duplicates, like the example of yours “Invoice# 123456” and “Invoice# 0123456”, which can be filtered out very easily but then I assume also other where you really need some more novel approaches.

Though, breaking down complexity accelerates finding a solution. Would be great if you can provide some example data too making it less abstract.

Cheers
Mike

supersharp · October 31, 2019, 8:20pm

Hi @mw
Here is a sample, I am trying to find the highlighted examples. The fuzzy matching does reduce the dataset to a manageable size, but I need to classify out the real duplicates from the ones which are ‘close’ in terms of similarity:

mwiegand · October 31, 2019, 8:42pm

Thanks for providing some sample data. Since characters, which actually separate identical values, for the yellow highlighted examples can appear at the beginning or the end the approach might work.

Beforehand, I noticed my suggestion about the duplicate row filter, as a classification approach, would only work with your example if you have one column.

Is your data always appearing in two columns are is it actually just one column? By the way, you may apply a harmonization by using a string manipulation to convert word characters to lower case. In addition, if punctuations are not considered to be valid, they might be removed. Or in other words, everything that is not alphanumeric is removed and the remaining string is converted to lower case.

Depends on how your data is generated …

HansS · November 3, 2019, 6:21pm

Hi @supersharp

In addition to the data cleaining suggested by @mw, you can take a look at the String Similarity node.
gr. Hans

supersharp · November 11, 2019, 6:46pm

Thanks @mw and @HansS

Most of your suggestions have been implemented. I am now at the end of the string similarity node, but there are still close matches, and I think it could be as simple as ‘scanning’ the longer length invoice (Y) number to see if the shorter length invoice (X) number exists within it.

It could be just a rule engine of sorts, maybe using a Y “LIKE” “X” => … but may require some funky rule-engine dictionary or flow variable syntax. Any suggestions for how to do this efficiently? I have 100k+ records.

ipazin · November 12, 2019, 12:06pm

Hi there @supersharp,

Check this topic. There you will find another link on similar topic as yours.

Br,
Ivan

ipazin · November 13, 2019, 12:35pm

No it doesn’t.

To get attention you can mentioned workflow creator in your comment - @username. Like this @supersharp

Br,
Ivan

system · May 14, 2020, 12:41am

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.