Address Deduplication

Hub · June 29, 2019, 10:05am

The workflow shows the power of the new distance measurement framework: it predicts possible matches with high accuracy, using a minimum number of nodes and no preprocessing, just by aggregating distances over several attributes. The chosen data set is the "Restaurant data set" from http://www.cs.utexas.edu/users/ml/riddle/data.html comprising 864 restaurant records and 112 duplicates. Each record contains a name, an address, a city, a type and finally a class attribute. Records with an identical value in the class attribute refer to the same real-world entity, in our case the same restaurant.
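
To make "aggregating distances over several attributes" concrete, here is a minimal sketch in plain Python. It is not the KNIME workflow itself. The weights, the function names, and the three sample records are made up for illustration, and `difflib.SequenceMatcher` from the standard library stands in for whatever distance measures the workflow actually configures.

```python
from difflib import SequenceMatcher

def attribute_distance(a: str, b: str) -> float:
    """Distance in [0, 1]: 0.0 for identical strings, 1.0 for nothing in common."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_distance(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted mean of the per-attribute distances (assumed aggregation rule)."""
    total = sum(weights.values())
    return sum(w * attribute_distance(r1[k], r2[k]) for k, w in weights.items()) / total

# Hypothetical weights: the name counts most, the city least.
WEIGHTS = {"name": 0.5, "address": 0.3, "city": 0.2}

# Made-up records, not taken from the Restaurant data set.
a = {"name": "Blue Door Cafe", "address": "12 Main Street", "city": "Springfield"}
b = {"name": "Blue Door Caffe", "address": "12 Main St.", "city": "Springfield"}
c = {"name": "Golden Wok", "address": "480 Harbor Road", "city": "Riverton"}

print(record_distance(a, b, WEIGHTS))  # small: likely the same restaurant
print(record_distance(a, c, WEIGHTS))  # large: different restaurants
```

Pairs whose aggregated distance falls below a chosen threshold would then be flagged as possible duplicates; the threshold itself has to be tuned against the class attribute.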

This is a companion discussion topic for the original entry at https://kni.me/w/QiS--QnukXBeL3mZ

supersharp · October 25, 2019, 12:38pm

This is a great workflow for fuzzy matching.

I am performing a similar exercise with invoice numbers, which is a bit more challenging than addresses: invoice numbers can differ by a single character, and that makes the fuzzy matching pick up totally different entities. Is there a way to run the fuzzy matching while preserving the order of the words (or characters)? E.g. Invoice# 123456 and Invoice# 654321 would be picked up with a high similarity in a regular fuzzy match, but because the order is different, they really are not similar. Invoice# 0123456, however, should be rated very similar to Invoice# 123456 (maybe even the same). I want to identify these cases. Any tips are appreciated!
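
The difference between the two cases above can be sketched in plain Python. This is only an illustration under assumptions, not a KNIME node configuration: `bag_similarity` is a hypothetical order-insensitive measure that compares which characters occur, and `edit_similarity` is a normalized Levenshtein edit distance, which does respect character order.

```python
from collections import Counter

def bag_similarity(a: str, b: str) -> float:
    """Order-insensitive: share of characters the two strings have in common."""
    if not a and not b:
        return 1.0
    common = sum((Counter(a) & Counter(b)).values())
    return common / max(len(a), len(b))

def levenshtein(a: str, b: str) -> int:
    """Order-sensitive edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free match
            ))
        previous = current
    return previous[-1]

def edit_similarity(a: str, b: str) -> float:
    """Levenshtein distance rescaled to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

for other in ("654321", "0123456"):
    print(other,
          round(bag_similarity("123456", other), 3),
          round(edit_similarity("123456", other), 3))
```

For 123456 against 654321 the order-insensitive score is 1.0 while the edit-based score is 0.0, because every position has to be changed. For 123456 against 0123456 a single insertion is enough, so the edit-based score stays high (about 0.86). Stripping leading zeros before comparing would make that pair identical, if that fits the invoice numbering scheme.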