I have data containing customer_code and the customers' addresses. The problem is that one customer can have multiple addresses, and I want to count the number of distinct addresses each customer has. I can't just GroupBy customer_code and do a unique count in manual aggregation for the addresses, because many of the addresses are the same but written differently.
So I ran a Levenshtein similarity search, but it takes one address and compares its distance against all the other addresses in the table. What I want is for the similarity test to run only on the addresses of a single customer_code. For example, if customer_code 123456 is listed 4 times with the same address written differently each time, the similarity search should happen within customer_code 123456, and the first address of 123456 should be compared only with the other addresses of customer 123456.
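The per-customer comparison described above can be sketched outside KNIME in plain Python. This is only a rough sketch under assumptions: the `rows` input format, the `count_distinct_addresses` helper, and the 0.75 threshold are all made up for illustration, and the greedy "compare against one representative per cluster" rule is just one simple way to group similar strings.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Levenshtein distance normalised to a 0..1 similarity score."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def count_distinct_addresses(rows, threshold=0.75):
    """rows: iterable of (customer_code, address); threshold is an assumed cut-off."""
    by_customer = defaultdict(list)
    for code, address in rows:
        by_customer[code].append(address.lower().strip())

    counts = {}
    for code, addresses in by_customer.items():
        representatives = []  # one address per group of similar addresses
        for address in addresses:
            # Compare only against addresses of the SAME customer.
            if not any(similarity(address, rep) >= threshold
                       for rep in representatives):
                representatives.append(address)
        counts[code] = len(representatives)
    return counts


rows = [
    ("123456", "12 Main Street"),
    ("123456", "12 Main Str."),
    ("123456", "12 main street"),
    ("123456", "99 Oak Avenue"),
    ("654321", "5 Elm Road"),
]
counts = count_distinct_addresses(rows)
```

Because the comparison loop runs inside each customer's own list, an address of 123456 is never compared with an address of another customer. The threshold would need tuning on real data.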
Since you say the addresses might be the same but in a slightly different format, you could try splitting the address string into a collection (set) of words with the Cell Splitter node. After that, you can apply the GroupBy node to aggregate by the IDs, counting the unique addresses represented as sets.
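To show the idea independently of the KNIME nodes, here is a small Python sketch of the same logic. It is not what Cell Splitter or GroupBy do internally; the function names, the tokenising regex, and the `rows` format are assumptions for illustration only.

```python
import re
from collections import defaultdict


def address_tokens(address):
    """Lower-case the address and return its words as an order-free set."""
    return frozenset(re.findall(r"[a-z0-9]+", address.lower()))


def count_unique_token_sets(rows):
    """rows: iterable of (customer_code, address) pairs (assumed format)."""
    seen = defaultdict(set)
    for code, address in rows:
        seen[code].add(address_tokens(address))
    return {code: len(token_sets) for code, token_sets in seen.items()}


rows = [
    ("123456", "12 Main Street"),
    ("123456", "Main Street 12"),              # same words, different order
    ("123456", "main street, 12"),             # different case and punctuation
    ("123456", "12 Main Street Springfield"),  # extra word
]
result = count_unique_token_sets(rows)
```

Comparing sets ignores word order, case, and punctuation, so the first three rows collapse into one address. An address with an extra word still counts as different, so this alone does not cover additions.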
When filling in their addresses, customers have sometimes added a street name or written their house number at the end.
I tried splitting the address and then running a similarity search, but unfortunately it hasn't worked out for me.
I'm glad for your help, though.
@Vaish_navi I have a meta collection about string similarity and address deduplication that you might want to take a look at:
Here is an example matching strings without ground truth. Since you have a customer number that should help.
Then, if the strings contain additions, you might think about using only a part of each string (since the customer number might already provide enough uniqueness):