Similarity search within customer_code

My data has a customer_code column and an address column. The problem is that one customer can have multiple addresses, and I want to count the number of distinct addresses each customer has. I can’t just GroupBy customer_code and do a unique count on the addresses in manual aggregation, because some addresses are the same but written differently.

So I did a Levenshtein similarity search, but it takes one address and compares it with all the other addresses in the table. What I want is for the similarity test to run only on the addresses of a single customer_code. For example, if customer_code 123456 is listed 4 times with the same address written differently each time, the first address of 123456 should be compared only with the other addresses of customer 123456.

Is there a way to achieve this?

Hi @Vaish_navi welcome to KNIME Forum,

Use a Group Loop for your similarity search procedure, with customer_code defining the groups.
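For readers who want to see the per-group logic spelled out, here is a minimal Python sketch of the same idea outside of KNIME. It is only an illustration under assumptions: the function names, the 0.8 threshold, the lowercasing step, and the sample rows are all made up for the example. Each customer's addresses are compared only with that customer's other addresses, and an address counts as new when it is not similar enough to any address already kept for that customer.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, row by row)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def count_distinct_addresses(rows, threshold=0.8):
    """Count distinct addresses per customer, comparing only within a customer.

    `threshold` is an assumed cut-off and needs tuning on real data.
    """
    by_customer = defaultdict(list)
    for code, address in rows:
        by_customer[code].append(address.lower().strip())

    counts = {}
    for code, addresses in by_customer.items():
        representatives = []  # one kept address per group of similar addresses
        for address in addresses:
            if not any(similarity(address, rep) >= threshold
                       for rep in representatives):
                representatives.append(address)
        counts[code] = len(representatives)
    return counts


# Hypothetical sample data
rows = [
    ("123456", "12 Main Street"),
    ("123456", "12 Main Street."),
    ("123456", "12 main street"),
    ("123456", "99 Ocean Avenue"),
    ("789012", "5 Hill Road"),
]
print(count_distinct_addresses(rows))
```

The greedy "compare against kept representatives" step is a simplification; the result can depend on row order when addresses sit close to the threshold.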

Gr. Hans

Hello @Vaish_navi

Since, as you say, the address might be the same but in a slightly different format, you can try splitting the address string into a collection (set) of words with the Cell Splitter node. After that, you can apply the GroupBy node to aggregate by customer ID, counting the unique addresses represented as sets.
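To make the word-set idea concrete, here is a small Python sketch of the same logic. It is an assumption-laden illustration, not the KNIME workflow itself: the helper names, the regular expression used for splitting, and the sample rows are hypothetical.

```python
import re
from collections import defaultdict


def address_key(address: str) -> frozenset:
    """Turn an address into an order-independent set of lowercase words."""
    words = re.findall(r"[a-z0-9]+", address.lower())
    return frozenset(words)


def count_unique_word_sets(rows):
    """Count distinct word sets per customer code."""
    keys_by_customer = defaultdict(set)
    for code, address in rows:
        keys_by_customer[code].add(address_key(address))
    return {code: len(keys) for code, keys in keys_by_customer.items()}


# Hypothetical sample data
rows = [
    ("123456", "12 Main Street, Springfield"),
    ("123456", "Main Street 12 Springfield"),
    ("123456", "springfield, main street 12"),
    ("123456", "12 Main Street, Oak Lane, Springfield"),
]
print(count_unique_word_sets(rows))
```

Note that this only treats addresses as equal when they contain exactly the same words: reordering and punctuation differences are absorbed, but an address with an extra word (like the last row) still counts as a separate one.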


Doesn’t seem to produce the results I’m looking for. Thanks for your help :slight_smile:


While filling in their addresses, customers have sometimes added a street name or written their house number at the end.
I tried splitting the address and then running a similarity search, but unfortunately it hasn’t worked out for me.
Though I’m glad for your help :sunny:

@Vaish_navi I have a meta collection about string similarity and address deduplication that you might want to take a look at:

Here is an example matching strings without ground truth. Since you have a customer number that should help.

Then if you have additions to strings you might think about using only a part of each string (since the customer number might provide enough unique value):
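As a rough illustration of the "part of each string" idea (not the linked workflow itself), the Python sketch below keeps only the first few characters of a normalised address and counts distinct prefixes per customer. Everything here is an assumption made for the example: the function names, the prefix length of 10, the normalisation rule, and the sample rows.

```python
import re
from collections import defaultdict


def prefix_key(address: str, n: int = 10) -> str:
    """Lowercase, drop everything except letters and digits, keep first n chars."""
    normalised = re.sub(r"[^a-z0-9]", "", address.lower())
    return normalised[:n]


def count_distinct_prefixes(rows, n: int = 10):
    """Count distinct address prefixes per customer code."""
    keys_by_customer = defaultdict(set)
    for code, address in rows:
        keys_by_customer[code].add(prefix_key(address, n))
    return {code: len(keys) for code, keys in keys_by_customer.items()}


# Hypothetical sample data: the same address with different trailing additions
rows = [
    ("123456", "12 Main Street"),
    ("123456", "12 Main Street, 2nd floor"),
    ("123456", "12 main street Springfield"),
    ("123456", "99 Ocean Avenue"),
]
print(count_distinct_prefixes(rows))
```

A prefix only absorbs additions at the end of the string. If the house number moves from the front to the back, the prefixes differ, so this would need to be combined with another step such as sorting the words first.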
