Clustering on String Variables (For Example: Address)

vihar · January 25, 2017, 10:34am

Hi, I am fairly new to KNIME, so pardon me if this question has already been asked.

Basically, I am working on a deduplication exercise in which a similarity score is calculated between every pair of address records in the table (I am not setting the number of nearest neighbours to 1, because I want the score for every possible pair). But my dataset has close to 8 lakh (800,000) records, and KNIME is taking far too long to run this similarity search.

As an alternative, I thought of first clustering the records by address and then running the similarity search only between the records within each cluster. When I tried running K-Means, it didn't pick up the address column, only some other numeric field.

Could anyone please suggest how I can cluster the addresses in KNIME?

Thanks in advance

Vihar

Geo · January 25, 2017, 9:17pm

For a start, use the String Distance node. Wikipedia provides a decent overview of string distance metrics. Combine it with k-means if needed, though I'm not sure that makes sense given your use case.
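To make "string distance" concrete, here is a minimal Python sketch of one common metric, the Levenshtein (edit) distance, together with a score normalised to the range 0 to 1. It is only an illustration of the idea, not what the KNIME node does internally, and the function names `levenshtein` and `similarity` are made up for this example.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Illustrative normalisation: 1.0 for identical strings,
    0.0 when every character differs."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


score = similarity("12 MG Road", "12 M G Road")
```

Note that a metric like this only gives you pairwise distances. K-means needs numeric vectors it can average, which is probably why it ignored the address column and picked a numeric field instead.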

Regarding performance, you should consider the complexity of what you are actually doing: comparing every record with every other record grows quadratically with the number of records. As already recommended in another thread of yours, divide your data, e.g. by location, to add context, and compare only the records within each group. If location is not readily available, use the String Manipulation node (regex functions) or the Column Splitter node to extract location and other contextual data (e.g. zip code, country code). If the addresses are not in a clean format, tidying them up will be the first priority.
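To put a number on it: roughly 800,000 records (8 lakh) give 800,000 × 799,999 / 2, or about 3.2 × 10^11 pairs, if every record is compared with every other. Splitting by location first (often called "blocking") could be sketched in Python as below. This is only a rough sketch under assumptions: it supposes each address contains a 5- or 6-digit postal code, the sample addresses are invented, and `difflib.SequenceMatcher` from the standard library stands in for whatever similarity measure you actually use. The helper names are hypothetical.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Assumption: a postal code is a standalone run of 5 or 6 digits.
ZIP_RE = re.compile(r"\b(\d{5,6})\b")


def block_key(address):
    """Return the last postal-code-like token, or "" if none is found."""
    matches = ZIP_RE.findall(address)
    return matches[-1] if matches else ""


def tidy(address):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", address.lower())
    return " ".join(cleaned.split())


def pairs_within_blocks(addresses):
    """Yield (i, j, score) only for records sharing a block key.

    Records without a recognisable code all land in the "" block,
    so that block can still be large and may need separate handling.
    """
    blocks = defaultdict(list)
    for idx, addr in enumerate(addresses):
        blocks[block_key(addr)].append(idx)
    for members in blocks.values():
        for i, j in combinations(members, 2):
            a, b = tidy(addresses[i]), tidy(addresses[j])
            yield i, j, SequenceMatcher(None, a, b).ratio()


# Invented sample data, for illustration only.
addresses = [
    "12 MG Road, Bangalore 560001",
    "12, M.G. Road, Bangalore - 560001",
    "7 Park Street, Kolkata 700016",
]
results = list(pairs_within_blocks(addresses))
```

Here only the two Bangalore records are compared with each other; the Kolkata record sits alone in its block and produces no pairs. The trade-off is that duplicates whose postal codes differ or are missing will never be compared, which is why tidying the data first matters.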