Hierarchical Clustering, more weight to start of String

Mapijs · April 23, 2024, 11:35am

I’m working with product data, and I want to cluster data based on article codes.
I’m using string distances, hierarchical clustering (distMatrix) and the hierarchical cluster assigner.

The start of a string is more important than the end.
Sometimes the beginning consists of alphabetical characters or numbers like “920” or “1.14”, and the length of the important part might differ between products.

In my example below, I want the codes to cluster on the “BZE” part instead of the “080016” part.

name | cluster group
BZE120200 | 1
BZE080016 | 2
BGD080018 | 2
BRF080316 | 2
TGM080016 | 2

So how can I assign more weight to the first characters of a string and less to every character after (descending in weight for each sequential character)?
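
One way to picture the descending weights asked about here is a small Python sketch, written outside KNIME. The function name `positional_distance` and the `decay` parameter are made up for illustration; it assumes the codes are compared position by position from the left, and that a missing character counts as a mismatch.

```python
def positional_distance(a: str, b: str, decay: float = 0.5) -> float:
    """Weighted mismatch distance in [0, 1]; position i has weight decay**i."""
    length = max(len(a), len(b))
    if length == 0:
        return 0.0
    total = 0.0
    mismatched = 0.0
    for i in range(length):
        weight = decay ** i  # each later character counts less than the one before
        total += weight
        # A missing character (strings of different length) counts as a mismatch.
        if i >= len(a) or i >= len(b) or a[i] != b[i]:
            mismatched += weight
    return mismatched / total


codes = ["BZE120200", "BZE080016", "BGD080018", "BRF080316", "TGM080016"]
# Full pairwise matrix, the kind of input a hierarchical clustering step expects.
matrix = [[positional_distance(x, y) for y in codes] for x in codes]
```

With a decay of 0.5, the two BZE codes come out much closer to each other than BZE080016 and TGM080016 do, even though the latter pair shares the whole “080016” tail. A decay closer to 1 flattens the weights again.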

izaychik63 · April 24, 2024, 1:31pm

Jaro-Winkler distance gives priority to the beginning of the string. Actually, to the first 1/3 of it.
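
To show where the prefix bonus comes from, here is a plain Python sketch of Jaro-Winkler in its commonly cited form, with a scaling factor of 0.1 and a bonus for at most the first four matching characters. Those two defaults are an assumption taken from the textbook definition; the KNIME implementation may use different settings.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only match if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler_distance(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """1 - Jaro-Winkler similarity; a shared prefix pulls the distance down."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return 1.0 - (sim + prefix * p * (1.0 - sim))
```

With these defaults, BZE120200 and BZE080016 end up closer than BZE080016 and TGM080016, but the bonus is modest and stops after four characters, so it does not give a weight that keeps descending over the whole string.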
Also, see example below

If you add the first symbol as a separate column, you can use the Aggregated Distance node to set its weight.
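
The idea of weighting a separate prefix column can be sketched in Python as a weighted average of two distances. This is not how the Aggregated Distance node is implemented; the function names, the weights 0.7 and 0.3, and the choice of a normalized Hamming distance for the full code are all assumptions for illustration.

```python
def normalized_hamming(a: str, b: str) -> float:
    """Share of positions that differ; extra characters count as differences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    differing = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return differing / longest


def aggregated_distance(a: str, b: str, prefix_len: int = 1,
                        w_prefix: float = 0.7, w_full: float = 0.3) -> float:
    """Weighted average of a prefix distance and a whole-code distance."""
    # Prefix distance: 0 if the first `prefix_len` characters agree, else 1.
    d_prefix = 0.0 if a[:prefix_len] == b[:prefix_len] else 1.0
    d_full = normalized_hamming(a, b)
    return (w_prefix * d_prefix + w_full * d_full) / (w_prefix + w_full)
```

Note that with `prefix_len=1` the codes BZE…, BGD… and BRF… all share the first symbol “B”, so a longer prefix such as `prefix_len=3` is probably needed for the example data.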

tomljh · April 25, 2024, 1:01am

Hello,

Would it be easier to split the current column into two columns? For example, split BZE120200 into BZE and 120200.
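
A rough Python sketch of that split, assuming the important part is the run of letters at the start of the code. The helper names are made up, and the pattern would need adjusting for codes that start with numbers such as “920” or “1.14”, which the original question also mentions.

```python
import re
from collections import defaultdict

# Leading letters go into group 1, everything after them into group 2.
_LEADING_LETTERS = re.compile(r"^([A-Za-z]+)(.*)$")


def split_code(code: str) -> tuple[str, str]:
    """Split 'BZE120200' into ('BZE', '120200'); no leading letters gives ('', code)."""
    match = _LEADING_LETTERS.match(code)
    if match is None:
        return "", code
    return match.group(1), match.group(2)


def group_by_prefix(codes: list[str]) -> dict[str, list[str]]:
    """Collect codes that share the same leading-letter prefix."""
    groups: dict[str, list[str]] = defaultdict(list)
    for code in codes:
        prefix, _rest = split_code(code)
        groups[prefix].append(code)
    return dict(groups)


codes = ["BZE120200", "BZE080016", "BGD080018", "BRF080316", "TGM080016"]
groups = group_by_prefix(codes)
```

Once the prefix sits in its own column, it can be clustered or weighted separately from the numeric remainder.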
