Different results w/ different K-Means nodes

Hi!
Any thoughts on why the K-Means node and the Weka Simple K-Means node provide such different results? Same data set, 500 max iterations, both nodes using Euclidean distance.

Cluster distribution w/ KNIME K-Means node:
image

Cluster distribution w/ Weka Simple K-Means node:

Thanks!

Hi,

It could come from the structure of your data. How many columns/variables are you feeding in? Did you run a PCA before the K-Means? Did you try to detect an optimal number of groups before running it?

Best regards

Data is indexed then normalized (z-score) before running through K-Means. There are 33 columns/variables… The number of clusters must be 4 as it’s what we deem the field can currently execute reliably.

If you run a PCA after your normalization and then run K-Means on the PCA values, is there still a difference?

Sorry @Fabien_Couprie I’m not familiar with the term PCA?

Principal component analysis. With 33 potentially correlated columns, I recommend it. K-Means uses Euclidean distance, and if your columns are correlated the axes are not independent, so you are not really working in orthogonal (Cartesian) coordinates. Even if K-Means is fairly robust to that, it is better to do it properly. I would also have tested the optimal number of groups anyway, to see whether 4 is a good candidate given the data structure. That could explain things, as the algorithm may not settle on a stable solution if the number of groups really doesn't fit the data structure.
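To illustrate the point about correlated columns, here is a minimal sketch in plain Python. The points and the `with_copies` helper are made up for illustration: copying the first coordinate stands in for several perfectly correlated columns. It only shows how redundant columns can change which point counts as "nearest" under Euclidean distance, which is the quantity K-Means relies on.

```python
from math import dist  # Euclidean distance, Python 3.8+

def with_copies(point, copies):
    """Repeat the first coordinate `copies` times (hypothetical helper).

    This mimics a data set where one variable appears as several
    perfectly correlated columns.
    """
    return (point[0],) * copies + tuple(point[1:])

# Made-up points: p1 differs from the query on the first variable only,
# p2 differs on the second variable only.
query, p1, p2 = (0.0, 0.0), (3.0, 0.0), (0.0, 4.0)

def nearest(copies):
    """Name of the point closest to the query after duplicating column 1."""
    q = with_copies(query, copies)
    candidates = {"p1": with_copies(p1, copies), "p2": with_copies(p2, copies)}
    return min(candidates, key=lambda name: dist(q, candidates[name]))

nearest_original = nearest(1)     # one copy of each variable
nearest_with_copies = nearest(3)  # first variable present three times

print(nearest_original, nearest_with_copies)
```

With one copy, p1 is closer (distance 3 versus 4). With three copies of the first variable, that variable counts three times in the squared distance, and p2 becomes the closer point. A PCA before clustering is one way to remove this kind of redundancy.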

Hi @Snowy

I am not sure about the K-means node in KNIME, but in general the K-means algorithm starts by randomly choosing a centroid value for each cluster. This may be an explanation for getting different results from both nodes.
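As a rough illustration of why the starting centroids matter, here is a small sketch of plain Lloyd's K-Means in Python. It is not the implementation used by either the KNIME or the Weka node; the four points and the two sets of starting centroids are made up. The starting centroids are passed in explicitly (instead of being drawn at random) so the effect is reproducible.

```python
from math import dist  # Euclidean distance, Python 3.8+

def assign(points, centroids):
    """Assignment step: index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]

def kmeans(points, centroids, max_iter=500):
    """Plain Lloyd's K-Means from explicit starting centroids (a sketch)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # nothing moved: the run has converged
        centroids = new_centroids
    labels = assign(points, centroids)
    # Inertia: sum of squared distances to the assigned centroid.
    inertia = sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, inertia

# Made-up data: the corners of a 4 x 1 rectangle, split into k = 2 clusters.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Start A: one centroid on the left pair, one on the right pair.
labels_a, _, inertia_a = kmeans(points, [(0, 0), (4, 0)])
# Start B: one centroid on the bottom edge, one on the top edge.
labels_b, _, inertia_b = kmeans(points, [(2, 0), (2, 1)])

print(labels_a, inertia_a)
print(labels_b, inertia_b)
```

Both runs converge after a couple of iterations, but to different partitions: start A splits left/right, start B splits bottom/top with a much larger inertia. Raising the iteration limit does not change either result, because each run has already stopped moving.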

gr. Hans

Indeed; however, over 500 iterations I would expect the cluster centers to align or be very close, which should produce a very similar distribution of cluster assignments.

Hmmmm, not really. If there is no convergence, you get different clusterings alternating repeatedly, which means there is no unique solution in that case. For example, if your data is perfectly spherical and uniformly distributed, with the same distance between each point (I know this is rare, but it is an example), and you ask for 4 groups, where do you cut? You can also build a crosstab of your 2 solutions to identify strong and weak groups. By the way, as the case is interesting, can you send your data, or is there a confidentiality issue?
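The crosstab idea can be sketched in a few lines of plain Python. The two label lists below are invented for illustration; in practice they would be the cluster columns produced by the two nodes for the same rows. Cluster numbers are arbitrary in each solution, so what matters is whether each row of the table concentrates in a single column.

```python
from collections import Counter

def crosstab(labels_a, labels_b):
    """Count how often each pair (cluster in A, cluster in B) occurs."""
    counts = Counter(zip(labels_a, labels_b))
    rows = sorted(set(labels_a))
    cols = sorted(set(labels_b))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

# Hypothetical assignments for the same 10 records from the two nodes.
knime_labels = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]
weka_labels = [2, 2, 2, 0, 0, 1, 1, 3, 3, 3]

rows, cols, table = crosstab(knime_labels, weka_labels)

# Rows are clusters from the first solution, columns from the second.
print("      " + " ".join(f"W{c:>2}" for c in cols))
for r, counts_row in zip(rows, table):
    print(f"K{r:>2}   " + " ".join(f"{n:>3}" for n in counts_row))
```

In this made-up table, the first two KNIME clusters each map entirely onto one Weka cluster (strong groups), while the third is split across two Weka clusters (a weak group).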
