How to determine the optimal number of clusters?

Hi,

Following up on my last topic, I now wonder whether there is a method in KNIME to determine the best number of clusters.

I think there could be a node with an option to choose among the best-known methods, such as Elbow, Silhouette, and the gap statistic, to determine the best number of clusters.

Best,
Armin

Hi @armingrudd,

Did you take a look at this post, "Determine the right number of clusters"?
It links to one of our examples (https://www.knime.com/nodeguide/control-structures/loops/loop-over-a-set-of-parameter-for-k-means). This is a pretty straightforward way to do it: simply try out different numbers of clusters and compare their entropy scores.

Cheers,
Simon

Thanks for the reply Simon,

I had checked this workflow before.
The Entropy Scorer node needs a reference column. The example clusters the Iris dataset, which has a class column.
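
To illustrate why a reference column is needed, here is a minimal Python sketch of an entropy-style score. It is an assumption about the general idea (the size-weighted entropy of the reference classes inside each cluster), not necessarily the exact formula the Entropy Scorer node uses, and the function name `weighted_cluster_entropy` is made up for this example.

```python
import math
from collections import Counter, defaultdict

def weighted_cluster_entropy(cluster_ids, reference_classes):
    """Size-weighted mean entropy (in bits) of reference classes per cluster.

    Hypothetical helper for illustration; lower is better, 0 means every
    cluster contains a single reference class.
    """
    by_cluster = defaultdict(list)
    for cid, ref in zip(cluster_ids, reference_classes):
        by_cluster[cid].append(ref)

    n = len(cluster_ids)
    total = 0.0
    for members in by_cluster.values():
        counts = Counter(members)
        # Shannon entropy of the class distribution inside this cluster
        h = -sum((c / len(members)) * math.log2(c / len(members))
                 for c in counts.values())
        total += (len(members) / n) * h
    return total

# Clusters that match the classes exactly vs. clusters that mix them evenly
pure = weighted_cluster_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = weighted_cluster_entropy([0, 0, 1, 1], ["a", "b", "a", "b"])
```

Without the `reference_classes` argument there is nothing to compute the entropy against, which is the limitation described above.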

How can I use this method to find the optimal number of clusters for my dataset, which has no class column? I want to cluster the data and use the clusters as classes for classification.
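
For reference, one label-free way to approach this is the Silhouette method mentioned at the top of the thread: cluster the data for several values of k and keep the k with the highest mean silhouette coefficient. Below is a rough, standard-library-only Python sketch of that idea. The toy data, the farthest-point initialisation, and the function names are all assumptions made for illustration; this is not the KNIME workflow from the example.

```python
import math
import random

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # next center: the point farthest from all centers chosen so far
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(x) / len(members)
                                         for x in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; needs at least two non-empty clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # common convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        total += (b - a) / max(a, b)
    return total / len(points)

# Toy data (assumption): three well-separated 2-D blobs, no class column
random.seed(0)
blobs = [(0, 0), (10, 0), (0, 10)]
points = [(cx + random.gauss(0, 0.5), cy + random.gauss(0, 0.5))
          for cx, cy in blobs for _ in range(30)]

scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
```

On real data the silhouette curve is often much flatter than on this toy example, so it is worth looking at `scores` for all k rather than trusting `best_k` blindly.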

Best,
Armin

Finding the optimal number of clusters is tricky and depends on many aspects. There is no node in KNIME that performs any of the methods you mentioned above, but you could build a workflow for it. You may want to take a look at this post where a user had a similar question:

Cheers,
Simon

Thanks again Simon.

I had read that topic as well and didn’t get what I needed.
I think I’m gonna use some R code for now, but I’d really suggest a node for this. Why not?

Best,
Armin

Armingrudd, you may be interested in a recent post on the topic

Yes, we already have such a node on our list for a future release; it would be a really nice new node. However, I cannot promise which release it will land in.
The topic @izaychik63 linked to contains a link to an extension that seems to do what you are looking for. I don’t know the extension, and it is not a trusted KNIME Community extension, so you are of course free to use it, but I cannot guarantee that it works correctly. Thanks for the link, though, @izaychik63.
If the extension does not provide what you are searching for, using R might be the easiest solution for now.

Cheers,
Simon

Unfortunately, the example workflow has a bug. The decisive variable is not connected, so it always comes back with 3 as the optimal number of clusters.

https://nodepit.com/workflow/public-server.knime.com%3A80%2F_Old%20Examples%20(2015%20and%20before)%2F011_FlowVarsAndLoops%2F011003_loopParametersKMeans

You are linking to an old example. The current version is working for me: https://www.knime.com/nodeguide/control-structures/loops/loop-over-a-set-of-parameter-for-k-means

The workflow on the examples server doesn’t have this bug. The number of clusters in the K-Means node is read from the variable.
Thank you everybody. I hope we’ll have the new node for determining the optimal number of clusters in KNIME soon.

Chapter 8, Clustering, can be downloaded free of charge at this address: https://www.manning.com/books/practical-data-science-with-r

It’s a very useful read, and the ideas can easily be applied in KNIME, even without R.

+1 to the suggestion.

Hi @badger101,

Actually, the Silhouette Coefficient node has been available since KNIME 4.1.

:blush:

Thank you Armin, will check it out!
