Metrics Nodes for Cluster Analysis Missing?

Hi,

I’m running k-means clustering to identify personas in my data sets. It is completely unsupervised and I don’t have labels of any sort. Each data set has about 250,000 observations.

I want to evaluate the clusters using some standard metrics, or implement something like the clValid library that exists in R. I tried clValid, but its implementation fails for big data sets with the error: Error in hclust(Dist, method) : size cannot be NA nor exceed 65536

I did not see any nodes in KNIME that compute Dunn, DBI, Silhouette, etc. I came across this post, which basically says to implement your own. Looking at the source code of some of these functions, I don’t think it is as straightforward as it sounds.
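
For anyone who ends up on the implement-your-own route: of the indices named above, DBI is probably the least painful at this scale, because it needs only the cluster centroids and each point's distance to its own centroid, not the full pairwise distance matrix that the hclust call in the error above depends on. The following is a minimal sketch in plain Python. It is not a KNIME node and not clValid's code; the function name `davies_bouldin` and the input layout (a list of numeric tuples plus a parallel list of cluster labels) are assumptions made for illustration.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index with Euclidean distance (lower is better).

    points: list of equal-length numeric tuples (hypothetical layout)
    labels: list of cluster ids, one per point
    """
    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    # Centroid of each cluster (column-wise mean).
    centroids = {
        lab: tuple(sum(col) / len(pts) for col in zip(*pts))
        for lab, pts in clusters.items()
    }
    # Scatter: mean distance of the members to their own centroid.
    scatter = {
        lab: sum(math.dist(p, centroids[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }

    # For each cluster take the worst (largest) similarity ratio to any
    # other cluster, then average. Two coinciding centroids would divide
    # by zero; this sketch does not handle that case.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)
```

The cost grows linearly with the number of rows (times the number of clusters squared for the final step), so 250,000 observations should be within reach. Inside KNIME, something like this could presumably be pasted into a Python scripting node, but that wiring is not shown here.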

Is there a quick and easy KNIME way to do this that I’m missing, or do I need to take the painful non-KNIME path to success?

Thanks!

Mohammed Ayub

Hi Mohammed,

I have implemented the Silhouette algorithm within KNIME, though I have not made it easy to use and, as I remember, it accepts only PMML cluster models and supports only Euclidean distance. (It is a fast implementation, though, and is supposed to work with KNIME grids too.)
Cheers, gabor
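
To make concrete what a Euclidean Silhouette computes, here is a minimal plain-Python sketch. It is not the code of the KNIME implementation mentioned above; the names `silhouette_score` and `sampled_silhouette` are made up for this example. Because the exact coefficient compares every point with every other point, the second function scores a random subsample instead, which is one common workaround for data sets of a few hundred thousand rows (an assumption about what is acceptable here, not something the thread settles).

```python
import math
import random


def silhouette_score(points, labels):
    """Mean silhouette coefficient with Euclidean distance. O(n^2) time."""
    sizes = {}
    for lab in labels:
        sizes[lab] = sizes.get(lab, 0) + 1
    if len(sizes) < 2:
        raise ValueError("need at least two clusters")

    total = 0.0
    n = len(points)
    for i in range(n):
        own = labels[i]
        if sizes[own] == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # Sum of distances from point i to the members of every cluster.
        sums = dict.fromkeys(sizes, 0.0)
        for j in range(n):
            if j != i:
                sums[labels[j]] += math.dist(points[i], points[j])
        a = sums[own] / (sizes[own] - 1)  # mean distance within own cluster
        b = min(sums[lab] / sizes[lab] for lab in sizes if lab != own)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


def sampled_silhouette(points, labels, sample_size, seed=0):
    """Silhouette on a random subsample, to keep the O(n^2) cost bounded."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(points)), min(sample_size, len(points)))
    return silhouette_score([points[i] for i in idx], [labels[i] for i in idx])
```

The subsampled value is only an estimate, and a small sample can miss a cluster entirely, in which case the sketch raises `ValueError`. Repeating it with a few different seeds gives a rough idea of the spread.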

Thanks Gabor. How do I take your GitHub effort and import it into my KNIME workspace?
I’m not sure how much this will help me, but I can try.

You can try the following update site: https://github.com/aborg0/com.mind_era.knime.silhouette/releases/tag/v0.0.0_20180709
Usage:

  • Unzip the zip file
  • Help | Install New Software…
  • Add…
  • Local
  • Select the location of the update site folder
  • Select KNIME Silhouette feature (from the only KNIME cluster measures category)
  • Next
  • Next
  • Accept the license if it is suitable for you
  • Finish
  • Accept that it is not signed (OK)
  • Restart KNIME

DO NOT use it in production; its further releases will most probably not be compatible.

PS.: Feedback, PRs, and an icon for it are welcome :)

On the topic of clustering… I want to choose the number of clusters by examining:

  • Adjusted Rand Index (ARI) - how closely two clusterings of the same data agree, e.g. across repeated runs (cluster stability)
  • Calinski-Harabasz Index (CH) - measures both the compactness and the distinctness of the clusters

I don’t know how to do this with KNIME nodes, though. Or is this built into the learner node and I missed it?
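
In case it helps while waiting for a node-based answer, both indices are short enough to sketch in plain Python. This is an illustration only, not KNIME functionality; the function names and the input layout (lists of numeric tuples and parallel label lists) are assumptions. Note that ARI compares two label assignments of the same rows, so for a stability check it would be fed the labels from two separate clustering runs.

```python
from collections import Counter
from math import comb


def _centroid(pts):
    # Column-wise mean of a list of equal-length tuples.
    return tuple(sum(col) / len(pts) for col in zip(*pts))


def _sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion (higher is better)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    n, k = len(points), len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= number of clusters < number of points")
    overall = _centroid(points)
    between = within = 0.0
    for pts in clusters.values():
        c = _centroid(pts)
        between += len(pts) * _sqdist(c, overall)
        within += sum(_sqdist(p, c) for p in pts)
    # A within-cluster dispersion of exactly 0 would divide by zero;
    # this sketch does not handle that case.
    return (between / (k - 1)) / (within / (n - k))


def adjusted_rand_index(labels_a, labels_b):
    """Agreement of two labelings of the same rows, corrected for chance."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two labelings of the same >= 2 rows")
    # Pairs of rows that share a cluster in both, in A only, in B only.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / comb(n, 2)
    max_index = (in_a + in_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (both - expected) / (max_index - expected)
```

An ARI of 1 means the two runs produced the same partition (label names do not matter), and values near 0 mean the agreement is about what chance would give.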

In the learner nodes I see k-means… can I also do k-medians clustering (k-medians clustering - Wikipedia) and Neural Gas (Neural gas - Wikipedia)?

Are there other metrics (in KNIME) that can be used to optimize the number of clusters for the k-means learner?

You could try the Silhouette Coefficient node and wrap your whole workflow in a parameter optimization loop.
Then your variable is the number of clusters and you optimize the Silhouette Coefficient.
(@Community, by the way, I have not seen inertia. Is it also available in KNIME?)
br
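
For readers who have not met the term: inertia is the within-cluster sum of squared distances to the assigned centroid, i.e. the quantity k-means minimizes. The sketch below shows, outside KNIME and in plain Python, what such a loop over the number of clusters amounts to. The tiny `kmeans` here is a stand-in written for this example (plain Lloyd iterations with a seeded random start), not the KNIME k-Means learner, and all function names are hypothetical.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm on a list of tuples; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    labels = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances to the assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def inertia_by_k(points, k_values, seed=0):
    """Run the clustering once per k and record the inertia."""
    result = {}
    for k in k_values:
        labels, centroids = kmeans(points, k, seed=seed)
        result[k] = inertia(points, labels, centroids)
    return result
```

Inertia generally keeps falling as k grows, so it is usually read by looking for the "elbow" where the drop flattens rather than by taking the minimum. A score such as the Silhouette Coefficient could be recorded in the same loop in place of inertia, and that one can simply be maximized.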
