I’m running k-means clustering to identify personas in my data sets. It is completely unsupervised, and I don’t have labels of any sort. Each data set has about 250,000 observations.
I want to evaluate the clusters using some standard metrics, or implement something like the clValid library that exists in R. I tried this, but its implementation fails for big data sets with the error: Error in hclust(Dist, method) : size cannot be NA nor exceed 65536
I did not see any nodes in KNIME that compute Dunn, DBI, Silhouette, etc. I came across this post, which basically says to implement your own. Having looked into the source code of some of these functions, I don’t think it is as straightforward as it sounds.
Is there a quick and easy KNIME way to do this that I’m missing, or do I need to take the painful non-KNIME path to success?
I have implemented the Silhouette algorithm within KNIME, though I have not made it easy to use; as far as I remember, it accepts only PMML cluster models and supports only Euclidean distance. (It is a fast implementation, though, and is supposed to work with KNIME grids too.)
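For anyone who ends up on the non-KNIME path: the silhouette computation itself is short, and the costly part is the pairwise distances, which is why 250,000 rows are a problem. Below is a minimal Python sketch, assuming Euclidean distance and a random sample of the rows. It is not the KNIME implementation mentioned above, and the function names (`mean_silhouette`, `sampled_silhouette`) are made up for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(points, labels):
    """Mean silhouette coefficient. O(n^2) distances, so use it on a sample."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # convention: s(i) = 0 for a singleton cluster
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:  # all-identical points contribute 0
            total += (b - a) / max(a, b)
    return total / len(points)


def sampled_silhouette(points, labels, sample_size=2000, seed=0):
    """Silhouette on a random subset (sample_size is an assumed default).

    Raises ValueError if the sample happens to contain a single cluster.
    """
    n = len(points)
    rng = random.Random(seed)
    idx = rng.sample(range(n), sample_size) if n > sample_size else list(range(n))
    return mean_silhouette([points[i] for i in idx], [labels[i] for i in idx])


# Tiny usage example: two tight, well-separated groups.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(sampled_silhouette(pts, [0, 0, 1, 1]))
```

Sampling trades some accuracy for feasibility; repeating it with a few different seeds gives a feel for how stable the score is. If you are in Python anyway, scikit-learn's `silhouette_score` has a `sample_size` argument for the same purpose.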
You could try the Silhouette Coefficient node and wrap your whole workflow in a parameter optimization loop.
Then your variable is the number of clusters, and you optimize the Silhouette Coefficient.
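To make the idea concrete outside KNIME, here is a rough Python sketch of the same loop: cluster for each candidate k, score each result with the silhouette, and keep the best k. It only illustrates the logic and is not what the KNIME nodes do internally. All names are hypothetical, and the k-means is a bare Lloyd's algorithm with a farthest-first start, chosen so the toy example is deterministic.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (O(n^2))."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("need at least two non-empty clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters score 0
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist(points[i], points[j]) for j in m) / len(m)
                for other, m in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def init_centers(points, k):
    """Farthest-first start: deterministic, an assumption made for this sketch."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = init_centers(points, k)
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def best_k(points, candidates):
    """Score every candidate k and return the one with the highest silhouette."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores


# Toy data: three small 3x3 grids of points around well-separated centers.
offsets = (-0.5, 0.0, 0.5)
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 0), (0, 10)]
          for dx in offsets for dy in offsets]
chosen, scores = best_k(points, range(2, 6))
print(chosen, scores)
```

On 250,000 rows you would compute the score on a random sample rather than on all rows, since the silhouette needs pairwise distances.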
(@Community, by the way, I have not seen the inertia. Is it also available in KNIME?)
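For readers unfamiliar with the term: inertia usually means the within-cluster sum of squared distances from each point to its cluster centroid (scikit-learn exposes it as `KMeans.inertia_`). Whether a KNIME node reports it is the open question here. If it has to be computed by hand from the data and the cluster assignments, a minimal Python sketch could look like this, where the function name `inertia` is just illustrative:

```python
def inertia(points, labels):
    """Within-cluster sum of squared Euclidean distances to each centroid."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)

    total = 0.0
    for members in groups.values():
        # Centroid = per-dimension mean of the cluster's members.
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(point, centroid))
            for point in members
        )
    return total


# Centroids are (0, 1) and (10, 2); squared distances are 1 + 1 + 4 + 4.
example = inertia([(0, 0), (0, 2), (10, 0), (10, 4)], [0, 0, 1, 1])
print(example)  # 10.0
```

Unlike the silhouette, this is linear in the number of rows, so it is cheap even on the full data. It always decreases as k grows, so it is typically read off an elbow plot rather than maximized or minimized directly.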