Evaluation of clustering results

Hi,

I want to compute these measures to evaluate the results of my clustering process:

- Standard Deviation from centroids;

- Sum of Squared Error (SSE);

- Silhouette Coefficient.

How can I do this in KNIME?

Thanks


Hi,

what do you mean by "error"? Is it the distance of the individual data points to their respective centroids? I have prepared a workflow that should point you in the right direction for solving the first two problems. The silhouette coefficient, I think, is a bit more difficult to build as a workflow.
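
For reference, outside of KNIME the first two measures boil down to a few lines. Here is a minimal sketch in Python with scikit-learn; the data `X` and the choice of k are placeholders for illustration, not from the workflow:

```python
# Minimal sketch: SSE and standard deviation from centroids for a k-means
# result. X and k=5 are placeholders for your own data and cluster count.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # stand-in for your numeric columns

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Distance of each data point to its assigned centroid
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

sse = np.sum(dists ** 2)                 # Sum of Squared Errors (= km.inertia_)
std_dev = np.sqrt(sse / len(X))          # standard deviation from centroids

print(f"SSE: {sse:.2f}  std dev from centroids: {std_dev:.3f}")
```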

Regards,

Alexander


Dear Alexander:

I just downloaded your workflow. To test how it works, I randomly created a dataset with 5 clusters. I made some minor changes to it: the first thing I did was add a loop to run k-Means with different numbers of clusters, to find out which number would be optimal. Everything worked perfectly, but I cannot really determine whether 5 is the optimal number of clusters for the dataset I created.

On the other hand, you may have made a mistake.


Can you help me understand or read the information in the final table? If I understand correctly, the optimal number of clusters should be 9, not 5 as I would have expected.

I enclose my workflow.

Best regards.

Gabriel Cornejo

CHILE

Hello Gabriel,

determining the number of clusters accurately can be pretty tricky. If you continued to increase the number of clusters, you would still see a decrease in the standard deviation. Just imagine you had one cluster for each data point: then the distance of each point to its centroid would be zero, and so the standard deviation and the error would also be zero. There are many ways of estimating a good value for k, some of which are explained here: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set

Your approach of calculating multiple clusterings in a loop is a good start, and I would suggest you use the "Elbow Method" as described on Wikipedia. Instead of calculating the standard deviation and sum of squared errors, you calculate the between-group variance, which is basically just the variance of the centroids, and divide it by the total variance. Then you plot this ratio for multiple numbers of clusters and see if you can find the "elbow".
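
To illustrate the idea outside KNIME, here is a minimal Python sketch on synthetic data; the loop over k mirrors Gabriel's loop, and for Euclidean k-means the between-cluster sum of squares is simply the total minus the within-cluster sum of squares:

```python
# Minimal sketch of the elbow method: explained variance ratio BSS/TSS
# for a range of k. Data is synthetic; substitute your own table.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

tss = np.sum((X - X.mean(axis=0)) ** 2)   # total sum of squares

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss = km.inertia_                     # within-cluster sum of squares
    bss = tss - wss                       # between-cluster sum of squares
    print(f"k={k}: explained variance ratio = {bss / tss:.3f}")

# Plot the ratio against k; the "elbow" is where additional clusters stop
# adding much explained variance.
```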

Regards,

Alexander


Hi there, using Gabriel's loop method and randomly generated data, I used the ANOVA node (which automatically gives the within- and between-group variances, with their respective degrees of freedom, for any cluster input variables) and plotted, for different values of k: 1) the F-test statistic, 2) the proportion of variance explained (the elbow method?), and 3) the within-group variances. Please note that I have added the per-cluster sums of squares together (hopefully this is OK?), which will only work if all values have been z-transformed.

Can you let me know if this is correct? If so, the clusters should be optimal for the value of k where the F-test statistic is maximised (at its highest).
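
If it helps to see the arithmetic spelled out, here is a minimal Python sketch of the pooled F-ratio for a range of k; the data are synthetic and assumed z-transformed, as noted above:

```python
# Minimal sketch of the pooled F-ratio across clusters:
#   F = (BSS / (k - 1)) / (WSS / (n - k))
# Summing sums of squares across columns is fine once every column is
# z-transformed. Data here is synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))             # pretend: z-transformed columns

n = len(X)
tss = np.sum((X - X.mean(axis=0)) ** 2)

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss = km.inertia_
    bss = tss - wss
    f_stat = (bss / (k - 1)) / (wss / (n - k))
    print(f"k={k}: F = {f_stat:.1f}")     # pick k where F peaks
```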

In the following book (Practical Data Science with R), the process of clustering is nicely explained in chapter 8 (the full chapter is available as a free sample download). By the way, the book is really well written and worth buying even if the suggested code is in R.

https://www.manning.com/books/practical-data-science-with-r

The authors suggest not relying only on the aforementioned elbow method (based on the within-cluster sum of squares) but also calculating the Calinski-Harabasz index, then choosing the k where both indicators agree. To check whether the clusters are robust and mean something beyond the mere grouping, they propose the cluster bootstrap, which can be implemented in KNIME using the Loop nodes.
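
For reference, the Calinski-Harabasz index is exactly the between/within variance ratio scaled by the degrees of freedom (the F-ratio sketched above), and scikit-learn ships it directly; a minimal sketch on synthetic data:

```python
# Minimal sketch of the Calinski-Harabasz index over a range of k.
# CH(k) = (BSS / (k - 1)) / (WSS / (n - k))
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: CH = {calinski_harabasz_score(X, labels):.1f}")
```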

Would you be able to implement it with the data file we have provided? It is one thing to read about it in a textbook, but another to implement it correctly. Many thanks.

I know it is definitely possible to implement the book's suggestions using KNIME nodes, in particular the Math Formula, GroupBy, and Loop nodes (for different values of k and for the cluster bootstrap; a sketch of the bootstrap follows below). Unfortunately, it will take quite a few hours to build such a workflow. Feel free to post any precise questions here.
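
As a rough illustration of what that bootstrap loop would compute, here is a minimal Python sketch; the data are synthetic, all variable names are my own, and the Jaccard-based stability score follows the general recipe the book describes rather than any KNIME node:

```python
# Minimal sketch of the cluster bootstrap: resample the rows, re-cluster,
# and record how well each original cluster is recovered (max Jaccard
# overlap per bootstrap run).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
k, n_boot = 5, 100

base = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
stability = np.zeros(k)

for _ in range(n_boot):
    idx = rng.integers(0, len(X), len(X))        # bootstrap resample
    boot = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
    for c in range(k):
        orig = set(np.where(base[idx] == c)[0])  # cluster c within resample
        if not orig:
            continue
        best = max(
            len(orig & set(np.where(boot == b)[0])) /
            len(orig | set(np.where(boot == b)[0]))
            for b in range(k)
        )
        stability[c] += best / n_boot

print(stability)  # rule of thumb: mean Jaccard below ~0.6 = unstable cluster
```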

P.S.: the most important thing in clustering is not the technical process; it's the objective of your analysis and the domain knowledge that you bring along. Consider this example: http://adn.biol.umontreal.ca/~numericalecology/data/scotch.html

Hi, can I ask for some help with my undergrad thesis? I have electric load data for the month of April, and I am using KNIME's k-Means node to cluster the data to determine its base, mid, and peak load. How can I check the validity of my clusters? An image of my workflow is attached.

Hi,

You can use this component to calculate the distance of every point to its centroid. From there, you can aggregate and calculate the measures you are after:
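
As a cross-check outside KNIME: the silhouette coefficient, the one measure from the original question not covered above, is a one-liner in scikit-learn (synthetic data for illustration):

```python
# Minimal sketch: mean silhouette coefficient for a k-means result.
# Values near 1 mean well-separated clusters, near 0 overlapping ones.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print(f"mean silhouette: {silhouette_score(X, labels):.3f}")
```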

Hope it helps,
Andrea
