Parameter Iteration

Hi everyone,

I want to find the optimal k for the k-means algorithm. I am already using the expert mode, but I haven't been able to iterate over the parameters. How could I possibly do this? Basically, what I really need is a workflow that:

1 - Iterates over values for k until the clusters stabilize.

I think I should use the "Variable loop(data)" and the Entropy Scorer node, but I don't know how to use them. Could someone help, please?


Thanks very much

I'll help you, just no time right now; I've done this often... come back in a few days, I'll post it here.

OK Al, thanks very much for your help.

The public workflow server contains an example workflow. It's under '011_FlowVarsAndLoops/011003_loopParametersKMeans'.

I hope it helps (and saves time for "AI").

Hello! Thanks for your help.

I think it's very similar to what I want, but the main difference is: in the example, the dataset includes the class of the instances. In my case, I don't know the class. My goal is to find the k at which the clustering starts to stabilize. As far as I know, this is possible by measuring the mean distance between clusters; when k starts to stabilize, this distance stabilizes too. How could I do that?


Thanks very much

Hi! I programmed my own node for that.

The k-means clustering in KNIME doesn't provide an error measure.

You could calculate the sum of all Euclidean distances between the cluster centers and their assigned instances. The error gets smaller with each cluster you add, so to find a good choice you can plot the error. You'll find the "best" number of clusters (k) at the elbow of that plot.
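To illustrate the idea outside KNIME, here is a minimal sketch of that error-per-k loop using scikit-learn (an assumption on my part; any k-means implementation that exposes the within-cluster error would work the same way):

```python
# Sketch: compute the within-cluster error for a range of k values,
# then look for the "elbow" in a plot of these errors.
import numpy as np
from sklearn.cluster import KMeans

def elbow_errors(data, k_values):
    """Return the within-cluster error (inertia) for each k."""
    errors = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        # inertia_ is the sum of squared distances of the instances
        # to their closest cluster center
        errors.append(km.inertia_)
    return errors
```

You would then plot `k_values` against the returned errors and pick the k where the curve bends, i.e. where adding more clusters stops reducing the error substantially.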

For example, I re-executed the k-means for different values of k and calculated the error (for that you need to change the node's implementation a bit).

Workflow:

Loop Start node -> k-means -> Loop End node -> error node (chooses best k) -> k-means (again with the best k) -> and so on...

The information quality measure used in the example does not suit you here, since it relies on a so-called reference clustering that you don't have. It calculates how well the new clustering represents the old "reference" classes.

You need an error measure for your clustering itself. You could also calculate the so-called "silhouette coefficient", which measures the homogeneity within clusters and the heterogeneity between clusters.
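Again outside KNIME, a sketch of picking k by the silhouette coefficient, using scikit-learn (an assumption; the function name `best_k_by_silhouette` is mine for illustration):

```python
# Sketch: choose k by maximizing the mean silhouette coefficient.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(data, k_values):
    """Return (best_k, scores) where scores maps each k to its mean
    silhouette coefficient.

    The silhouette of an instance is (b - a) / max(a, b), where a is its
    mean distance to instances in its own cluster and b is its mean
    distance to the nearest other cluster; values near 1 mean tight,
    well-separated clusters. Note k must be >= 2 for the score to be
    defined.
    """
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
        scores[k] = silhouette_score(data, labels)
    return max(scores, key=scores.get), scores
```

Unlike the elbow method, this gives you a single number to maximize, so no visual inspection of a plot is needed.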

PS: there is a node in the Weka plugin called XMeans which automatically finds a good cluster count (using the first error measure I described). For that, you set a minimum and a maximum number of clusters and the node finds the elbow itself.