Dear KNIMers,
I’m doing hierarchical clustering and would like to know whether there is any plan to implement the UPGMA (unweighted average) method. Furthermore, I was puzzled by the number of output clusters that has to be defined… I am using another implementation (IO CSV Writer …), and while the one in KNIME still runs and runs, the other has already finished.
More generally, are there any plans to extend the hierarchical clustering features in KNIME “natively” (of course one can always use R :->)?
I would recommend the New Hierarchical Clustering node in KNIME Lab’s distance matrix feature. It has the average linkage method. I might also recommend the HiTS experimental features (providing optimal leaf ordering, change to the opposite order, heatmap with dendrogram, …), but I have to admit I am the developer of those nodes, so this is not an independent suggestion… (And I should create more documentation.)
Bests, gabor
I have never heard of UPGMA; can you give us a link to it? If it’s not too complicated, we can certainly add it.
The definition of output clusters is “only” for assigning cluster numbers to the rows in the end; it does not have an effect on the actual clustering, i.e. the dendrogram that is generated.
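To illustrate the point about the output cluster count, here is a minimal Python sketch (not KNIME code). It assumes the dendrogram is stored as a list of merges in merge order; the function name `cut_dendrogram` and the sample data are hypothetical. The same merge list yields different row labels depending only on where it is cut.

```python
def cut_dendrogram(n_rows, merges, k):
    """Assign cluster numbers by replaying the first n_rows - k merges.

    merges is a list of (i, j) pairs in merge order. Ids 0..n_rows-1 are
    the rows; the m-th merge creates the new id n_rows + m (an assumed
    encoding, chosen only for this sketch).
    """
    if not 1 <= k <= n_rows:
        raise ValueError("k must be between 1 and n_rows")

    parent = list(range(n_rows + len(merges)))

    def find(x):
        # Follow parent links up to the current cluster id.
        while parent[x] != x:
            x = parent[x]
        return x

    # The merges themselves are fixed; k only decides how many are replayed.
    for m, (i, j) in enumerate(merges[: n_rows - k]):
        new_id = n_rows + m
        parent[find(i)] = new_id
        parent[find(j)] = new_id

    # Number the remaining clusters 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(r), len(labels)) for r in range(n_rows)]


# Hypothetical dendrogram over 4 rows: rows 0 and 1 merge first,
# then rows 2 and 3, then the two resulting clusters.
merges = [(0, 1), (2, 3), (4, 5)]
two = cut_dendrogram(4, merges, 2)    # [0, 0, 1, 1]
three = cut_dendrogram(4, merges, 3)  # [0, 0, 1, 2]
```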
Concerning the speed, have you tried the “Cache distances” option? Or, like gabor suggested, try out the new hierarchical clustering that works with distance matrices from http://labs.knime.org/.
UPGMA stands for Unweighted Pair-Group Method with Arithmetic mean.
Assume that there are three clusters called C1, C2 and C3, containing n1, n2 and n3 records respectively. Clusters C2 and C3 are aggregated to form a new single cluster called C4.
The similarity between cluster C1 and the new cluster C4 in the example above is calculated as
sim(C1, C4) = a * sim(C1, C2) + b * sim(C1, C3)
where
sim = the similarity between the two indexed clusters
a = n2 / (n2 + n3)
b = n3 / (n2 + n3)
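As a small sketch of the update rule above in Python (the function name `upgma_update` and the numbers are made up for illustration; this is not how any KNIME node is implemented):

```python
def upgma_update(sim_c1_c2, sim_c1_c3, n2, n3):
    """Similarity between C1 and the merged cluster C4 = C2 + C3.

    The weights a and b are the shares of C4's records that came from
    C2 and C3, so every record in C4 counts equally.
    """
    a = n2 / (n2 + n3)
    b = n3 / (n2 + n3)
    return a * sim_c1_c2 + b * sim_c1_c3


# Hypothetical values: C2 has 3 records, C3 has 1 record.
sim_c1_c4 = upgma_update(sim_c1_c2=0.5, sim_c1_c3=0.25, n2=3, n3=1)
print(sim_c1_c4)  # 0.75 * 0.5 + 0.25 * 0.25 = 0.4375
```

Note that n1 does not appear in the update; only the sizes of the two clusters being merged determine the weights.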