cluster analysis on large data sets

ryanmays · February 21, 2012, 10:44am

hey there!

i am currently working on a university project on the clustering of materials. i have already clustered my test data with knime and it worked quite well. but now that i have a real data set with about 16,000 elements, the hierarchical methods fail because the maximum heap space is reached. i tried to increase the heap space manually, but that only worked up to 2048 MB.

so is there any way to increase the heap space further and run the analysis on the large data set?

if yes, do you know whether it's possible to run a scree test on dendrograms in knime? i couldn't find any suitable nodes.

one last question on k-means clustering: am i right that it isn't possible to set the initial cluster centres manually?

big thanks for your support!

best regards

ryan

thor · February 21, 2012, 12:22pm

If you have a 64-bit system you should be able to increase the heap space much further. On 32-bit systems the maximum is about 1.5 GB. What prevented you from increasing it above 2 GB?

ryanmays · February 21, 2012, 2:25pm

thanks for your answer!

i just increased the heap space to the maximum (4 GB) but still have the same performance problems.

is there any other way to boost the performance?

ryan

thor · February 22, 2012, 9:52am

Hierarchical clustering itself is not really suited to data sets of this size, not only because of memory but also because of run time, since the complexity is O(n³). It took me more than a day to cluster about 10,000 elements some time ago.
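
To put rough numbers on this, here is a small back-of-the-envelope sketch in Python. It is not how KNIME does it internally. It assumes one 8-byte double per pairwise distance, stored once per pair, and pure cubic scaling of the run time. The function names are made up for illustration.

```python
def pairwise_distance_bytes(n, bytes_per_value=8):
    """Bytes needed to hold one distance per unordered pair of n elements.

    Assumption: each distance is a single 8-byte double and each pair is
    stored only once. A real implementation may need considerably more.
    """
    return n * (n - 1) // 2 * bytes_per_value


def cubic_time_ratio(n_new, n_ref):
    """How much longer n_new elements take than n_ref under pure n^3 scaling."""
    return (n_new / n_ref) ** 3


if __name__ == "__main__":
    n = 16000
    gib = pairwise_distance_bytes(n) / 1024**3
    print(f"distances for {n} elements: about {gib:.2f} GiB")
    print(f"time relative to 10000 elements: about {cubic_time_ratio(n, 10000):.1f}x")
```

Under these assumptions the distances alone for 16,000 elements take close to 1 GiB, and the run time would be roughly four times what 10,000 elements needed. So more heap space may get the job to start, but it does not make it fast.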