50 million rows by 50 attributes with k-means clustering

mgriffiths · June 4, 2015, 10:57am

Dear List,

I am new to KNIME, so please forgive my question if it has been asked before. I have a data set of approx. 50 million rows with up to 50 attributes, and I would like to run a simple k-means clustering on it. I have two questions:

1) Does KNIME support multi-threading and parallelisation?

2) What would the run-time be, with and without parallelisation, for k-means on a data set of this size?

Many thanks for your time and help

gabriel · July 9, 2015, 3:02pm

Thanks for your questions:

1) Yes, most nodes make use of multiple cores and execute multithreaded where possible. You might also want to check out the Parallel Chunk Loops to parallelize your workflow execution.

2) Yes, it would work, but it takes some time: probably around 10-15 minutes per iteration for 50 million rows on standard hardware. You can easily test this yourself: use the Data Generator node to create 50 million rows with 50 columns/universes, then run the k-Means node on the result.
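
For a rough sense of scale outside KNIME, here is a minimal Python sketch (standard library only, not KNIME code and not how the k-Means node is implemented). It estimates the raw table size and the work in one k-means iteration, and times a single plain Lloyd iteration on a small random sample. The function names, the choice of k = 10, the 8 bytes per value, and the 2,000-row sample are all assumptions made for illustration.

```python
import random
import time


def estimate_scale(n_rows, n_cols, k, bytes_per_value=8):
    """Back-of-envelope size and work estimate for one k-means iteration."""
    return {
        # Raw size of the numeric table, assuming 8-byte doubles.
        "table_gb": n_rows * n_cols * bytes_per_value / 1e9,
        # Squared-difference terms evaluated in one assignment step.
        "distance_terms": n_rows * n_cols * k,
    }


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_iteration(rows, centroids):
    """One Lloyd iteration: assign rows to the nearest centroid, then average."""
    k = len(centroids)
    n_cols = len(centroids[0])
    sums = [[0.0] * n_cols for _ in range(k)]
    counts = [0] * k
    for row in rows:
        best = min(range(k), key=lambda c: squared_distance(row, centroids[c]))
        counts[best] += 1
        for j, value in enumerate(row):
            sums[best][j] += value
    # A cluster that received no rows keeps its previous centroid.
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else list(centroids[c])
        for c in range(k)
    ]


def time_one_iteration(n_rows, n_cols, k, seed=0):
    """Time one iteration on uniformly random data (hypothetical sample)."""
    rng = random.Random(seed)
    rows = [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]
    centroids = [list(row) for row in rows[:k]]
    start = time.perf_counter()
    kmeans_iteration(rows, centroids)
    return time.perf_counter() - start


print(estimate_scale(50_000_000, 50, k=10))
sample_seconds = time_one_iteration(n_rows=2_000, n_cols=50, k=10)
print(f"one iteration on 2,000 sample rows: {sample_seconds:.3f} s")
```

With these assumed numbers the estimate comes to about 20 GB of raw values and 25 billion squared-difference terms per iteration. The measured time only describes this unoptimized Python sketch on your machine; the work per iteration grows linearly with the number of rows, but the absolute figure says nothing about KNIME's own run-time, so the Data Generator test above remains the better benchmark.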