Is there a limit on the amount of data the k-means node can handle?

Hi,

I want to use the k-means node to make a cluster analysis on large amounts of data (millions of rows and maximum 20 columns).

I’m wondering if anyone knows whether there is a limit on how much data the k-means node can handle?

Also what would you recommend in relation to hardware (RAM, CPU, SSD etc.).?

Hi,
I haven’t heard about a limit but you can just play with your dataset. Just use the “row sampler” node and test.


Hi @zrd301 ,

I understand that the limit is set by your resources (CPU, RAM…).

As a rule of thumb, give KNIME about 2 GB of RAM, or more if possible, but not more than 50% of what the machine has; in practice around 25% is often enough.

For CPU, it depends on the number of cores you have. A CPU with 12 cores can use all of them, but watch KNIME's heap usage and set the memory limit and thread count accordingly, not higher than the machine can spare. If it is a dedicated server/PC, pay attention so you don't hurt the system's stability and performance. Some nodes also let you choose between keeping tables in memory OR writing them to disk (temporary files). With big files and heavy processing, I normally switch from memory to disk caching so the hard work runs through without errors or bad results.
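For reference, the overall heap that this reply describes is set in the `knime.ini` file in the KNIME installation folder via the `-Xmx` line. The value below is only an example; choose a size that stays well under your machine's total RAM:

```
-Xmx8g
```

After editing `knime.ini`, restart KNIME for the new heap size to take effect.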

Did that make it clear? I hope so… If you need anything else, just reply here, ok?

Thanks, Denis

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.