k-means performance issues

Hi,

I am using the k-means node for clustering text documents and have severe performance issues.

The input matrix for k-means contains TF-IDF values for 1977 documents, each represented by a vector of 15588 dimensions.

My 64-bit Ubuntu system has 8 GB of RAM and an Intel Core i5 M 460 @ 2.53GHz x 4.

After 24 hours of computation the node has still not finished. Where could the problem be? In the input node I set the memory policy to "keep all in memory", but that made no difference.

 

Thanks in advance!

Tim

Hi Tim,

The runtime of k-means is proportional to t*k*n*d, where t is the number of iterations, k the number of clusters, n the number of data points and d the number of dimensions. With 1977 documents and 15588 dimensions, this will take quite some time.
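To get a feel for the numbers, here is a rough back-of-the-envelope sketch in Python. The values for k and t are made up for illustration (you did not say which ones you use), and the function name is just something I chose. It only counts coordinate-level operations and says nothing about how fast the node performs each one.

```python
def kmeans_cost(t, k, n, d):
    """Rough count of coordinate-level operations for standard k-means."""
    return t * k * n * d


n_docs = 1977     # number of documents (from the question)
n_terms = 15588   # vector size (from the question)
k = 10            # hypothetical number of clusters
t = 100           # hypothetical number of iterations

full = kmeans_cost(t, k, n_docs, n_terms)
reduced = kmeans_cost(t, k, n_docs, 2000)  # hypothetical reduced vocabulary

print(f"full vocabulary:    {full:,} operations")
print(f"reduced vocabulary: {reduced:,} operations")
print(f"speed-up factor:    {full / reduced:.1f}")
```

Since the cost is linear in d, shrinking the vector size pays off directly.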

As you are clustering text, I would recommend decreasing the vector size beforehand. Do you have word vectors? Then remove all terms that occur in only a single document. You can also filter by POS tag, since nouns, adjectives and verbs carry the most information.
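Outside of the workflow, the idea of dropping terms that occur in only one document can be sketched in a few lines of Python. This is a minimal illustration, assuming the documents are already tokenized; the function name, the min_df parameter and the toy documents are my own and not part of any node.

```python
from collections import Counter


def drop_rare_terms(docs, min_df=2):
    """Keep only terms that appear in at least min_df documents.

    docs is a list of token lists. Returns the filtered documents
    and the sorted remaining vocabulary.
    """
    doc_freq = Counter()
    for tokens in docs:
        # Count each term once per document, not once per occurrence.
        doc_freq.update(set(tokens))
    keep = {term for term, count in doc_freq.items() if count >= min_df}
    filtered = [[tok for tok in tokens if tok in keep] for tokens in docs]
    return filtered, sorted(keep)


# Toy example (hypothetical data).
docs = [
    ["cluster", "text", "kmeans"],
    ["text", "vector", "cluster"],
    ["runtime", "text"],
]
filtered, vocab = drop_rare_terms(docs)
print(vocab)  # ['cluster', 'text']
```

In real text collections a large share of the vocabulary typically appears in just one document, so this filter alone can reduce d considerably before you build the TF-IDF vectors.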

More generally, the Euclidean distance suffers from the curse of dimensionality in such high-dimensional spaces. If you want to know more about it, I recently read a pretty nice article on the topic (see Section 6): http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
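If you want to see the effect yourself, here is a small Python sketch. It uses uniformly random points, not TF-IDF vectors, so it is only an illustration of the general phenomenon, and the function name and parameters are my own. It measures how much farther the farthest point is from a query than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from a random query
    point to n_points random points in the unit cube of dimension dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


for dim in (2, 10, 100, 1000):
    print(f"d = {dim:5d}: relative contrast = {relative_contrast(dim):.3f}")
```

As the dimension grows, the contrast shrinks: nearest and farthest points end up at almost the same distance, which makes it hard for a distance-based method like k-means to separate clusters.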

Best, Iris

Hi Iris,

thanks a lot for your input! It makes the problem much clearer to me.

Greetings

Tim