Distance Matrix for Hierarchical Clustering 100K docs (tweets)

Hi there,

Need a little help/tip. I'm working with ~100K tweets as documents and created a Word Vector. I already reduced the volume of words used to about 1.5K, and I ran the Distance Matrix Calculate  over 5K tweets as a test and it worked pretty well. But for the whole data set (100K tweets), it's very, VERY slow (I also don't have much memory, unfortunately). :)

Any tips on how to reduce the time taken for the distance matrix calculation?

Thanks!

Gustavo

Hi Gustavo,

the Distance Matrix Calculate computes all pairwise distances and thus the runtime is quadratic. For 100k records 100k^2 / 2 operations need to be computed which is a lot. The node will nicely parallelize all operations meaning that multi cores speed up the runtime.

What exactly are you planning to do with the distances?

Cheers, Kilian