Distance Matrix for Hierarchical Clustering 100K docs (tweets)

gustavo.velho · June 22, 2017, 7:59pm

Hi there,

I need a little help or a tip. I'm working with ~100K tweets as documents and created a Word Vector. I've already reduced the vocabulary to about 1.5K words, and as a test I ran the Distance Matrix Calculate node over 5K tweets, which worked pretty well. For the whole data set (100K tweets), though, it's very, VERY slow (and I don't have much memory, unfortunately). :)

Any tips on how to reduce the time taken for the distance matrix calculation?

Thanks!

Gustavo

kilian.thiel · June 29, 2017, 7:58pm

Hi Gustavo,

The Distance Matrix Calculate node computes all pairwise distances, so its runtime is quadratic in the number of records. For 100k records that means about 100k^2 / 2, i.e. roughly 5 billion, distance computations, which is a lot. The node parallelizes these computations nicely, so multiple cores will speed up the runtime.
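
To put rough numbers on the quadratic growth, here is a minimal Python sketch of the arithmetic. It is not part of the KNIME node: the function name `pairwise_cost` is made up for illustration, and the 8 bytes per stored distance (one double-precision value) is an assumption about how the distances would be held in memory.

```python
def pairwise_cost(n_docs: int, bytes_per_distance: int = 8) -> tuple[int, float]:
    """Estimate the size of a full pairwise distance computation.

    Returns the number of unique unordered pairs and the memory (in GiB)
    needed to store one distance per pair.
    bytes_per_distance=8 is an assumption (one 64-bit float per distance).
    """
    # Each unordered pair (i, j) with i < j is computed once: n * (n - 1) / 2
    n_pairs = n_docs * (n_docs - 1) // 2
    mem_gib = n_pairs * bytes_per_distance / 2**30
    return n_pairs, mem_gib


if __name__ == "__main__":
    # Compare the 5K test run with the full 100K data set
    for n in (5_000, 100_000):
        pairs, gib = pairwise_cost(n)
        print(f"{n:>7,} docs -> {pairs:>13,} pairs, ~{gib:.1f} GiB")
```

Under these assumptions, going from 5K to 100K tweets multiplies the number of pairs by about 400, and keeping every distance in memory would need tens of GiB, which matches the slowdown and memory pressure described above.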

What exactly are you planning to do with the distances?

Cheers, Kilian
