I am trying to apply clustering to geo coordinates that represent addresses, using the k-medoids node. The problem is that if I use a .csv file with more than 20k rows, the execution of that node seems stuck at 0%, even after several hours.
I have already tried reducing the data volume, but the node's execution completes successfully only if I put in something like 100-1000 rows as input. If I use a much larger number of rows, the execution stays at zero. I have also tried adjusting the partition number and the chunk size, but without results.
Any suggestions? I can't believe that with 1 million rows of address data I simply cannot apply clustering for that reason. That would seem senseless, I suppose, because in real life data volumes are always very high. If these nodes only work with very low data volumes, they are quite unnecessary.
Maybe you should also post this question to the Palladian Forum: https://www.knime.com/forum/palladian-selenium
This might be related to the Geo Distances node, so it is better to ask them directly.
Computing distances for 20k rows (i.e. 20k^2 pairs) is obviously more expensive than for 1000^2. That's just complexity theory in practice :)
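To make the quadratic blow-up concrete, here is a small sketch (not KNIME-specific) counting the pairwise distances involved; going from 1k to 20k rows multiplies the work by roughly 400:

```java
public class PairCount {

    // Number of unordered pairs among n rows: n * (n - 1) / 2
    static long pairs(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(pairs(1_000));                 // 499500
        System.out.println(pairs(20_000));                // 199990000
        System.out.println(pairs(20_000) / pairs(1_000)); // 400 -- ~400x more work
    }
}
```

So a run that takes a minute on 1k rows can plausibly take the better part of a day on 20k rows, before considering memory.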
As you already stated, the nodes work with reasonable amounts of data, and they would probably also work with larger amounts, provided you simply wait long enough for 20k^2 calculations to complete (and your machine has enough resources). Calling our nodes "quite unnecessary" sounds a bit harsh to me -- but never mind.
To give some constructive input: there's a computationally "cheaper" calculation of the Haversine distance which uses fewer trigonometric functions (at the cost of being less exact, which shouldn't matter for clustering, though).
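For reference, the standard Haversine formula looks like this as a plain-Java sketch (the method name and the Earth-radius constant are mine, not taken from the Geo Distances node); note it needs several trigonometric calls per distance, which is exactly what the cheaper variant cuts down:

```java
public class Haversine {

    static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius (assumption)

    // Standard Haversine: great-circle distance in km between two
    // latitude/longitude points given in degrees.
    static double haversineKm(double lat1, double lng1, double lat2, double lng2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLng = Math.toRadians(lng2 - lng1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLng / 2) * Math.sin(dLng / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Berlin to Munich: roughly 500 km
        System.out.println(haversineKm(52.52, 13.405, 48.137, 11.575));
    }
}
```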
Thank you qqilihq for your reply,
probably the words "quite unnecessary" were an exaggeration, but after several days without finding a concrete solution I assumed they weren't useful...
considering that I want to process something like 1 million rows of coordinates, what exactly do you suggest I do?
In my opinion, the k-medoids node cannot handle this huge volume of data. I have also tried other clustering nodes instead of k-medoids, for example the DBSCAN node, but the waiting time becomes really unsustainable. And I'm not talking about 30-40 hours of waiting... I think we're talking about weeks or months.
Regarding the cheaper calculation of the Haversine distance: where can I find it?
apologies, I thought we also had the "simplified" Haversine implemented in the Geo Distances node, but in fact we currently don't. Adding it wouldn't be a big thing in general, as we already have it in the Palladian library, but I currently have no free time.
You could however implement it yourself using the "Java Distance" node and a small code snippet. Google will give you ready-to-use code snippets for the "simplified Haversine" calculation. However, don't expect this to speed up your calculation by orders of magnitude -- rather by a factor of maybe 5-10x (rough estimate).
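The thread doesn't say which "simplified" variant Palladian uses; one common candidate found in such snippets is the equirectangular approximation, which replaces most of the trigonometry with a single cosine. A sketch (class and method names are mine; accurate enough at city scale, increasingly wrong over long distances):

```java
public class FastGeoDistance {

    static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius (assumption)

    // Equirectangular approximation: project both points onto a flat plane,
    // scaling the longitude delta by cos(mean latitude), then apply Pythagoras.
    // One cos + one sqrt instead of several sin/cos/asin calls per distance.
    static double approxDistanceKm(double lat1, double lng1, double lat2, double lng2) {
        double x = Math.toRadians(lng2 - lng1)
                 * Math.cos(Math.toRadians((lat1 + lat2) / 2));
        double y = Math.toRadians(lat2 - lat1);
        return EARTH_RADIUS_KM * Math.sqrt(x * x + y * y);
    }

    public static void main(String[] args) {
        // Berlin to Munich: should land close to the Haversine result (~500 km)
        System.out.println(approxDistanceKm(52.52, 13.405, 48.137, 11.575));
    }
}
```

Where only distance comparisons matter (e.g. assigning a point to its nearest medoid), the final `Math.sqrt` can even be skipped, since the square root is monotonic; keep it whenever actual sums of distances are needed.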
Maybe pre-calculating the distance matrix once and feeding it into the clustering nodes helps? I'm no real expert on the KNIME clustering nodes, though.
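The pre-computation idea can be sketched in plain Java (hypothetical helper names, with simple Euclidean distance standing in for whatever geo distance is used). The matrix is symmetric, so each pair is computed only once -- but note that memory grows quadratically, so a full matrix is only feasible up to a few tens of thousands of rows, not 1 million:

```java
public class PrecomputedMatrix {

    // Simple Euclidean distance as a stand-in for the real geo distance.
    static double dist(double[] p, double[] q) {
        double dx = p[0] - q[0], dy = p[1] - q[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    // Fill a symmetric n x n distance matrix, computing each pair once.
    // Memory is O(n^2): 20k rows already need ~3.2 GB of doubles.
    static double[][] distanceMatrix(double[][] points) {
        int n = points.length;
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double dij = dist(points[i], points[j]);
                d[i][j] = dij;
                d[j][i] = dij; // symmetry: no second computation needed
            }
        }
        return d;
    }

    public static void main(String[] args) {
        double[][] pts = { {0, 0}, {3, 4}, {0, 8} };
        double[][] d = distanceMatrix(pts);
        System.out.println(d[0][1]); // 5.0
        System.out.println(d[1][2]); // 5.0
        System.out.println(d[2][0]); // 8.0
    }
}
```

The payoff is that iterative algorithms like k-medoids, which revisit the same distances many times per iteration, then only pay the quadratic cost once up front.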