I encountered an error while I was trying to optimise the number of clusters provided by the Hierarchical Clustering (DistMatrix) node using the Silhouette Coefficient and a Parameter Optimization Loop.
Errors loading flow variables into node : Distance threshold must be between 0 and 1
As can be seen in the attached example workflow, it seems that the Hierarchical Cluster Assigner cannot be set to assign clusters based on a distance threshold > 1. However, in this case I’m using a cosine distance that, as also reported in the Numeric Distance node description, can take values in the range [0,2].
Please, can anybody say if this is a bug or I am missing something?
After going through your workflow, I understand now why it is not working. In fact there is a bit of incompatibility in the way the different nodes you are using have been implemented.
Usually, cosine similarity is bounded within [-1, 1]. However, the -Numeric Distance- node defines the cosine distance as 1 - cosine similarity, which shifts it into the [0, 2] range, as described in the node help:
The Cosine distance is a measure of orientation and not magnitude: two vectors with the same orientation have a Cosine similarity of 1, two vectors at 90° have a similarity of 0. Two vectors diametrically opposed have a similarity of -1. This implementation of a distance is 1 - cosine-similarity and can take the domain [0, 2].
Moreover, the -Hierarchical Cluster Assigner- node does not follow the same rule and expects distances within the [0, 1] range, as usual. This is explained in the node help:
Assigns clusters to rows based on an hierarchical clustering. You may either select a fixed number of cluster or enter a distance threshold. If the latter is used then all clusters in the dendrogram will be used that have the smallest distance to the threshold but are below it. The threshold is given as a normalized distance between 0 and 1. All distances are normalized based on the maximum distance.
Note that the assigner only assigns the same data that has been used for clustering to clusters. It is not capable of assigning unseen data.
This is why it doesn’t work. The solution is to renormalize the distance bounds in your optimization loop to the [0, 1] range so that the two nodes become compatible.
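A minimal sketch of one way to read "renormalize the distance bounds", assuming (as the node help quoted above suggests) that the assigner compares the threshold against distances divided by the maximum distance. The function name `normalize_threshold` and the example values are hypothetical, and this has not been checked against the actual KNIME implementation:

```python
def normalize_threshold(raw_threshold, max_distance):
    """Map a raw distance threshold onto [0, 1] by dividing by the maximum distance.

    Assumption: the consumer of the threshold normalizes all distances
    by the same maximum distance.
    """
    if max_distance <= 0:
        raise ValueError("max_distance must be positive")
    normalized = raw_threshold / max_distance
    if not 0.0 <= normalized <= 1.0:
        raise ValueError("threshold lies outside the range of observed distances")
    return normalized


# Hypothetical values: a raw cosine-distance threshold of 1.2,
# with 2.0 as the largest possible cosine distance.
loop_value = normalize_threshold(1.2, 2.0)
```

In a Parameter Optimization Loop this would amount to searching over values in [0, 1] and interpreting them as fractions of the maximum distance.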
Hi @aworker,
Thanks for your answer. I report here the description for cosine distance present in the Numeric Distances node:
Cosine Distance
Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. This distance function is 1 - cosine-similarity and can take range [0,2].
I may be wrong but I think the [0,2] range is because two proportional vectors have a cosine similarity of 1, two orthogonal vectors have a similarity of 0, and two opposite vectors have a similarity of -1.
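To make the [0,2] range concrete, here is a small pure-Python sketch of the "1 - cosine similarity" definition quoted above. The helper name `cosine_distance` and the example vectors are only illustrative, not anything taken from KNIME:

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


proportional = cosine_distance([1.0, 2.0], [2.0, 4.0])   # same orientation, close to 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])     # 90 degrees apart, equals 1
opposite = cosine_distance([1.0, 2.0], [-1.0, -2.0])     # diametrically opposed, close to 2
```

Any pair of vectors more than 90° apart gives a distance above 1, which would be consistent with roughly half of the distances between random points exceeding 1.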
In any case, in the example workflow I uploaded you will see that roughly half of the distances for a set of randomly generated points (generated random variables) are > 1.
I hope somebody else can comment on this to clarify it.
@aworker, thank you for your suggestion.
A couple of points:
I agree with you. It would be good to have the opinion of a KNIME developer on this to understand if there is really something strange or we are missing something else.
How should I renormalise the distance bound? I pass the following two inputs to the Hierarchical Cluster Assigner node:
Input matrix: does not contain distances, only the points and their coordinates.
Cluster tree port: I don’t know how to modify this. Is it possible?
Please, can you be more explicit when you say “The solution is to renormalize the distance bounds”?
Thank you very much in advance.
Best,
Gio
@aworker, thank you very much for your help and suggestion.
Unfortunately, I’m still not convinced that this would be a good idea. My doubt: does it make sense to use an artificially halved maximum distance just to avoid passing a distance threshold > 1 to the Hierarchical Cluster Assigner node (and so prevent it from failing), when in reality roughly half of the distances in the distance matrix have a value > 1?
Yes, mathematically speaking, it is absolutely fine to rescale a distance metric by a positive constant factor. In general, one can normalize the values of an angular distance metric provided that doing so doesn’t affect the order of the distances, which is the case here.
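As a quick illustration of the order-preservation argument, the sketch below halves a handful of made-up cosine distances and checks that their ranking is unchanged. The values and the helper `ranking` are hypothetical:

```python
# Made-up cosine distances in [0, 2]; not taken from the workflow.
distances = [0.36, 1.8, 0.02, 1.0, 1.41]

# Dividing by the maximum possible value (2) maps them into [0, 1].
halved = [d / 2.0 for d in distances]


def ranking(values):
    """Return the indices of `values` ordered from smallest to largest value."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Scaling by a positive constant leaves the ordering untouched.
same_order = ranking(distances) == ranking(halved)
```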
OK @aworker, but if I “normalise” the maximum distance by halving it, shouldn’t I also normalise all the pairwise distances in the distance matrix in the same way? Otherwise the comparison won’t make sense. Unfortunately, I’m afraid it’s not possible to manipulate the distance matrix calculated by the Hierarchical Clustering node. Am I wrong?
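The concern can be checked on a toy example. The sketch below cuts a tiny, made-up distance matrix at a threshold using a single-linkage style grouping (points end up together if they are linked by a chain of distances at or below the threshold). This is an assumption for illustration only: the linkage used in the workflow may differ, and `threshold_clusters` is a hypothetical helper, not a KNIME function.

```python
def threshold_clusters(dist, threshold):
    """Group points connected by pairwise distances <= threshold (single-linkage cut)."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up cosine distances between four points.
dist = [
    [0.0, 0.2, 1.4, 1.6],
    [0.2, 0.0, 1.2, 1.8],
    [1.4, 1.2, 0.0, 0.3],
    [1.6, 1.8, 0.3, 0.0],
]
threshold = 1.3
scaled = [[d / 2.0 for d in row] for row in dist]

original = threshold_clusters(dist, threshold)              # raw matrix, raw threshold
both_scaled = threshold_clusters(scaled, threshold / 2.0)   # matrix and threshold halved
threshold_only = threshold_clusters(dist, threshold / 2.0)  # only the threshold halved
```

Halving both the matrix and the threshold reproduces the original grouping, while halving only the threshold splits the points into two clusters, so the two must be rescaled together for the comparison to make sense.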
You did well to insist. You are right, and I see now why this didn’t make sense to you. I was biased by the math theory, but the way this has been implemented in KNIME goes beyond the maths logic lol
Renormalizing is not enough, given that the distance matrix cannot be renormalized. This makes the -Numeric Distance- node and the -Hierarchical Cluster Assigner (local)- node incompatible.
Since renormalization cannot be applied to the distance matrix too before using it, I searched for other solutions and fortunately, KNIME has a second node called -Hierarchical Cluster Assigner- (not the local one) which does not impose a maximum threshold when using the Cosine Distance Matrix (bound within the [0…2] range by the -Numeric Distance- node). So this -Hierarchical Cluster Assigner- node should solve the problem.
I have modified your workflow to integrate it. Please have a look at it and let me know if this is fine with you
The resulting optimal threshold is 0.36, and this is what is displayed in the output window of the new -Hierarchical Cluster Assigner- node.
@aworker, thanks for your patience and efforts! I appreciate it.
It’s great that you found a solution, but unfortunately I don’t have that node in my installation (I would definitely have tried it if I had seen it). It’s probably just a matter of extension versions. I’m running the latest version (KNIME 4.6.1). Can I ask where you found the Hierarchical Cluster Assigner (not the local one) node?
Thanks in advance!
Thank you very much @aworker!
I was able to install the JavaScript Views (Labs) extension, find the node, and replace it in my workflow. Now it seems to work as expected!
I hope this thread can be useful also to other people facing the same problem with the Hierarchical Cluster Assigner (local) node.
Have a great day!