Similarity Search -> Unique output only?


I am trying to use the Similarity Search node to return a unique list of neighbours; however by default, the returned output list provides the nearest neighbour irrespective of whether this has been selected as a neighbour for another search item (molecule).

I appreciate there is a “Neighbor Count” option, but I’m not sure of an easy way to use this to ensure that all neighbours are unique.

Any help is much appreciated,

Hi @srussell

A possible way of answering this question is as follows:

Let's say that one has two lists of samples, each with its own ID. Let's call the first ones the reference samples and the second ones the target samples.

  1. In the Similarity Search node, we set the options to return a single nearest neighbour (NN). I’m assuming here that we are calculating a Euclidean distance.
  1. Then, we sort the results using the Row Sorter node, first by target sample ID and then by distance, from nearest to farthest.
  1. From here, we group the rows by target sample ID using the GroupBy node, and we choose to aggregate the ID and the distance with First as the aggregation operation.

At this point one should have the unique ID list of the target samples with their nearest reference samples.
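For readers who think more easily in code, the sort-then-GroupBy (First) steps above can be sketched in plain Python. This is not KNIME code: the row layout `(target_id, reference_id, distance)`, the sample values and the function name `first_per_target` are all assumptions made for illustration.

```python
# Hypothetical neighbour table: (target_id, reference_id, distance).
rows = [
    ("T1", "R1", 0.40),
    ("T2", "R2", 0.10),
    ("T1", "R2", 0.25),
    ("T3", "R3", 0.30),
]


def first_per_target(rows):
    """Sort by target ID, then distance, and keep the first row per target."""
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    best = {}
    for target_id, reference_id, distance in ordered:
        # setdefault keeps only the first (nearest) row seen for each target.
        best.setdefault(target_id, (reference_id, distance))
    return best


result = first_per_target(rows)
```

In this toy data each target ID comes out exactly once, but `T1` and `T2` both end up paired with `R2`: taking the first row per group does not by itself stop a reference from being reused.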

If one eventually needs them sorted from nearest to farthest, one just needs to sort them by distance using an extra Sorter node.

Hope this quick answer helps. Otherwise, please let us know and I’ll be happy to share a small workflow with the solution.



PS: If you need to calculate the NN distance within a single list of samples (intra-distance, excluding self-distance) instead of between two different sets of samples (inter-distance), then the solution is slightly different but easy too :wink:


Hi Ael,

Thanks for the response, this is nearly identical to how I was attempting it - however, when doing the “GroupBy” for the target samples, the “First” aggregation operation can select the same (non-unique) reference samples as others in the search list - so I have the same problem again. The GroupBy by itself does not seem to be enough to ensure that the output contains only a unique list of reference samples, as well as a unique list of similar target samples.

I wondered if there would be some way to select the “most similar” neighbours, then create another pass (via a loop?) where the reference samples have been removed, so that they cannot be chosen again - however this might need a lot of loop cycles and I’m not sure how that could be automated.
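The "pick the most similar, remove it, repeat" idea can be sketched as a greedy one-to-one matching in plain Python. This is only an illustration, not a KNIME workflow: the function name `greedy_unique_pairs`, the tuple layout and the sample IDs are assumptions. It also assumes the pair table holds more than one candidate per query (for example by raising the Neighbor Count), so that a query has a fallback when its nearest neighbour is already taken. A greedy pass like this does not necessarily give the pairing with the smallest total distance.

```python
def greedy_unique_pairs(pairs):
    """Greedy one-to-one matching over (query_id, neighbour_id, distance) tuples."""
    used_queries, used_neighbours = set(), set()
    matches = []
    # Visiting pairs from nearest to farthest replaces the explicit loop:
    # once an ID is matched, later pairs that reuse it are skipped.
    for query_id, neighbour_id, distance in sorted(pairs, key=lambda p: p[2]):
        if query_id in used_queries or neighbour_id in used_neighbours:
            continue
        used_queries.add(query_id)
        used_neighbours.add(neighbour_id)
        matches.append((query_id, neighbour_id, distance))
    return matches


# Hypothetical candidate pairs, two neighbours per query.
pairs = [
    ("Q1", "N1", 0.10),
    ("Q2", "N1", 0.15),
    ("Q2", "N2", 0.30),
    ("Q1", "N2", 0.50),
]
matches = greedy_unique_pairs(pairs)
```

Because the pairs are sorted once up front, a single pass is enough, so no repeated loop cycles are needed in this sketch.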


Hi @srussell

I understand the problem and I was expecting the question :wink:

There are two possible ways of looking for a unique list of nearest neighbours, depending on which of the two lists is treated as the “reference” and which as the “target”. Depending on this, the results can differ. The list chosen as reference (1st input port of the Similarity Search node) is the one that always appears complete in the results. To achieve what you want, you would need to swap the two lists in the Similarity Search node, which would make the second list the “Reference” and the first list the “Target”. Is that what you are looking for?
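To see why the direction of the search matters, here is a small plain-Python sketch with made-up one-dimensional values. The helper `nearest` and both lists are hypothetical, and which list acts as the query here is an assumption that does not claim to mirror the node's port order.

```python
def nearest(queries, candidates):
    """For each query value, return the closest candidate (absolute difference)."""
    return {q: min(candidates, key=lambda c: abs(c - q)) for q in queries}


list_a = [10, 12, 50]
list_b = [13, 40]

a_to_b = nearest(list_a, list_b)  # one entry per item of list_a
b_to_a = nearest(list_b, list_a)  # one entry per item of list_b
```

With this toy data, `a_to_b` reuses 13 for two different queries, while `b_to_a` happens to return distinct neighbours. In general only the query side is guaranteed to appear once each, which is why swapping the inputs changes which list comes out complete.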

Hope this helps.




Thanks very much Ael - switching the inputs has fixed the issue I was having!


Hi @srussell

I’m glad it worked :smile: :+1: !

Enjoy KNIME & best wishes,

