How to use 'Similarity Search'?


I use the 'Similarity Search' node to look for duplicate bit vectors.

I use a column with fingerprints as input and the same column as Representative column.

When I look at the output table, the nearest neighbor index column should show an increasing index from 0 or 1 to N (the number of molecules/bit vectors), because the Nth molecule is identical to the Nth molecule. However, it alternates between 0 and 1.

using Tanimoto distance function

coefficient type similarity

neighbor selection nearest

I tried different options here, but not with the desired result.

Is this a bug, or am I using the node the wrong way?


And what does 'Neighbor Count' do?

I typically use the GroupBy node to resolve duplicates. It will be much more performant than using similarity search, and it lets you resolve the duplicates in any of a number of ways using the available aggregation options.
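To illustrate the idea outside of KNIME, here is a minimal Python sketch of grouping rows by identical fingerprint. It is not the GroupBy node's implementation; the row IDs, the bit strings, and the function name `group_duplicates` are made up for the example.

```python
from collections import defaultdict

# Hypothetical data: row IDs mapped to fingerprints written as bit strings.
fingerprints = {
    "Row0": "101100",
    "Row1": "010011",
    "Row2": "101100",
    "Row3": "111000",
    "Row4": "010011",
}


def group_duplicates(fps):
    """Group row IDs by identical fingerprint, keeping first-seen order."""
    groups = defaultdict(list)
    for row_id, bits in fps.items():
        groups[bits].append(row_id)
    return dict(groups)


groups = group_duplicates(fingerprints)

# Keep the first row of each group, similar in spirit to a "First" aggregation.
unique_rows = [rows[0] for rows in groups.values()]

# Fingerprints that occur more than once, with the rows that share them.
duplicate_groups = {bits: rows for bits, rows in groups.items() if len(rows) > 1}

print(unique_rows)
print(duplicate_groups)
```

Grouping needs one pass over the rows with a hash lookup each, whereas an all-against-all similarity search compares every pair, which is why grouping tends to be faster for exact duplicates.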

Regarding the similarity search node, the Neighbor Count option will limit the number of returned neighbors.  For example, if this is set to 2, and 3 neighbors are found within the given range, only the two closest will be returned. 
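The effect of such a limit can be sketched in a few lines of Python. This is only an illustration under my own assumptions (fingerprints as sets of on-bit positions, a minimum-similarity threshold as the "range", ties broken by lowest index); the names `tanimoto`, `nearest_neighbors`, `min_similarity` and `neighbor_count` are hypothetical and not the node's actual parameters or code.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def nearest_neighbors(query, references, min_similarity, neighbor_count):
    """Return indices of the closest references within range, capped at neighbor_count."""
    scored = [(tanimoto(query, ref), idx) for idx, ref in enumerate(references)]
    in_range = [(sim, idx) for sim, idx in scored if sim >= min_similarity]
    # Most similar first; ties are broken by the lower index (an assumption).
    in_range.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in in_range[:neighbor_count]]


references = [{0, 1, 2, 3}, {0, 1, 2}, {0, 1}, {7, 8}]
query = {0, 1, 2, 3}

# Three references fall within range, but only the two closest are returned.
hits = nearest_neighbors(query, references, min_similarity=0.4, neighbor_count=2)
print(hits)
```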


Ok, that works. But I am still confused about the IDs that the node returns. The Nth compound should be listed as the nearest neighbor of the Nth compound, since they are identical...
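The expectation can be written down as a small Python sketch: search a set of fingerprints against itself and report the index of the most similar entry. This is an assumption-laden illustration, not the node's code; the function names and the tie-breaking rule (the first best match wins) are my own choices, and whether the node breaks ties this way is not confirmed here.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def nearest_index(query, references):
    """Index of the most similar reference; on ties the lowest index is kept."""
    best_idx, best_sim = -1, -1.0
    for idx, ref in enumerate(references):
        sim = tanimoto(query, ref)
        if sim > best_sim:  # strict comparison, so later ties do not replace the hit
            best_idx, best_sim = idx, sim
    return best_idx


# All fingerprints distinct: every row finds itself.
distinct = [{0, 1}, {2, 3}, {0, 2, 4}]
self_hits = [nearest_index(fp, distinct) for fp in distinct]

# With exact duplicates, several rows tie at similarity 1.0,
# so the reported index depends on how ties are broken.
with_dupes = [{0, 1}, {2, 3}, {0, 1}, {2, 3}]
dupe_hits = [nearest_index(fp, with_dupes) for fp in with_dupes]

print(self_hits)
print(dupe_hits)
```

In this sketch a row only maps to its own index when no earlier row has an identical fingerprint, so a table containing duplicates need not produce a strictly increasing index column.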