New feature request: Tanimoto coefficient for continuous variables (byte vectors)

gcincilla · January 21, 2020, 4:35pm

Dear KNIMErs,

The Bit Vector Distances node contains an option (i.e. “Tversky distance (Tanimoto/Dice)”) with which, in combination with the Similarity Search node, one can perform a similarity search based on the Tanimoto coefficient. This type of search is especially important in the cheminformatics context.

Nevertheless, this option covers only dichotomous variables (i.e. a bit vector). The Tanimoto distance is also very important when you have continuous variables or count-based fingerprints (i.e. a byte vector). For this reason, I think it would be very beneficial for the KNIME community to be able to calculate the Tanimoto coefficient for byte vectors. Would this be possible?

Here is the formula:

Here, S denotes the similarity and x_jA is the j-th feature of molecule A. a is the number of on bits in molecule A, b is the number of on bits in molecule B, and c is the number of bits that are on in both molecules. In the figure, the formula for continuous variables is on the left and the formula for dichotomous variables is on the right.
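In case the figure does not display, the two forms can be written out as below. This is a transcription of the standard definitions using the symbols described above; the feature count n is a symbol added here for the summation limits.

```latex
% Continuous variables (e.g. count-based fingerprints)
\[
S_{A,B} = \frac{\sum_{j=1}^{n} x_{jA}\, x_{jB}}
               {\sum_{j=1}^{n} x_{jA}^{2} + \sum_{j=1}^{n} x_{jB}^{2} - \sum_{j=1}^{n} x_{jA}\, x_{jB}}
\]

% Dichotomous variables (bit vectors)
\[
S_{A,B} = \frac{c}{a + b - c}
\]
```

When every feature is 0 or 1, each squared term equals the term itself, so the sums reduce to a, b and c and the continuous form gives the same value as the dichotomous one.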

The formula has been defined, inter alia, in the following scientific publications:

Willett P. J. Chem. Inf. Comput. Sci. 1998, 38, 983-996
Bajusz D. et al. Journal of Cheminformatics (2015) 7:20

Thanks in advance for your answers!

daria.goldmann · January 24, 2020, 3:50pm

Hi @gcincilla,
Thank you for the suggestion, we’ll bring it to our development team.
In the meantime, there are two things I could recommend. You could use the Python snippet and implement this piece of the RDKit code: https://github.com/rdkit/rdkit/blob/master/Code/DataStructs/SparseIntVect.h#L511
Alternatively, you could compute Euclidean distances from byte vectors with the Byte Vector Distances node and then use these distances to generate the distance matrix for further clustering e.g…
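For the first option, a minimal sketch of what the Python side might look like is given below. It implements the continuous-variable Tanimoto coefficient as described in the original post (dot product divided by the two sums of squares minus the dot product). It is not a port of the linked RDKit code. The function name `tanimoto_continuous` and the choice to return 1.0 for two all-zero vectors are assumptions made for illustration. How the byte vector column reaches the script depends on the KNIME node used and is left out.

```python
from typing import Sequence


def tanimoto_continuous(a: Sequence[float], b: Sequence[float]) -> float:
    """Tanimoto coefficient for continuous or count-based feature vectors.

    Hypothetical helper: S = sum(a*b) / (sum(a^2) + sum(b^2) - sum(a*b)).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")

    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot

    # Assumption: two all-zero vectors are treated as identical.
    if denom == 0:
        return 1.0
    return dot / denom


if __name__ == "__main__":
    # Count-based example: 10 / (14 + 14 - 10)
    print(tanimoto_continuous([1, 2, 0, 3], [1, 0, 2, 3]))
    # Binary example: matches c / (a + b - c) = 2 / (3 + 3 - 2)
    print(tanimoto_continuous([1, 1, 0, 1], [1, 0, 1, 1]))
```

A distance for use with similarity search or clustering can then be taken as one minus this value, which is the usual convention for non-negative feature vectors.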

Best,
Daria

gcincilla · January 27, 2020, 3:35pm

Hi Daria,
Thank you very much for your nice reply. It would be great if your development team could implement this in the standard node! I think the KNIME community will benefit from it.
Thank you also for the Python snippet, I’ll try that. As for using the Euclidean distance from byte vectors, this is what I’m currently doing. However, the properties of the Euclidean distance are different from those of the Tanimoto distance, and this is why I suggested this implementation.
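One way the two measures differ can be seen in a small sketch like the one below. It is an illustration under assumed toy vectors, not a statement about any particular fingerprint. Multiplying both vectors by the same factor multiplies the Euclidean distance by that factor but leaves the continuous Tanimoto coefficient unchanged. The helper name `tanimoto` is hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def tanimoto(a, b):
    """Continuous Tanimoto coefficient (hypothetical helper, non-zero input assumed)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)


# Toy count vectors; the second pair is the first pair scaled by 10.
small = ([1, 2, 0], [2, 1, 0])
large = ([10, 20, 0], [20, 10, 0])

for u, v in (small, large):
    print("euclidean:", math.dist(u, v), "tanimoto:", tanimoto(u, v))
```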
Best,
Gio
