KNIME was GREAT but HELP is NEEDED

Hi All, 

I have to design a similarity search protocol that compares a binary fingerprint (reference) against a dataset of binary fingerprints (targets). The comparison is based on the Euclidean distance coefficient = square root of (a + b - 2c), where a = number of bits set to 1 in the reference fingerprint, b = number of bits set to 1 in the target fingerprint, and c = number of bits set to 1 at the same position in both the reference and target fingerprints. 
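
To make the definition concrete, here is a minimal sketch of the coefficient in plain Java using java.util.BitSet. It assumes both fingerprints are already available as BitSet objects, which may not be the case inside a KNIME Java Snippet. The class and method names are made up for illustration.

```java
import java.util.BitSet;

class EuclideanFingerprintDistance {

    // Euclidean distance between two binary fingerprints: sqrt(a + b - 2c).
    static double distance(BitSet reference, BitSet target) {
        int a = reference.cardinality();             // bits set to 1 in the reference
        int b = target.cardinality();                // bits set to 1 in the target
        BitSet common = (BitSet) reference.clone();  // copy so the input is not modified
        common.and(target);                          // keep only bits set in both
        int c = common.cardinality();
        return Math.sqrt(a + b - 2 * c);
    }
}
```

For example, a reference with bits 0, 1, 3 set and a target with bits 1, 3, 4, 5 set gives a = 3, b = 4, c = 2, so the distance is the square root of 3.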

Both the reference and target fingerprints are in binary format and have the same length (e.g. 2048 bits of 1s and 0s). They were generated using the RDKit Fingerprint node.

I have tried the approaches below, but all of them failed:

1) Processing with BitSet methods in a Java Snippet -- Although the fingerprints are "shown" as being of bit vector type, I was disappointed to find that there was nothing I could do to process them as bit vectors. The Java Snippet recognized the fingerprints as String values.

2) Since the fingerprints arrive as Strings, I did some conversion and used arrays to hold each bit of the fingerprints. The reference array was then compared with each target array using the Euclidean measure. This looked good; however, to my surprise, it takes ages (more than 24 hours) to run. FYI, I have a massive amount of data in my dataset, so I guess this might be the reason why it was so slow.

3) Then I tried combinations of nodes, e.g. Expand Bit Vector > Column Aggregator, and was able to get both the a and b values. However, I still can't figure out how to get my c value (as defined above). I have tried Joiner and Math Formula, but all failed, so I decided to post this topic. 
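
As a possible variation on approach 2, the String could be parsed into a java.util.BitSet once per fingerprint instead of being compared element by element in arrays. The sketch below assumes the String is a plain sequence of '0' and '1' characters, which I have not verified for the RDKit output. It also relies on the fact that a + b - 2c equals the number of positions where the two fingerprints differ, so a single XOR gives the value under the square root. The class and method names are hypothetical.

```java
import java.util.BitSet;

class FingerprintStrings {

    // Parse a string of '0'/'1' characters into a BitSet (index 0 = first character).
    // Assumption: the fingerprint String has exactly this form.
    static BitSet parse(String bits) {
        BitSet set = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            char ch = bits.charAt(i);
            if (ch == '1') {
                set.set(i);
            } else if (ch != '0') {
                throw new IllegalArgumentException("Unexpected character at position " + i);
            }
        }
        return set;
    }

    // sqrt(a + b - 2c) computed as sqrt(number of differing bits).
    static double distance(BitSet reference, BitSet target) {
        BitSet diff = (BitSet) reference.clone();  // copy so the reference can be reused
        diff.xor(target);                          // bits set in exactly one fingerprint
        return Math.sqrt(diff.cardinality());
    }
}
```

The reference would be parsed once and reused for every target row, so each comparison costs one XOR and one bit count over 2048 bits. Whether this is fast enough for a very large dataset is something I can only guess at.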

I am really grateful to the KNIME community for developing such a collection of useful nodes and functions. I am new to KNIME but have already enjoyed my time designing and coding with it. I appreciate the art of designing protocols in KNIME, and I understand that this could be done in various ways. Thus, I really hope to get some extra ideas that I couldn't think of due to my limited experience. 

Thank you.

LM   

 

Good question. I guess one node in your workflow would need to be the Similarity Search node. Regarding the distance: for fingerprints, KNIME offers a parameterizable Tversky distance (Dice, Tanimoto). In theory you could also cast your bit vector into a string (as you've done already) and then use a "java distance" node, but that's probably tricky with the 'c' value again (I tried but gave up after 20 min).

Attached is a workflow that demonstrates what you can do with the standard nodes (similarity search using Tanimoto).

Hope that helps (a bit).

Hi Wiswedel, thank you so much for your reply.

Yes, I did use the Similarity Search node in my other protocol to compare binary fingerprints using the Tanimoto and Cosine coefficients. The problem arose when I tried to design a protocol for binary fingerprints using the Euclidean coefficient. Thanks for your attachment; I have looked through it.

May I ask for one more idea that could probably solve my problem? I tried to expand the binary fingerprints using the Expand Bit Vector node. This gave me, for example, 2048 columns of 1s and 0s. That works well, as I can easily calculate the values of a and b using the Column Aggregator node. 

My question is: does anyone know how I can compare each column of the reference row with the same column of each target row, one by one? I tried using two Column List Loop Start nodes, one for the reference and one for the target rows, but failed. 
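
To show what I mean by the column-by-column comparison, here is a small Java sketch of the calculation on two expanded rows. It assumes each row is available as an array with one 0/1 value per expanded column. The class and method names are only for illustration, and this is not a KNIME node configuration. The product of the two values in a column is 1 only when both bits are set, so summing the products gives c.

```java
class ExpandedRowDistance {

    // referenceRow and targetRow hold one 0/1 value per expanded column.
    static double distance(int[] referenceRow, int[] targetRow) {
        if (referenceRow.length != targetRow.length) {
            throw new IllegalArgumentException("Rows must have the same number of columns");
        }
        int a = 0;
        int b = 0;
        int c = 0;
        for (int i = 0; i < referenceRow.length; i++) {
            a += referenceRow[i];                 // counts bits set in the reference
            b += targetRow[i];                    // counts bits set in the target
            c += referenceRow[i] * targetRow[i];  // 1 only when both bits are set
        }
        return Math.sqrt(a + b - 2 * c);
    }
}
```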

Thanks in advance.