Hi All,
I have to design a similarity search protocol that compares one binary fingerprint (the reference) against a dataset of binary fingerprints (the targets). The comparison is based on the Euclidean distance coefficient = sqrt(a + b - 2c), where a = number of bits set to 1 in the reference fingerprint, b = number of bits set to 1 in the target fingerprint, and c = number of bits set to 1 at the same position in both the reference and target fingerprints.
Both the reference and target fingerprints are in binary format and have the same length (e.g. 2048 bits of 1s and 0s). They were generated using the RDKit Fingerprint node.
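To make the definition concrete, here is a minimal Java sketch of the coefficient, assuming the two fingerprints are already available as java.util.BitSet objects of equal length. The class and method names (FingerprintDistance, euclidean) are made up for illustration and are not part of KNIME or RDKit.

```java
import java.util.BitSet;

final class FingerprintDistance {
    /** Euclidean distance sqrt(a + b - 2c) between two binary fingerprints. */
    static double euclidean(BitSet reference, BitSet target) {
        int a = reference.cardinality();            // bits set to 1 in the reference
        int b = target.cardinality();               // bits set to 1 in the target
        BitSet common = (BitSet) reference.clone(); // copy so the input is not modified
        common.and(target);                         // keep only bits set in both
        int c = common.cardinality();               // bits set to 1 in both fingerprints
        return Math.sqrt(a + b - 2 * c);
    }
}
```

For example, with the reference bits {0, 1, 2} set and the target bits {1, 2, 3, 4} set, a = 3, b = 4 and c = 2, so the distance is sqrt(3 + 4 - 4) = sqrt(3). Note that a + b - 2c is simply the number of positions where the two fingerprints differ.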
I have tried the approaches below, but none of them worked:
1) Processing the fingerprints with the BitSet class in a Java Snippet node. Although the fingerprint column is shown as a bit vector type, I was disappointed to find that I could not process the fingerprints as bit vectors: inside the Java Snippet they were treated as String values.
2) Since the fingerprints arrived as Strings, I converted them and used arrays to hold each bit of each fingerprint. The reference array was then compared with each target array using the Euclidean measure. This looked good, but to my surprise it takes ages (more than 24 hours) to run. FYI, I have a massive amount of data in my dataset, so I guess this is why it was so slow.
3) Then I tried a combination of nodes, e.g. Expand Bit Vector > Column Aggregator, and was able to get both the a and b values. However, I still can't figure out how to get my c value (as defined above). I tried Joiner and Math Formula, but neither worked, so I decided to post this topic.
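For reference, the missing c value is a bitwise AND followed by a bit count. The sketch below is only an illustration under one assumption: that each fingerprint reaches the code as a String of '0' and '1' characters (the actual String form produced by the node may differ). PackedFingerprint, parse and commonBits are hypothetical names. Packing the bits into 64-bit words means a 2048-bit comparison needs 32 AND operations instead of 2048 element comparisons.

```java
final class PackedFingerprint {
    /** Packs a String of '0'/'1' characters into 64-bit words (hypothetical helper). */
    static long[] parse(String bits) {
        long[] words = new long[(bits.length() + 63) / 64];
        for (int i = 0; i < bits.length(); i++) {
            char ch = bits.charAt(i);
            if (ch == '1') {
                words[i / 64] |= 1L << (i % 64);   // set bit i inside its word
            } else if (ch != '0') {
                throw new IllegalArgumentException("Not a binary digit at index " + i);
            }
        }
        return words;
    }

    /** c: number of positions set to 1 in both fingerprints. */
    static int commonBits(long[] x, long[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("Fingerprints differ in length");
        }
        int c = 0;
        for (int i = 0; i < x.length; i++) {
            c += Long.bitCount(x[i] & y[i]);       // popcount of the AND of each word
        }
        return c;
    }
}
```

Parsing the reference once and reusing the packed words for every target row avoids repeating the String conversion for the reference on each comparison.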
I am grateful to the KNIME community for developing such a collection of useful nodes and functions. I am new to KNIME but have already enjoyed my time designing and coding with it. I appreciate the art of designing protocols in KNIME and understand that the same task can be done in various ways, so I really hope to get some extra ideas that I couldn't think of due to my limited experience.
Thank you.
LM