CDK Fingerprint Similarity - Bitstring lengths/statistics

Hi All,

Whilst the CDK Fingerprint Similarity node (and, for that matter, most of the similarity nodes in KNIME) performs well for the majority of purposes, I have a set of uncommon coefficients that I wish to use.

Whilst I could use a Java Snippet to do this, it seems wasteful to convert the bitvectors to strings and then convert the strings back to BitSets inside the Java Snippet, just to perform the similarity calculations.

I would like to request a simple modification: allow the fingerprint similarity node to output these additional columns:

- number of (on) bits in reference molecule(s)

- number of (on) bits in database molecule

- number of (on) bits common to the database molecule and each reference

- number of off bits?

These counts are the inputs to most similarity coefficients, so having them as columns would satisfy my needs.
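
To illustrate why those counts would be enough, here is a minimal Java sketch of a few coefficients written purely in terms of them. The class and method names are hypothetical, the particular coefficients are only examples, and returning 0.0 for empty fingerprints is an assumed convention rather than anything the CDK node is known to do.

```java
/** Sketch: similarity coefficients expressed through bit counts only. */
final class CoefficientSketch {
    private CoefficientSketch() {}

    // a = on bits in the reference, b = on bits in the database molecule,
    // c = on bits common to both, n = nominal fingerprint length in bits.

    static double tanimoto(int a, int b, int c) {
        int union = a + b - c;               // on bits in either fingerprint
        return union == 0 ? 0.0 : (double) c / union;
    }

    static double dice(int a, int b, int c) {
        return (a + b) == 0 ? 0.0 : 2.0 * c / (a + b);
    }

    static double cosine(int a, int b, int c) {
        return (a == 0 || b == 0) ? 0.0 : c / Math.sqrt((double) a * b);
    }

    /** Tversky index; alpha weights reference-only bits, beta database-only bits. */
    static double tversky(int a, int b, int c, double alpha, double beta) {
        double denom = c + alpha * (a - c) + beta * (b - c);
        return denom == 0.0 ? 0.0 : c / denom;
    }

    /** Bits that are off in both fingerprints. */
    static int commonOffBits(int a, int b, int c, int n) {
        return n - (a + b - c);
    }

    /** Simple matching: shared on bits and shared off bits both count as agreement. */
    static double simpleMatching(int a, int b, int c, int n) {
        return n == 0 ? 0.0 : (double) (c + commonOffBits(a, b, c, n)) / n;
    }
}
```

For example, with a = 4, b = 3 and c = 2, `tanimoto` gives 2 / (4 + 3 - 2) = 0.4. Coefficients that use the off bits also need the nominal fingerprint length n, which is why the last item in the list matters.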

Thanks in advance,

Ed.

Hi Ed,

KNIME already supports bitvectors in different ways. I believe that all the above can be achieved using regular KNIME node functionality. Using the "Column Aggregator" node you can get the number of set bits as well as the union or intersection of two bitvectors.

I have attached a screenshot of an example workflow. All operations on the bitvectors were effectively instantaneous for my test set of 100,000 rows.
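
For anyone who wants to check the numbers outside a workflow, the same quantities can be sketched with plain `java.util.BitSet`. This is only an illustration of what "number of set bits", "intersection" and "union" mean here; it is not the implementation of the Column Aggregator node, and the class and helper names are made up.

```java
import java.util.BitSet;

/** Sketch: the counts discussed above, computed on java.util.BitSet. */
final class BitCountSketch {
    private BitCountSketch() {}

    /** Number of on bits in one fingerprint. */
    static int onBits(BitSet fp) {
        return fp.cardinality();
    }

    /** Number of on bits shared by both fingerprints (intersection). */
    static int commonOnBits(BitSet reference, BitSet database) {
        BitSet both = (BitSet) reference.clone(); // and() mutates, so work on a copy
        both.and(database);
        return both.cardinality();
    }

    /** Number of on bits set in at least one fingerprint (union). */
    static int unionOnBits(BitSet reference, BitSet database) {
        BitSet either = (BitSet) reference.clone(); // or() mutates as well
        either.or(database);
        return either.cardinality();
    }

    /** Hypothetical helper: build a fingerprint from the indices of its on bits. */
    static BitSet fromIndices(int... indices) {
        BitSet fp = new BitSet();
        for (int i : indices) {
            fp.set(i);
        }
        return fp;
    }
}
```

With a reference of {0, 2, 5, 7} and a database molecule of {2, 3, 7}, this gives 4 and 3 on bits, 2 common bits and a union of 5. Note that `BitSet.length()` reports the index of the highest set bit plus one, not the nominal fingerprint length, so the length has to be carried separately if the off-bit count is needed.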

Does that work for you?

Kind regards,

Stephan