RDKit Fingerprint node and (CDK) Fingerprints node give different MACCS keys

_sam · January 13, 2019, 10:43pm

Hi all,

When I generate MACCS keys with two different nodes (the RDKit Fingerprint node and the (CDK) Fingerprints node), I get two different fingerprints.

First, the lengths differ: the RDKit node produces a 167-bit fingerprint and the CDK node produces a 166-bit one.

Also, on closer inspection, the bit patterns themselves differ.
For example (taken from compound 10 from the uploaded workflow):

RDKit:
01111110001000011010011010000100100000000110011000000000000100001000101001000011000101010100000000010110110100000001010000001010100000000000000000000000000000000000000

CDK:
0111111000100001101001101000010010000000011011100000001000010000100010100100001100010101010100000001011011010000000101000000101010000000000000000000000000000000000000

Does anyone know why this is happening, or where I might be able to learn more about and/or fix this issue?

I have attached a workflow that shows the error.

Fingerprint_issue.knwf (77.0 KB)

greglandrum · January 14, 2019, 4:35am

Hi,

There are 166 public MACCS keys. The RDKit produces a fingerprint with 167 bits so that the index of each bit (bits are always indexed from zero) matches the number of the key; bit 0 is always 0. So MACCS key 43 is bit 43 in the RDKit implementation, whereas it would be bit 42 in the CDK implementation.
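
To make the indexing concrete, here is a minimal Python sketch of the two layouts as described above. The function names are made up for illustration and are not part of the RDKit, the CDK, or KNIME.

```python
# Illustrative helpers only; these names are hypothetical and not part of any toolkit.
NUM_MACCS_KEYS = 166


def _check_key(key: int) -> None:
    """Reject anything that is not a valid public MACCS key number."""
    if not 1 <= key <= NUM_MACCS_KEYS:
        raise ValueError(f"MACCS key must be in 1..{NUM_MACCS_KEYS}, got {key}")


def rdkit_bit_index(key: int) -> int:
    """167-bit layout: bit 0 is never set, so key N sits at bit N."""
    _check_key(key)
    return key


def cdk_bit_index(key: int) -> int:
    """166-bit layout: no extra bit 0, so key N sits at bit N - 1."""
    _check_key(key)
    return key - 1


print(rdkit_bit_index(43), cdk_bit_index(43))  # prints "43 42"
```

So before comparing bit positions across the two layouts, one of them has to be shifted by one.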

The differences in results are not horribly surprising; I would almost always expect different toolkits to produce different fingerprint results even if they are using the “same” definition. Here’s a specific explanation of what’s going on in your case:
The first differing bit is #122 (when KNIME displays fingerprints as bit strings, the lowest-numbered bit comes last). The definition of this bit used by the RDKit (and I think the CDK uses the RDKit SMARTS definitions) is:
122: ('*~[#7](~*)~*', 0), # AN(A)A
That is an N atom with three neighbors.
Here’s the first molecule from your table:

If you consider the H on one of the Ns in the imidazole ring to be a neighbor, then bit 122 should be set for this molecule. The RDKit does not consider this H to be a neighbor (unless you have added Hs to the molecule); the CDK seems to. To the best of my knowledge, there is no publicly available authoritative definition of the MACCS keys, so it's impossible to know which toolkit is doing the “right” thing.
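
For anyone who wants to find the differing keys themselves, the following is a rough Python sketch. It assumes the bit strings are printed with the lowest-numbered bit last (as KNIME does) and that the 167-bit string carries the extra bit 0 while the 166-bit string does not. The helper names and the toy fingerprints are invented for illustration; they are not the fingerprints from the workflow.

```python
# Hypothetical helpers for comparing MACCS bit strings from the two layouts.

def keys_from_bitstring(bits: str, has_bit_zero: bool) -> set:
    """Return the MACCS key numbers set in a bit string (lowest bit printed last)."""
    offset = 0 if has_bit_zero else 1
    # reversed() walks the string from bit 0 upwards.
    return {i + offset for i, ch in enumerate(reversed(bits)) if ch == "1"}


def differing_keys(rdkit_bits: str, cdk_bits: str) -> list:
    """Key numbers set in exactly one of the two fingerprints."""
    rdkit_keys = keys_from_bitstring(rdkit_bits, has_bit_zero=True)
    cdk_keys = keys_from_bitstring(cdk_bits, has_bit_zero=False)
    return sorted(rdkit_keys ^ cdk_keys)


def to_bitstring(keys, num_bits: int, has_bit_zero: bool) -> str:
    """Build a toy bit string (lowest bit last) with the given keys set."""
    offset = 0 if has_bit_zero else 1
    bits = ["0"] * num_bits
    for key in keys:
        bits[key - offset] = "1"
    return "".join(reversed(bits))


# Toy data: the two fingerprints agree except for key 122.
rdkit_toy = to_bitstring({43, 160}, 167, has_bit_zero=True)
cdk_toy = to_bitstring({43, 122, 160}, 166, has_bit_zero=False)
print(differing_keys(rdkit_toy, cdk_toy))  # prints [122]
```

Pasting the two strings from the original post into `differing_keys` in place of the toy data should list the keys on which the two nodes disagree.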

The main message from all of this is that it’s pretty much never safe to compare fingerprints generated with different toolkits.

-greg

_sam · January 14, 2019, 7:30pm

Hi Greg,

Thanks for your reply, that makes sense. I will be sure to stick to the same toolkit when comparing.

Thanks, Sam
