CDK Pubchem Fingerprint not matching chemfp fingerprint

Hello,

I imported the CDK package into Java, and converted a base64 string into binary. The length of the CDK generated string was 896 (I noticed that this question was already posted), and the actual length is 881. I took the first 15 bits off, and I got 881.

I also used the chemfp package to generate the binary string for the same compound. I got the FPS output (in hex) and converted it to binary. The string length was 888 (and after chopping off the last 7 bits, I got 881).

The two strings don't match, and they are very different. Aren't they supposed to match? The compound is Nelfinavir.

Could someone please tell me what I am doing wrong? Which string is correct?

 

P.S. If this is not clear, I can post the two binary string outputs.

 

Thank you!

 

Hi there,

 

Yes, they are supposed to match. Fingerprints that follow the same definition should always yield the same result. If they don't, then that's because of implementation errors or because the input molecule is not the same, e.g., due to a different or wrong representation.

 

I used ChemFP and CDK to generate the 881-bit PubChem fingerprint and also directly extracted the fingerprint from the SDFile of Nelfinavir:

  1. The converted PubChem fingerprint is 920 bits long: The first four bytes encode the length of the fingerprint (881), the last seven bits are padding. That's our gold standard.
  2. ChemFP extracts the fingerprint from the PubChem SDFile and returns a string following the FPS format. Because of 'funny' byte ordering, you have to be careful when decoding the string. Subsequent characters need to be transposed and the actual byte strings inverted. The resulting binary string is 888 bits long: 881 significant bits and seven padding bits. The binary string does not include the four bytes for the size of the fingerprint. The trimmed fingerprint is identical to the PubChem fingerprint.
  3. CDK generates a bit string of length 896 in KNIME. If you copy and paste the cell value into your favourite editor, the bit string will be inverted. After inverting the string and trimming it to a length of 881 (15 bits are padding), CDK yields a similar PubChem fingerprint. The CDK PubChem fingerprint and our gold standard differ in three bit positions (counting from 0 in the listings at the end of this post): 248, 250, and 251 (explanation). That most likely has to do with ring perception in CDK and should get fixed soon.
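
The three decoding steps above can be sketched in Python as follows. This is only a rough illustration under stated assumptions, not the code I actually used: the function names are made up, the four length bytes are assumed to be big-endian, PubChem is assumed to store bit 0 in the most significant bit of the first payload byte, FPS is assumed to store bit 0 in the least significant bit of the first byte, and "inverted" for the pasted CDK string is read as "reversed". The toy 9-bit fingerprint is invented so that the example stays short.

```python
import base64


def pubchem_base64_to_bits(text):
    """Decode a base64 PubChem fingerprint into a '0'/'1' string."""
    raw = base64.b64decode(text)
    # First four bytes: number of significant bits (assumed big-endian).
    nbits = int.from_bytes(raw[:4], "big")
    # Assumption: bit 0 is the most significant bit of the first payload byte.
    bits = "".join(format(byte, "08b") for byte in raw[4:])
    return bits[:nbits]  # drop the trailing padding bits


def fps_hex_to_bits(hex_fp, nbits):
    """Decode a hex-encoded FPS fingerprint into a '0'/'1' string."""
    raw = bytes.fromhex(hex_fp)
    # Assumption: bit 0 is the least significant bit of the first byte,
    # so the bit string of every byte is reversed.
    bits = "".join(format(byte, "08b")[::-1] for byte in raw)
    return bits[:nbits]  # FPS carries no length prefix, so nbits is passed in


def cdk_pasted_to_bits(pasted, nbits):
    """Reverse a pasted CDK bit string, then trim the padding."""
    # Assumption: "inverted" means reversed, so the padding ends up last.
    return pasted[::-1][:nbits]


# Invented toy fingerprint: 9 bits, with bits 0 and 8 set.
toy_pubchem = base64.b64encode(bytes([0, 0, 0, 9, 0x80, 0x80])).decode("ascii")
toy_fps = "0101"
toy_cdk = "0000000100000001"

print(pubchem_base64_to_bits(toy_pubchem))  # 100000001
print(fps_hex_to_bits(toy_fps, 9))          # 100000001
print(cdk_pasted_to_bits(toy_cdk, 9))       # 100000001
```

For the real data you would pass 881 as `nbits` to the FPS and CDK helpers; the PubChem helper reads the length from the first four bytes.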

Conclusion: In this case ChemFP gets it 100% right, CDK differs in three bits. That will be fixed of course. :)
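
If you want to check the differing positions yourself, a small helper like the one below is enough. It is a sketch: the function name and the `offset` parameter are my own, and the two windows are copied from the PubChem and CDK listings at the end of this post, assuming 83 characters per row and positions counted from 0 (so the window starts at position 240).

```python
def differing_positions(a, b, offset=0):
    """Return the positions at which two equally long bit strings differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return [offset + i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Positions 240-259 of the two listings (0-based), i.e. the end of the
# third row followed by the start of the fourth row.
pubchem_window = "00000000101100010100"
cdk_window = "00000000000000010100"

print(differing_positions(pubchem_window, cdk_window, offset=240))
# [248, 250, 251]
```

Run on the full trimmed 881-bit strings with `offset=0`, the same helper should report an empty list for PubChem versus ChemFP.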

 

I hope that helps a bit.

 

Cheers,

Stephan

 

### PubChem

11110000011111110011100000000000010000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000011110001100000110000010000000000000000000000000000000000000000000000001
01100010100000000000000000000000001111000000100000100000000100000000000000000000000
11011011110011100101110110000000011010110010000001111000001111000000000000100000100
01000100000000010001000010101001000010000000000001000001000000000000000000010010000
10100000010000000010001000100010011001100011100000110011001000000010100111011000110
01010100010110101010011001110010100011100010000100001100100110101100001000110111000
10011001100001111011101011001000101100001000111010000000000000000000000100000000000
00000000110000000000000000000000000000000000000000010000000000000000000110000000000
000000000000000000000000000000000000000000000000000

 

### ChemFP

11110000011111110011100000000000010000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000011110001100000110000010000000000000000000000000000000000000000000000001
01100010100000000000000000000000001111000000100000100000000100000000000000000000000
11011011110011100101110110000000011010110010000001111000001111000000000000100000100
01000100000000010001000010101001000010000000000001000001000000000000000000010010000
10100000010000000010001000100010011001100011100000110011001000000010100111011000110
01010100010110101010011001110010100011100010000100001100100110101100001000110111000
10011001100001111011101011001000101100001000111010000000000000000000000100000000000
00000000110000000000000000000000000000000000000000010000000000000000000110000000000
000000000000000000000000000000000000000000000000000
 

### CDK

11110000011111110011100000000000010000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000011110001100000110000010000000000000000000000000000000000000000000000000
00000010100000000000000000000000001111000000100000100000000100000000000000000000000
11011011110011100101110110000000011010110010000001111000001111000000000000100000100
01000100000000010001000010101001000010000000000001000001000000000000000000010010000
10100000010000000010001000100010011001100011100000110011001000000010100111011000110
01010100010110101010011001110010100011100010000100001100100110101100001000110111000
10011001100001111011101011001000101100001000111010000000000000000000000100000000000
00000000110000000000000000000000000000000000000000010000000000000000000110000000000
000000000000000000000000000000000000000000000000000

Dear Stephan,

Thank you very much! This was new to me, and your example was very clear and helpful! Much appreciated!

 

 
