I imported the CDK package into Java, and converted a base64 string into binary. The length of the CDK generated string was 896 (I noticed that this question was already posted), and the actual length is 881. I took the first 15 bits off, and I got 881.
I also used the chemfp package to generate the binary string for the same compound. I got the FPS output (in hex) and converted it to binary. The string length was 888 (and after chopping off the last 7 bits, I got 881).
The two strings don't match, and they are very different. Aren't they supposed to match? The compound is Nelfinavir.
Could someone please tell me what I am doing wrong? Which string is correct?
P.S. If this is not clear, I can post the two binary string outputs.
Yes, they are supposed to match. Fingerprints that follow the same definition should always yield the same result. If they don't, then that's because of implementation errors or because the input molecule is not the same, e.g., due to a different or wrong representation.
I used ChemFP and CDK to generate the 881 bit PubChem fingerprint and also directly extracted the fingerprint from the SDFile of Nelfinavir.
The converted PubChem fingerprint is 920 bits long: The first four bytes encode the length of the fingerprint (881), the last seven bits are padding. That's our gold standard.
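The layout described above (four length bytes, 881 fingerprint bits, seven padding bits) can be unpacked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes the length prefix is big-endian and that bits are read most-significant-bit first within each byte. The demo input is synthetic, not the Nelfinavir fingerprint.

```python
import base64


def decode_pubchem_fingerprint(b64_text: str) -> str:
    """Return the fingerprint as a '0'/'1' string without prefix or padding.

    Assumptions (not stated in the post): big-endian 4-byte length prefix,
    and bit 0 is the most significant bit of the first payload byte.
    """
    raw = base64.b64decode(b64_text)
    nbits = int.from_bytes(raw[:4], "big")   # 881 for a PubChem fingerprint
    payload = raw[4:]
    bits = "".join(format(byte, "08b") for byte in payload)
    if nbits > len(bits):
        raise ValueError("length prefix exceeds the available payload bits")
    return bits[:nbits]                      # drop the trailing padding bits


# Synthetic demo: a 10-bit "fingerprint" stored in two payload bytes.
demo_raw = (10).to_bytes(4, "big") + bytes([0b10100000, 0b11000000])
demo = base64.b64encode(demo_raw).decode("ascii")
print(decode_pubchem_fingerprint(demo))  # 1010000011
```

For a real record, 4 + 111 bytes give the 920 bits mentioned above, and the returned string has length 881.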
ChemFP extracts the fingerprint from the PubChem SDFile and returns a string following the FPS format. Because of 'funny' byte ordering, you have to be careful when decoding the string. Subsequent characters need to be transposed and the actual byte strings inverted. The resulting binary string is 888 bits long: 881 significant bits and seven padding bits. The binary string does not include the 4 bytes for the size of the fingerprint. The trimmed fingerprint is identical to the PubChem fingerprint.
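One way to read the decoding step above: swapping the two hex characters of a byte and reversing each character's four bits is the same as reversing all eight bits of that byte, i.e. bit 0 is the least significant bit of the first byte. The sketch below relies on that reading; the function name is hypothetical and the input is a synthetic two-byte example, not real chemfp output.

```python
def fps_hex_to_bits(hex_text: str, nbits: int) -> str:
    """Convert an FPS-style hex string to a '0'/'1' string of length nbits.

    Assumption: within each byte the least significant bit comes first,
    so each byte's 8-bit binary form is reversed before concatenation.
    """
    data = bytes.fromhex(hex_text)
    bits = "".join(format(byte, "08b")[::-1] for byte in data)
    if nbits > len(bits):
        raise ValueError("nbits exceeds the number of bits in the hex string")
    return bits[:nbits]  # drop trailing padding bits


# Synthetic demo: 0x07 -> 11100000, 0xde -> 01111011, keep the first 12 bits.
print(fps_hex_to_bits("07de", 12))  # 111000000111
```

For the fingerprint discussed here, the call would use `nbits=881` on the 222-character hex field, which drops the seven padding bits.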
CDK generates a bit string of length 896 in KNIME. If you copy and paste the cell value to your favourite editor, the bit string will be inverted. After inversion and trimming the string to a length of 881 (15 bits are padding), CDK yields a similar PubChem fingerprint. The CDK PubChem fingerprint and our gold standard differ in three bit positions: 248, 250, and 251 (explanation). That most likely has to do with ring perception in CDK and should get fixed soon.
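Once both fingerprints are available as equal-length bit strings in the same bit order, listing the positions where they disagree is straightforward. A minimal sketch with a hypothetical helper and toy strings; indices here are 0-based, which may or may not match the numbering convention used for the positions quoted above.

```python
def differing_positions(a: str, b: str) -> list:
    """Return the 0-based indices at which two bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Toy strings, not real fingerprints.
reference = "0110100"
candidate = "0100110"
print(differing_positions(reference, candidate))  # [2, 5]
```

An empty list means the two fingerprints are identical bit for bit.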
Conclusion: In this case ChemFP gets it 100% right, while CDK differs in three bits. That will be fixed of course. :)