PUBCHEM Fingerprints 896 bits?

Hello,

I have converted PubChem bit vectors to strings and split each bit position into its own column so that I can use bit positions as predictor variables in QSAR models.  I know which bit positions contribute to the activity and would like to retrieve the chemical features represented by these bits.  PubChem provides a description of each bit (ftp://ftp.ncbi.nih.gov/pubchem/data_spec/pubchem_fingerprints.txt).  However, the PubChem fingerprint I computed with the CDK node has more bits (896) than the 881 bits expected for the PubChem fingerprint, so I cannot map bits to PubChem features.  I also tried treating the last 15 bits as padding, but the bits that are then turned on and off within the first 881 positions do not agree with the structures, so I do not believe it is a simple matter of padding.  Any insight into how to map CDK PubChem bit positions to chemical features as described in the link above would be much appreciated.  Thanks,

-Dan

Hi Dan,

The fingerprint from the CDK node has 896 bits (its logical length), but the PubChem fingerprint itself is only 881 bits long. Here is the confusing bit: the first 15 bits are the 'padding'. If you remove the first 15 bits and invert the bit set (that is, reverse the bit order, so that bit position 896 becomes bit 0, etc.), the mapping provided in your link should work.

I checked the mappings for the first 20-odd bits with a few molecules and the bits are set correctly.
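
In case it is useful, the transformation described above could be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the fingerprint has been exported as a string of 896 '0'/'1' characters in the order the node displays them (first character = position 1), and the function names are made up for this example, not part of CDK or KNIME.

```python
CDK_LENGTH = 896       # logical length of the bit string from the CDK node
PUBCHEM_LENGTH = 881   # number of bits in the PubChem fingerprint spec
PADDING = CDK_LENGTH - PUBCHEM_LENGTH  # 15 leading padding bits


def cdk_to_pubchem_bits(bit_string: str) -> str:
    """Drop the 15 leading padding bits, then reverse the bit order.

    Assumption: bit_string holds 896 characters, each '0' or '1'.
    In the result, index 0 corresponds to PubChem bit position 0.
    """
    if len(bit_string) != CDK_LENGTH:
        raise ValueError(f"expected {CDK_LENGTH} bits, got {len(bit_string)}")
    if set(bit_string) - {"0", "1"}:
        raise ValueError("bit string may only contain '0' and '1'")
    without_padding = bit_string[PADDING:]   # remove the first 15 bits
    return without_padding[::-1]             # last position becomes bit 0


def set_pubchem_positions(bit_string: str) -> list[int]:
    """Return the 0-based PubChem bit positions that are switched on."""
    remapped = cdk_to_pubchem_bits(bit_string)
    return [i for i, bit in enumerate(remapped) if bit == "1"]
```

The positions returned by set_pubchem_positions can then be looked up directly in pubchem_fingerprints.txt, which numbers its bits starting from 0.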

Hope that helps,

Stephan 

Hi Stephan,

The mappings make sense after removing the first 15 bits and inverting.  Thank you very much,

Dan