Feature extraction from molecular fingerprint


I am working on a Bayesian model to predict molecular affinities based on molecular fingerprints. At this stage, I would like to analyse why a particular ligand is scored high or low. To do that, I want to look at the highest contributions from the bit vectors and extract the molecular features that are encoded by those particular bits.

This is heading in the direction of QSAR.

Did you ever do that? Or would you recommend another way of doing that?

Regards, Soren

Hi Soren, 

Going from bits back to substructures is not currently possible in KNIME as far as I know, but definitely an interesting exercise. Maybe somebody else has found a way?

In the meantime, one thing you can try is a matched pairs analysis, where you calculate a distance matrix and then use the new Distance Matrix Pair Extractor to find pairs with very high similarity and large activity differences. Comparing these structures can give you an idea of which features in your molecule are important.
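Outside of KNIME, the same matched pairs idea can be sketched in a few lines of plain Python. This is only an illustration: fingerprints are represented as sets of on-bit indices, and the ligand names, activity values, and thresholds (`min_sim`, `min_delta`) are made-up assumptions, not anything taken from the Distance Matrix Pair Extractor node.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def activity_cliffs(fps, activities, min_sim=0.7, min_delta=1.0):
    """Return (id1, id2, similarity, activity difference) for pairs that are
    structurally similar but differ strongly in activity."""
    pairs = []
    for i, j in combinations(sorted(fps), 2):
        sim = tanimoto(fps[i], fps[j])
        delta = abs(activities[i] - activities[j])
        if sim >= min_sim and delta >= min_delta:
            pairs.append((i, j, sim, delta))
    # Largest activity difference first
    return sorted(pairs, key=lambda p: p[3], reverse=True)

# Toy data (hypothetical ligands and activities)
fps = {
    "lig_A": {1, 5, 9, 12, 20},
    "lig_B": {1, 5, 9, 12, 21},
    "lig_C": {2, 3, 40, 41},
}
activities = {"lig_A": 8.2, "lig_B": 5.9, "lig_C": 6.0}

cliffs = activity_cliffs(fps, activities, min_sim=0.6, min_delta=1.0)
for id1, id2, sim, delta in cliffs:
    # Bits set in only one of the two molecules are the first place to look
    differing = sorted(fps[id1] ^ fps[id2])
    print(id1, id2, round(sim, 2), round(delta, 2), differing)
```

The bits that differ within such a pair are candidates for the features that drive the activity change, although they still need to be mapped back to substructures by eye or with a toolkit.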

I have regularly done exactly as Aaron describes, with a lot of success. 


I am using Morgan fingerprints from the RDKit node. That's interesting, though. I haven't considered using different fingerprints yet!

Thanks for the links, I will look into that!
The first link is dead.

I am currently using the Fingerprint Bayesian Learner/Predictor nodes.
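For what it's worth, here is a rough sketch of what "the highest contributions from the bits" could mean for a naive Bayes classifier on binary fingerprints. It is a generic Bernoulli-style calculation with Laplace smoothing and only looks at on-bits; I am not claiming this is what the KNIME Fingerprint Bayesian nodes compute internally. The function names and the toy data are hypothetical.

```python
import math

def bit_log_odds(fps, labels, n_bits, alpha=1.0):
    """Per-bit weight: log P(bit on | active) - log P(bit on | inactive),
    with Laplace smoothing controlled by alpha."""
    n_act = sum(labels)
    n_inact = len(labels) - n_act
    weights = []
    for bit in range(n_bits):
        on_act = sum(1 for fp, y in zip(fps, labels) if y and bit in fp)
        on_inact = sum(1 for fp, y in zip(fps, labels) if not y and bit in fp)
        p_act = (on_act + alpha) / (n_act + 2 * alpha)
        p_inact = (on_inact + alpha) / (n_inact + 2 * alpha)
        weights.append(math.log(p_act / p_inact))
    return weights

def top_contributions(fp, weights, k=3):
    """Rank the on-bits of one molecule by the absolute size of their weight.
    Off-bits are ignored here, which a full Bernoulli model would not do."""
    scored = [(bit, weights[bit]) for bit in fp]
    return sorted(scored, key=lambda bw: abs(bw[1]), reverse=True)[:k]

# Toy training set: sets of on-bit indices, label 1 = active, 0 = inactive
fps = [{0, 1}, {0, 2}, {0, 1, 3}, {2, 3}, {2}, {1, 2}]
labels = [1, 1, 1, 0, 0, 0]

weights = bit_log_odds(fps, labels, n_bits=4)
top = top_contributions({0, 2, 3}, weights)
for bit, w in top:
    print(f"bit {bit}: {w:+.3f}")
```

Positive weights push a ligand towards "active" and negative ones towards "inactive", so the top-ranked bits are the ones worth tracing back to substructures.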



PS: Which nodes do you use to generate a similarity based distance matrix?

To generate the similarity based distance matrix, simply generate your fingerprints, and then use the calculate distance matrix node.
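If it helps to see what such a node computes, here is a minimal sketch in plain Python, assuming each fingerprint is stored as a set of on-bit indices. Tanimoto and Dice are shown as two common distance choices; the function names are my own and the data is made up.

```python
def tanimoto_distance(a, b):
    """1 - |A & B| / |A | B| for two sets of on-bit indices."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union

def dice_distance(a, b):
    """1 - 2|A & B| / (|A| + |B|) for two sets of on-bit indices."""
    total = len(a) + len(b)
    return 0.0 if total == 0 else 1.0 - 2.0 * len(a & b) / total

def distance_matrix(fingerprints, distance=tanimoto_distance):
    """Symmetric matrix of pairwise distances with a zero diagonal."""
    n = len(fingerprints)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(fingerprints[i], fingerprints[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix

# Toy fingerprints (hypothetical)
fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8}]
tani = distance_matrix(fingerprints)
dice = distance_matrix(fingerprints, distance=dice_distance)
for row in tani:
    print([round(d, 2) for d in row])
```

Dice gives smaller distances than Tanimoto for the same partially overlapping pair, so similarity thresholds are not interchangeable between the two.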



Besides the Bayesian learner/predictor nodes, you may also want to consider using the Tree Ensemble Learner Regression nodes to predict actual values rather than just classifications.



What fingerprint are you using? The contributions of specific bits may not be particularly meaningful in the context of a hashed fingerprint (CDK, CDK extended...) but may be more interesting in a key-based fingerprint such as MACCS keys or the PubChem fingerprint (also available from CDK and maybe RDKit).

RDKit is able to identify which atoms contribute to bits in a hashed fingerprint and an interpretation algorithm has been developed looking at the contribution of atoms: http://www.jcheminf.com/content/5/1/43

If you want to look at the molecular feature for a specific bit, you will need to use a method such as that in RDKit or ChemAxon's toolkit (they have done this for ECFP, https://docs.chemaxon.com/pages/viewpage.action?pageId=14483752).
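To illustrate why a hashed bit is harder to interpret than a key, here is a toy sketch in plain Python. It does not use RDKit or ChemAxon: the "features" are just made-up fragment strings, CRC32 stands in for whatever hash a real fingerprint uses, and the `bit_info` dictionary only mimics the kind of bit-to-feature record that a toolkit has to keep for a reverse lookup (RDKit's Morgan fingerprint functions can fill a similar `bitInfo` dictionary).

```python
import zlib
from collections import defaultdict

def hashed_fingerprint(features, n_bits=16):
    """Fold feature strings into n_bits positions and remember
    which feature(s) switched each bit on."""
    on_bits = set()
    bit_info = defaultdict(set)
    for feature in features:
        # CRC32 is used only because it is deterministic across runs
        bit = zlib.crc32(feature.encode("utf-8")) % n_bits
        on_bits.add(bit)
        bit_info[bit].add(feature)
    return on_bits, dict(bit_info)

# Hypothetical fragments; a very short fingerprint forces collisions
features = ["c1ccccc1", "C(=O)O", "C-N", "C-Cl", "C=O", "N-H"]
on_bits, bit_info = hashed_fingerprint(features, n_bits=4)

for bit in sorted(bit_info):
    print(bit, sorted(bit_info[bit]))
```

With six features folded into four bits, at least one bit must stand for more than one fragment, which is exactly why the weight of a single hashed bit can be ambiguous unless the lookup is recorded when the fingerprint is generated.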

Is this a naive Bayes model or have you built a Bayesian network (BN)?

Do you mean principal component analysis? In principle, yes, since it would give you the combination of bits that influences the score most. However, if you cannot identify the structural elements that are encoded by these bits, what would you gain?

I don't have a "Calculate Distance Matrix" node. You mean the "Distance Matrix Calculate" ;-)

Oh, I see, there is a distance type that can be selected: Tanimoto, Dice, etc.

Nice, I overlooked this option!



I would think that Primary Principal Component Analysis would be useful in this case. Correct me if I am wrong, though.

Interesting point. PCA is a statistical approach, so you do cross from actual features into the land of meaningless constructions.

You are basically reverse engineering the fingerprint, right? The thing is that the individual features represented by a fingerprint do not per se have to map to something that a human finds logical, nor is it certain that they have any predictive capability. And with PCA you are going to lose the connection with the individual features completely.

On the other hand, with a distance matrix on the fingerprint, the distance will tell you the dissimilarity, but it will not tell you anything about how important the features are that are similar or dissimilar.

Still, an interesting exercise.


I guess that the core of the problem is in fact the lookup from a bit in the fingerprint back to the individual features. Right?