I’ve been trying to search, both within KNIME and in general, for methods to combine ECFP with chemical descriptors which RDKit can calculate. I came across a post from way back by @beginner asking the same thing, but not much else. I’d like to see if some global molecular properties can enhance the performance of XGBoost trees, Naïve Bayes, Keras deep learning, etc. Any experience or reference you could provide would be very much appreciated. Maybe I’m not finding it because it can’t be done, or perhaps because everyone else knows it can be done and it’s common sense.
I don’t really understand what you’re asking. What exactly are you trying to accomplish?
The choice of descriptors depends greatly on the property you’re trying to predict.
The general approach is to start off with all the descriptors that make sense chemically, and then whittle down the selection using methods such as variance and correlation filters.
Thanks for the response. Apologies for not being clearer. The prediction endpoint would mostly be biological activity. To be more specific, I’m wondering about a method to combine fingerprint bits (1’s and 0’s) with other physical property descriptors such as MW, logP, TPSA, etc. I suspect there are issues because they are different types of input data.
I don’t see there being any issues. There are tons of QSAR prediction papers describing many different modeling techniques which use all manner of descriptors and fingerprints.
Over the next couple months I plan on using a Random Forest to predict pEC50 values for iterative screening purposes, using chiral ECFP6 fingerprints along with descriptors such as E-State Keys, most basic pKa, MW, logP, molecular surface area, etc.
A test run using literature activity data performed quite well.
If you have a reference handy, that would be very interesting. All I seem to find are papers comparing fingerprints to descriptor sets, not blending them together. If you do add a bunch of different descriptors, do they need to be scaled or normalized, or can they be used raw? Some of the data are continuous and others discrete. Also, would you be willing to share whether there are workflows to help select descriptors?
In this paper the authors use a Random Forest in Pipeline Pilot to predict activity:
Regression models are built using the Random Forest method implemented in R accessed via the Pipeline Pilot ‘Learn R Forest Model’ component.(43) The default settings of this component are used except for the molecular descriptors: ALogP, Molecular_Weight, Num_H_Donors, Num_H_Acceptors, Num_RotatableBonds, Molecular_SurfaceArea, Molecular_PolarSurfaceArea, which are used together with ECFP_6 fingerprints. The fingerprints are not folded and are used as an array of counts.
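For anyone wanting to try something similar outside Pipeline Pilot, here is a rough sketch of that descriptor set in RDKit. Note these are only approximate equivalents (names and exact definitions differ between toolkits: RDKit’s Crippen logP is not the same as ALogP, and Labute ASA is used here as a stand-in for Molecular_SurfaceArea):

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski

def pp_like_descriptors(mol):
    """Approximate RDKit equivalents of the Pipeline Pilot descriptors above."""
    return {
        "ALogP": Crippen.MolLogP(mol),                        # Crippen logP, not true ALogP
        "Molecular_Weight": Descriptors.MolWt(mol),
        "Num_H_Donors": Lipinski.NumHDonors(mol),
        "Num_H_Acceptors": Lipinski.NumHAcceptors(mol),
        "Num_RotatableBonds": Lipinski.NumRotatableBonds(mol),
        "Molecular_SurfaceArea": Descriptors.LabuteASA(mol),  # Labute ASA as a stand-in
        "Molecular_PolarSurfaceArea": Descriptors.TPSA(mol),
    }

d = pp_like_descriptors(Chem.MolFromSmiles("c1ccccc1O"))  # phenol
```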
In this paper the authors compare a range of machine learning algorithms for property prediction:
Compounds were represented using three different methods: extended connectivity fingerprints, chemical/physical descriptors, and molecular graphs. The combination of fingerprints and chemical/physical descriptors were used to train all methods except for the graph convolutional networks that used the molecular graphs. The fingerprints were 1024-bit Morgan fingerprints with radius 2 from RDKit. Ninety-seven chemical/physical descriptors were calculated with the RDKit as well, and these descriptors have previously been described and used with good results. Molecular graphs were constructed as PyTorch tensors. Each node (representing an atom) had 75 features.
In general, feature selection is a basic part of the machine learning workflow. You can probably find machine learning optimization/validation workflows on the Examples server or on the KNIME hub and then modify them to suit your purposes. I don’t know if there are any workflows for selecting chemical descriptors.
I don’t know which post you are referencing but if it is very old, it’s possible it was before the “Expand bitvector” node was available. Now you can simply split the fingerprint into separate columns and combine them with descriptors.
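Outside KNIME, the same split-and-combine step can be sketched in Python with RDKit. This is just a minimal illustration; the `featurize` name, the 1024-bit/radius-2 settings, and the three descriptors are arbitrary choices:

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors

def featurize(smiles):
    """Return one flat vector: 1024 Morgan fingerprint bits + 3 global descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
    bits = np.zeros(1024)
    DataStructs.ConvertToNumpyArray(fp, bits)  # expand the bit vector into 0/1 columns
    desc = np.array([Descriptors.MolWt(mol),    # MW
                     Descriptors.MolLogP(mol),  # logP
                     Descriptors.TPSA(mol)])    # TPSA
    return np.concatenate([bits, desc])         # bits and descriptors in one row

x = featurize("CCO")  # ethanol -> vector of length 1027
```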
Scaling/normalizing is needed if you do not use a tree-based algorithm. So for, say, logistic regression, SVM, or neural nets, yes, you need to normalize. For XGBoost or random forest, normalizing is not needed.
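For the non-tree case, a minimal scikit-learn sketch of that split treatment (assuming you know which columns are descriptors; here the last two of a toy matrix): scale the continuous descriptors, pass the 0/1 bits through untouched.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: columns 0-3 are fingerprint bits, columns 4-5 are MW and logP
X = np.array([[1, 0, 1, 0, 300.2, 2.1],
              [0, 1, 1, 0, 151.9, 0.4],
              [1, 1, 0, 1, 420.5, 3.7]])

# Standardize only the descriptor columns; leave the binary bits as-is.
# Note: ColumnTransformer puts the transformed columns first, then the remainder.
ct = ColumnTransformer([("desc", StandardScaler(), [4, 5])],
                       remainder="passthrough")
Xs = ct.fit_transform(X)
```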
As for what to use, it depends on your goals. Fingerprints will tend to lead to a model that marks only very similar compounds as active, e.g. scaffold hopping or any other surprises are rather unlikely to occur. This is more likely with descriptors, as they rely less on the exact “graph structure” of the molecule.
For me, fingerprints have the issue of being very high-dimensional, with each bit itself containing little information. OK, if you use Dragon you are also high-dimensional, but many of those descriptors are highly correlated, and applying appropriate filters should reduce them to a couple of hundred at most.
As for fingerprint bit columns, you can apply a correlation filter against the other bits (this usually doesn’t reduce features much unless you set the threshold low). Alternatively, you can set a threshold on the fraction of samples that have a specific bit set (or not set): if a bit is set only a couple of times (or unset only a couple of times), it contains little information.
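That frequency filter can be sketched with scikit-learn’s `VarianceThreshold`: for a 0/1 bit with set-fraction p, the variance is p(1-p), so requiring p between 5% and 95% is the same as requiring variance above 0.05 × 0.95 (the 5% cutoff here is just an illustrative choice):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
# Fake fingerprint matrix: 100 compounds x 8 bits
X = rng.integers(0, 2, size=(100, 8))
X[:, 0] = 0
X[:2, 0] = 1   # bit 0 set in only 2% of samples
X[:, 7] = 1    # bit 7 set in every sample

# Keep only bits whose set-fraction lies in [0.05, 0.95]
sel = VarianceThreshold(threshold=0.05 * 0.95)
Xf = sel.fit_transform(X)  # bits 0 and 7 are dropped
```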