Hi, I am running a linear regression for several chemical compounds. I have included the RDKit chemical descriptors. I am looking for a way to incorporate chemical fingerprints of the molecules into this linear regression. I was just wondering if anyone had some advice on this. Thank you in advance!!
Don’t know how well it will work, but you could try using the Fingerprint Expander node from the Erlwood community nodes. Then use the resulting binary columns in the linear regression model.
The Expand Bit Vector node from KNIME may do it too.
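For reference, here is a minimal pure-Python sketch of what these expansion nodes do: each bit of the fingerprint becomes its own binary column. The toy bit string stands in for a real RDKit fingerprint (e.g. a Morgan fingerprint), which in practice would come from the RDKit or KNIME nodes.

```python
# Sketch: expand a fingerprint bitvector into individual binary features,
# mimicking the Fingerprint Expander / Expand Bit Vector nodes.

def expand_bitvector(bitstring):
    """Turn a bit string like '1010' into a list of 0/1 integer features."""
    return [int(b) for b in bitstring]

fp = "10110010"  # toy 8-bit fingerprint; real ones are 1024 or 2048 bits
features = expand_bitvector(fp)
print(features)  # one binary column per bit
```

The resulting columns can then be appended to the descriptor table and used as ordinary features in the learner node.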
The Expand Bit Vector node works just fine, and in general one can build models with expanded fingerprints and descriptors, BUT of course care needs to be taken, especially with any algorithm that depends on the “scale” of the features.
My questions to Haseeb23 would hence be:
- Did you remove correlated features before creating your model?
- Did you normalize your features before creating the model?
I would look at this first and, if the answer to either of the two questions is “no”, see how it affects the outcome.
If the model still isn’t good enough, before adding fingerprints I would simply try a non-linear model; Random Forest and XGBoost would be my choices. And if this still doesn’t work, then maybe I would try adding more features.
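The two checks above can be sketched in a few lines of pure Python, as a stand-in for KNIME’s Linear Correlation / Correlation Filter and Normalizer nodes. The 0.95 threshold and the toy columns are purely illustrative:

```python
# Sketch: drop highly correlated features, then min-max normalize the rest.

def pearson(x, y):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def drop_correlated(columns, threshold=0.95):
    """Keep the first of each pair of columns correlated above threshold."""
    kept = []
    for col in columns:
        if all(abs(pearson(col, k)) < threshold for k in kept):
            kept.append(col)
    return kept

def min_max(col):
    """Rescale a column to [0, 1]."""
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in col]

# Toy data: the second column is an exact multiple of the first, so it is dropped.
cols = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [5.0, 1.0, 4.0]]
kept = [min_max(c) for c in drop_correlated(cols)]
print(kept)
```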
The thing is, depending on your structures, you will need to generate 1024-bit or even 2048-bit fingerprints, meaning that many additional features. You will need to clean them fairly aggressively first so as not to end up with too many features.
Normalization of descriptors is not necessary when using XGBoost (or other tree-based algorithms for that matter):
I have tried it with and without min-max normalization of the RDKit descriptors and it does not make a difference; the resulting model statistics were identical. So you can combine, e.g., RDKit descriptors and bit-expanded fingerprints to describe your chemical structures.
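A small sketch of why tree-based models are insensitive to scaling: a split threshold depends only on the *ordering* of a feature, and min-max scaling is monotonic, so the best split partitions the samples identically. The toy regression stump below is illustrative only, not how XGBoost is actually implemented:

```python
# Sketch: the best single split (by squared error) yields the same sample
# partition before and after min-max scaling of the feature.

def best_split(xs, ys):
    """Return the set of sample indices on the left of the best split."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best, best_left = float("inf"), None
    for cut in range(1, len(xs)):
        left = [ys[i] for i in order[:cut]]
        right = [ys[i] for i in order[cut:]]
        err = sum((v - sum(left) / len(left)) ** 2 for v in left) \
            + sum((v - sum(right) / len(right)) ** 2 for v in right)
        if err < best:
            best, best_left = err, set(order[:cut])
    return best_left

x = [10.0, 200.0, 35.0, 400.0]   # toy descriptor values
y = [1.0, 5.0, 1.2, 5.3]         # toy targets
x_scaled = [(v - min(x)) / (max(x) - min(x)) for v in x]
print(best_split(x, y) == best_split(x_scaled, y))  # True: same partition
```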
True for XGBoost, but not for linear regression. Also, when filtering on low variance, normalization is needed beforehand.
Plus, correlated features can negatively impact tree-based algorithms. They need to be removed beforehand.
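The low-variance point can be sketched like this: variance is scale-dependent, so without normalization an informative small-scale feature can be dropped while a large-scale one survives. The feature values and thresholds below are purely illustrative:

```python
# Sketch: a low-variance filter applied to raw vs. min-max-normalized columns.

def variance(col):
    m = sum(col) / len(col)
    return sum((v - m) ** 2 for v in col) / len(col)

def min_max(col):
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in col]

mw = [180.1, 350.4, 512.7]   # molecular weight: large scale
logp = [1.2, 1.3, 1.9]       # logP: small scale but still informative

# On raw values, logP's variance falls below the (illustrative) threshold.
raw_kept = [name for name, col in [("mw", mw), ("logp", logp)]
            if variance(col) > 1.0]
# After normalization, both features survive the filter.
norm_kept = [name for name, col in [("mw", mw), ("logp", logp)]
             if variance(min_max(col)) > 0.05]
print(raw_kept, norm_kept)
```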
I removed correlated features and normalized my features before creating the model. I tried the model with and without normalization; there is no difference in r^2. The differences only arise in the errors, since the values I’m using without normalization are much larger.
I have a 1024-bit fingerprint. I initially summed it into a single feature, but this does not work well, since different structures ended up with the same sum. I was thinking of dividing the sum into 32 sums of 32 bits each. What do you think of this approach?
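For what it’s worth, the chunk-sum idea can be sketched like this (pure Python, toy bit positions). It is less lossy than a single total, though bits colliding within the same chunk still become indistinguishable:

```python
# Sketch: fold a 1024-bit fingerprint into 32 chunk sums of 32 bits each.

def chunk_sums(bits, n_chunks=32):
    """Sum consecutive equal-sized chunks of a bit list."""
    size = len(bits) // n_chunks
    return [sum(bits[i * size:(i + 1) * size]) for i in range(n_chunks)]

fp = [0] * 1024
for i in (3, 40, 41, 1000):   # toy "on" bits
    fp[i] = 1

sums = chunk_sums(fp)
print(len(sums), sums[0], sums[1])  # 32 1 2
```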
You could also try standardization. In KNIME it’s a bit confusing, because both standardization and normalization are done with the Normalizer node. Of course the errors will be different, but so will the coefficients of the linear regression. If you have features on very different scales, the feature with the largest values will appear to have the biggest impact, i.e. if you don’t normalize/standardize, you can’t compare the coefficients or make any deductions from them. (That is, if interpretability is the goal; if it isn’t, I probably wouldn’t even bother with linear regression.)
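To make the distinction concrete, here are the two rescalings side by side (both live in KNIME’s Normalizer node); the molecular-weight values are illustrative:

```python
# Sketch: min-max normalization maps a column to [0, 1];
# z-score standardization gives it mean 0 and standard deviation 1.

def min_max(col):
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]

def standardize(col):
    n = len(col)
    m = sum(col) / n
    sd = (sum((v - m) ** 2 for v in col) / n) ** 0.5
    return [(v - m) / sd for v in col]

mw = [180.0, 300.0, 420.0]
print(min_max(mw))      # [0.0, 0.5, 1.0]
print(standardize(mw))  # mean 0, standard deviation 1
```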
With the fingerprint you can either use it directly in the Tree Ensemble or Random Forest learner, or split it up and use each bit as a separate feature. Or you can limit the number of bits to what seems more suitable, albeit obviously losing some information.
Still, what matters is your goal and the data you have. Fingerprints are more bound to specific structures and make it very unlikely that you will get a hit in a virtual screening experiment that is not very similar to the existing actives (e.g. unlikely to find novel active scaffolds). If that is the goal, better to use descriptors.
I noticed that the tree ensemble has an option to use the entire fingerprint rather than splitting it into a bitvector. However, is there a way to incorporate chemical descriptors as well? Is the bitvector approach the only way to combine fingerprints and descriptors? Also, I agree with what you say regarding normalization. I have not attempted standardization, but I’ll give it a try. Normalization was working well for me in helping to compare the errors across different linear regressions.
I don’t see another approach, but my previous comment applies. Does it really make sense to mix them?