I am trying to generate a set of 3D-optimized molecular descriptors for a series of compounds that includes several stereoisomers. One of them, pentagalloylglucose (PGG), C40H30O26 (MW 926.7), fails to make it through the RDKit Optimize Geometry node, which returns the error "Molecule has no coordinates. Creating empty output". The same happens with several other molecules in the series, in fact the heaviest ones, all of which are in the same size range as PGG.
Is there a limit on the size of compounds this node can handle, and is there any way to work around it (I suspect I am not that far from the actual threshold)?
- The Optimize Geometry node requires a conformer in order to do its work. If the molecules passed in do not have conformers, it will try to generate one using a set of default parameters (which you have no control over). If this fails, you'll get the message you see.
I think it's generally a better idea to be explicit and generate the coordinates first using the RDKit Add Conformers node.
There isn't a concrete limitation on the size of molecule supported, but it can get to be difficult to generate coordinates for larger molecules, particularly if they contain specified stereo-centers.
There are a couple of parameters that can be provided to the conformer generator to give it a better chance of succeeding with large structures, but unfortunately the only one that's currently available in the KNIME node is "Use random coordinates as a starting point instead of distance geometry" (it's in the "advanced" tab of the RDKit Add Conformers node configuration dialog). It would be worth giving this one a try to see if it helps in your case.
I'll take a look into exposing additional parameters for the conformer generator in a future release.
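For anyone who wants to try the same idea outside KNIME, here is a minimal sketch using the RDKit Python API: embed one conformer explicitly, fall back to random starting coordinates if the default embedding fails, then optimize. The function name `embed_and_optimize`, the fixed seed, the iteration limit and the choice of MMFF94 with a UFF fallback are my assumptions for illustration; they are not necessarily what the KNIME nodes do internally.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def embed_and_optimize(smiles, seed=42, max_iters=2000):
    """Return a molecule with one optimized 3D conformer, or None on failure.

    Hypothetical helper for illustration; the defaults are assumptions.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = Chem.AddHs(mol)  # explicit hydrogens before 3D embedding

    # First attempt: default embedding (distance geometry).
    conf_id = AllChem.EmbedMolecule(mol, randomSeed=seed)
    if conf_id == -1:
        # Fallback: start from random coordinates, which can help
        # with large or stereochemically crowded molecules.
        conf_id = AllChem.EmbedMolecule(mol, useRandomCoords=True, randomSeed=seed)
    if conf_id == -1:
        return None  # no coordinates could be generated

    # Optimize with MMFF94 if parameters exist, otherwise with UFF.
    if AllChem.MMFFHasAllMoleculeParams(mol):
        AllChem.MMFFOptimizeMolecule(mol, maxIters=max_iters)
    else:
        AllChem.UFFOptimizeMolecule(mol, maxIters=max_iters)
    return mol


mol = embed_and_optimize("CCO")  # ethanol as a small smoke test
```

Checking the return value of `EmbedMolecule` (it is -1 on failure) makes the "no coordinates" case explicit instead of leaving it to surface later in the optimization step.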
Thank you for the follow-up. I have tried that and, indeed, I can now generate a geometry for all compounds. Sadly, this yields only non-predictive models, but at this point that has more to do with my data than with the KNIME nodes. It may nevertheless come in handy again in future projects.
Your suggested workflow will certainly not give anything reasonable if you use it for model building. You are calculating one set of coordinates and then optimizing it, which means you get more or less one random conformer per molecule. And then you want to build a model on that? In one case you catch something close to the minimum, in another the conformer is more than 20 kcal/mol above it. You're comparing apples to oranges.
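To put a number on the 20 kcal/mol remark, here is a back-of-the-envelope sketch of the Boltzmann factor exp(-dE / RT), which gives the population of a conformer relative to the minimum. The temperature of 298.15 K and the function name `relative_population` are assumptions for illustration.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def relative_population(delta_e_kcal, temperature=298.15):
    """Boltzmann factor of a conformer delta_e_kcal above the minimum."""
    return math.exp(-delta_e_kcal / (R_KCAL * temperature))


near_minimum = relative_population(0.5)   # conformer close to the minimum
far_above = relative_population(20.0)     # conformer 20 kcal/mol higher
```

The second value is many orders of magnitude smaller than the first, so descriptors computed on such a conformer describe a geometry the molecule essentially never adopts.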
You will certainly need multiple conformers for each molecule and go from there. But that is ultimately the problem with 3D descriptors: they are different for each conformer, so you have to decide how to build an ML model from them. Note: I do not have an answer as to what is correct, just possible options, none of which are very nice. The most reasonable, IMHO: simply use the conformers as data augmentation, i.e. one row per conformer instead of one per molecule. (You should probably apply an energy cut-off first.)
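A rough sketch of the "one row per conformer with an energy cut-off" idea, in plain Python. It assumes you already have, for each conformer, a molecule ID, an energy in kcal/mol, a descriptor vector and the molecule's label; the field names, the 3 kcal/mol window and the toy numbers are all made up for illustration.

```python
def conformer_rows(conformers, window=3.0):
    """Keep conformers within `window` kcal/mol of each molecule's minimum.

    Returns one training row per surviving conformer.
    """
    # Lowest energy seen for each molecule.
    e_min = {}
    for c in conformers:
        mol_id = c["mol_id"]
        e_min[mol_id] = min(c["energy"], e_min.get(mol_id, float("inf")))

    rows = []
    for c in conformers:
        rel = c["energy"] - e_min[c["mol_id"]]
        if rel <= window:
            rows.append({
                "mol_id": c["mol_id"],  # keep the ID for grouped splits
                "rel_energy": rel,
                "descriptors": c["descriptors"],
                "label": c["label"],
            })
    return rows


# Toy data: molecule A has one high-energy conformer that gets dropped.
conformers = [
    {"mol_id": "A", "energy": 10.0, "descriptors": [0.1, 1.2], "label": 1},
    {"mol_id": "A", "energy": 11.5, "descriptors": [0.2, 1.1], "label": 1},
    {"mol_id": "A", "energy": 31.0, "descriptors": [0.9, 0.3], "label": 1},
    {"mol_id": "B", "energy": -4.0, "descriptors": [0.5, 0.7], "label": 0},
    {"mol_id": "B", "energy": -3.5, "descriptors": [0.6, 0.6], "label": 0},
]
rows = conformer_rows(conformers)
```

Keeping `mol_id` in every row matters: when splitting into training and test sets, all conformers of a molecule should end up on the same side, otherwise the model is evaluated on molecules it has effectively already seen.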
Still, given the huge added computational effort, there will mostly be no benefit. You will have to try it out yourself, however. (On top of that, compared to neural networks, this additional computational effort also applies at prediction time: you need to generate conformers for each molecule you want to predict, predict each of them, and then add extra logic to decide whether the molecule is active or not.)
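The "additional logic" at prediction time could look something like the sketch below, which collapses per-conformer predicted probabilities into one call per molecule. Both rules shown here ("max": active if any conformer looks active; "mean": average over conformers) and the 0.5 threshold are assumptions for illustration, not a recommendation of which one is correct.

```python
from collections import defaultdict


def aggregate_predictions(conformer_preds, rule="max", threshold=0.5):
    """Map (mol_id, probability) pairs to {mol_id: (score, is_active)}."""
    if rule not in ("max", "mean"):
        raise ValueError("rule must be 'max' or 'mean'")

    # Group the per-conformer probabilities by molecule.
    by_mol = defaultdict(list)
    for mol_id, prob in conformer_preds:
        by_mol[mol_id].append(prob)

    result = {}
    for mol_id, probs in by_mol.items():
        score = max(probs) if rule == "max" else sum(probs) / len(probs)
        result[mol_id] = (score, score >= threshold)
    return result


# Toy per-conformer predictions for two molecules.
preds = [("A", 0.2), ("A", 0.7), ("B", 0.1), ("B", 0.3)]
by_max = aggregate_predictions(preds, rule="max")
by_mean = aggregate_predictions(preds, rule="mean")
```

With these toy numbers the two rules disagree on molecule A (active under "max", inactive under "mean"), which is exactly the kind of decision you would have to make and validate yourself.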