Having some truly weird issues with the calculation of PMI-derived properties.

Hello everyone,
I have been trying to build a workflow that produces ternary 3D-dimensionality plots from SDF files, and I’ve started running into problems as the library size increases.

Here I have attached my debugging workflow, where I try different methods for generating 3D coordinates, along with the errors that pop up for each of them. In this case I used a 70k-compound library, and one of the nodes actually worked. But when I tried it on the library I actually need, which has 100k compounds (and I also need this to work for a 200k one later), the same “Execute failed: no match found” error came up. There are also some parsing errors during 3D coordinate generation, which I do not know how to fix either.

I would be very grateful for any tips on what might be going wrong, and for any advice on how to make this workflow better.

Thanks!

Workflow link: PMI_3d.knwf - Google Drive

For the overall approach, at the very minimum, you’d need to optimize the geometry of the molecules (with explicit hydrogens), then calculate PMI properties. I’ve also seen people generate multiple conformers for each molecule within a certain energy range, then either calculate and plot PMI values for all conformers, or calculate and plot PMI values for the lowest-energy conformer only. There’s also the choice of whether to ionize the molecules at a given pH before the PMI calculations, but that’s harder to do in KNIME.
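To make the "calculate PMI properties" step concrete, here is a minimal sketch in plain Python of how the two normalized ratios (npr1 = I1/I3, npr2 = I2/I3) follow from a set of 3D coordinates and atomic masses. This is not what the KNIME node does internally; the function names are made up for illustration, and the closed-form eigenvalue solver is just one convenient way to avoid a dependency.

```python
import math

def inertia_tensor(coords, masses):
    """Mass-weighted inertia tensor about the center of mass."""
    total = sum(masses)
    center = [sum(m * c[i] for m, c in zip(masses, coords)) / total for i in range(3)]
    t = [[0.0] * 3 for _ in range(3)]
    for m, c in zip(masses, coords):
        x, y, z = (c[i] - center[i] for i in range(3))
        t[0][0] += m * (y * y + z * z)
        t[1][1] += m * (x * x + z * z)
        t[2][2] += m * (x * x + y * y)
        t[0][1] -= m * x * y
        t[0][2] -= m * x * z
        t[1][2] -= m * y * z
    t[1][0], t[2][0], t[2][1] = t[0][1], t[0][2], t[1][2]
    return t

def symmetric_eigenvalues(a):
    """Eigenvalues of a symmetric 3x3 matrix in ascending order
    (closed-form trigonometric solution; slightly less accurate when
    two eigenvalues coincide, which is fine for plotting)."""
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    if p < 1e-12 * max(1.0, abs(q)):
        return [q, q, q]  # all three moments are equal (sphere-like)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
           - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
           + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    largest = q + 2.0 * p * math.cos(phi)
    smallest = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    middle = 3.0 * q - largest - smallest
    return [smallest, middle, largest]

def npr(coords, masses):
    """Return (npr1, npr2) = (I1/I3, I2/I3) with I1 <= I2 <= I3."""
    i1, i2, i3 = symmetric_eigenvalues(inertia_tensor(coords, masses))
    if i3 <= 0.0:
        raise ValueError("need at least two atoms at distinct positions")
    return i1 / i3, i2 / i3
```

The three corners of the ternary plot fall out of this directly: a rod gives roughly (0, 1), a flat disc (0.5, 0.5), and a sphere-like shape (1, 1). That is also why a flat, unoptimized 2D structure ends up on the rod-disc edge no matter what the molecule really looks like.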

Some observations:

  • You cannot simply generate 3D coordinates from an imported 2D structure and expect to get meaningful 3D structures.

  • The way you’ve set up the RDKit Add Conformers is a bit weird, but it doesn’t matter since generating and using a single conformer based on a RowID reference isn’t going to be useful.

  • I noticed that your SD file has salts in it. What would happen when you try to calculate PMIs of a salt? What would the result even mean?

  • I’m confused as to why you are calculating npr1 and npr2 in separate nodes, just to concatenate the outputs immediately after.
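On the salt point, a rough illustration of what desalting means at the SMILES level: split on "." and keep the largest fragment. The sketch below is plain Python with a crude atom-counting regex, so treat it as a heuristic for illustration only; the function names are hypothetical, and a proper toolkit or a dedicated desalting node should be used for real data.

```python
import re

# Crude atom tokenizer: bracket atoms, two-letter halogens, then
# single-letter organic-subset atoms (aliphatic and aromatic).
ATOM_TOKEN = re.compile(r"\[[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def atom_count_estimate(fragment):
    """Approximate number of atoms written out in a SMILES fragment."""
    return len(ATOM_TOKEN.findall(fragment))

def largest_fragment(smiles):
    """Keep the fragment with the most atoms; ties keep the first one."""
    fragments = [f for f in smiles.split(".") if f]
    if not fragments:
        return smiles
    return max(fragments, key=atom_count_estimate)
```

For example, `largest_fragment("CC(=O)[O-].[Na+]")` keeps the acetate and drops the sodium counter-ion. Note that this does not neutralize the remaining charge, which is a separate clean-up decision.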

Now for the troubleshooting.

Have you actually looked at the molecule(s) that are causing the parsing errors in the RDKit Generate Coords node? If not, why not? If so, what did you find? When you find it/them, do you really want the offending molecule(s) in your analysis?

  • I found a boron compound that was giving trouble. Some of the forcefields/definitions used by RDKit (MMFF) have a difficult time with some boron valencies.
  • There’s a Fe-sandwich molecule whose geometry will be hard to optimize, and which I wouldn’t want in a library like this anyway (assuming this is a drug discovery project).
  • There’s an azetidinone that failed geometry optimization. It’s not clear to me why that is.
  • 888 of the molecules are in the V3000 SDF format, which I’ve found the PMI node sometimes has a hard time with. When I converted the entire list to SMILES and then generated 3D structures from those, the PMI node worked fine.

I think you need to better familiarize yourself with what’s in the SD file. Sure it’s tedious work, but when errors arise the procedure is to figure out which molecules are causing them, determine what they have in common, then preprocess the input file to resolve the errors. This can take the form of filtering out the offenders, or doing some other manipulation that deals with the specific issues. There really are no shortcuts.
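As a starting point for that kind of triage outside KNIME, here is a small sketch in plain Python that walks an SD file's text, record by record, and flags the categories that came up above: V3000 blocks, records whose header cannot be parsed, and elements on a watch list. The function names and the default watch list (B, Fe) are assumptions for illustration; it reads element symbols from the fixed V2000 columns or the V3000 atom lines and does no chemistry at all.

```python
def split_records(sdf_text):
    """Yield each SD record as a list of lines ('$$$$' ends a record)."""
    record = []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)
    if any(line.strip() for line in record):
        yield record  # tolerate a missing final '$$$$'

def summarize(record):
    """Extract title, V3000 flag and element symbols from one record."""
    title = record[0].strip() if record else ""
    counts = record[3] if len(record) > 3 else ""
    is_v3000 = "V3000" in counts
    elements, parse_ok = set(), True
    try:
        if is_v3000:
            in_atoms = False
            for line in record[4:]:
                if line.startswith("M  V30 BEGIN ATOM"):
                    in_atoms = True
                elif line.startswith("M  V30 END ATOM"):
                    break
                elif in_atoms:
                    elements.add(line.split()[3])  # M V30 idx SYMBOL x y z
        else:
            n_atoms = int(counts[:3])  # V2000: atom count in columns 1-3
            for line in record[4:4 + n_atoms]:
                symbol = line[31:34].strip()  # V2000: symbol in columns 32-34
                if symbol:
                    elements.add(symbol)
    except (ValueError, IndexError):
        parse_ok = False
    return {"title": title, "is_v3000": is_v3000,
            "elements": elements, "parse_ok": parse_ok}

def flag_records(sdf_text, watch=frozenset({"B", "Fe"})):
    """Return (index, title, reasons) for every record worth a closer look."""
    flagged = []
    for index, record in enumerate(split_records(sdf_text)):
        info = summarize(record)
        reasons = []
        if not info["parse_ok"]:
            reasons.append("unparseable header or atom block")
        if info["is_v3000"]:
            reasons.append("V3000")
        hits = sorted(info["elements"] & watch)
        if hits:
            reasons.append("contains " + ", ".join(hits))
        if reasons:
            flagged.append((index, info["title"], reasons))
    return flagged
```

The zero-based index in each result corresponds to the record's position in the file, which makes it easier to match a flagged molecule back to the row that a node complains about.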

You can download the workflow I used here:

You may need to make modifications depending on what the full SD file looks like.

Thanks for this detailed answer. At the moment, I will add only a couple of extra observations:

  • You can use the Templated Conformer Generation node from our plugin (without a template structure) to add multiple, RMSD-filtered conformers within an energy window. It also has options to ensure H’s are added, to use the UFF force field if MMFF fails, and to remove H’s at the end if that is what you want to do.
  • Clean-up - if your structures are in SMILES format, you can do a lot of basic clean-up very quickly with the Vernalis ‘Speedy SMILES’ nodes - this removes the overhead of converting to a chemical toolkit format until later. Have a look through

for the available options. There is also an example workflow here:

showing a basic clean-up. It doesn’t use all of the nodes (not all were publicly available at the time); a useful one that was not available, for example, is:

  • The PMI node should be OK with V3000 Mol inputs - if you are finding problems with it with those and can send an example which fails, that would be really useful!
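For the energy-window idea mentioned in the conformer discussion above, the selection logic itself is simple and can be sketched in a few lines of plain Python. This only illustrates the filtering step, not how any particular node implements it; the function name and the 3.0 kcal/mol default window are assumptions, and the energies are assumed to come from whatever force field was used for minimization.

```python
def within_energy_window(energies, window=3.0):
    """Keep conformers whose energy lies within `window` of the minimum.

    energies: dict mapping conformer id -> energy (same units as window).
    Returns (conformer id, relative energy) pairs, lowest energy first.
    """
    if not energies:
        return []
    lowest = min(energies.values())
    kept = [(conf_id, energy - lowest)
            for conf_id, energy in energies.items()
            if energy - lowest <= window]
    return sorted(kept, key=lambda pair: pair[1])
```

The first entry of the result is the lowest-energy conformer, so the same function covers both options: plot PMI values for every conformer that survives the window, or only for `result[0]`.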

Thanks

Steve

I’ve uploaded a workflow that has 2 examples that fail on my machine:

There’s something about how the SDF blocks were initially created that causes the error, but I haven’t been able to pin it down.

First of all, thank you so much for taking the time to write such a detailed answer!
To answer some of your questions:

  • I realize that I would need to perform some kind of geometry optimization after generating 3D coordinates, but at this stage I was just testing different nodes to see which ones work, which errors the others produce, and so on.

  • The way Add Conformers is set up in my workflow is meant to speed up the node’s execution, so I can debug a bit faster.

  • Your remark about salts is on point! I was under the impression that I had grabbed a file that had already been desalted, and it didn’t occur to me that that might not be the case.

  • For some reason I hadn’t realized that you can calculate multiple properties simultaneously in one instance of a node. Thanks for pointing that out!

  • I had tried to find the offending molecules, but I got confused about where the error message was pointing me, and gave up on that…

Thank you once again for your input; you have opened my eyes to a few possible sources of failure. I will study the workflows that have been attached and give an update on my progress shortly!
