Introducing a KNIME molecule type

There was some discussion about this at the KNIME UGM last week, and then Simon posted the following request to the RDKit, Indigo, and CDK forums:

With the RDKit, Indigo, and CDK chemistry nodes becoming more widely used and expanded, and often used together within a workflow, are there any plans to allow the Molecule to CDK node to accept Indigo and RDKit molecules, to save on the number of translator nodes required?

I think this would be a great benefit to KNIME users and could (perhaps) reduce the amount of duplication of work currently going on among the open-source node packages.

An outline of how this could work, as something to poke holes in (apologies in advance if my rough outline doesn't feel like Java... that's the Python/C++ programmer showing):

We create a type KnimeMolCell that supports getSmilesValue() and getSdfValue() as well as some new methods: hasSmilesValue(), hasSdfValue(), hasCustomValue(), getCustomValue() and setCustomValue() [probably need to refine the names a bit]. The custom value methods are used to return specialized molecule types like RDKit, Indigo, CDK, Maestro, etc. They each take an argument defining which custom type is of interest.

Nodes that use these cells can rely on them having at least a SMILES or SDF value to work with. They can then check to see if there is a more specialized format available and, if it's not there, add it. For example, an RDKit node would start by checking if there is an RDKit value available; if so, it would use that value. If not, it would create one based on the SMILES or SDF data and store it on the cell.

There would, obviously, be a strong constraint on nodes working with these types: modifying the SMILES or SDF once the node is created would not, under any circumstances, be allowed.

This would have the advantage that RDKit, Indigo, and CDK nodes could be chained without converter nodes in the middle, and without the nodes having to repeatedly re-process the molecules.
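As a rough illustration of the outline above, here is a minimal Java sketch of such a cell and of the "check, then create and store" step a toolkit node would perform. It is only a sketch under assumptions: the class KnimeMolCellSketch, the helper getOrCreate, the use of plain strings as toolkit keys, and the choice to prefer SDF over SMILES as the parsing source are all hypothetical and not existing KNIME API; only the method names listed in the proposal come from the post.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch of the proposed cell; none of this is existing KNIME API. */
final class KnimeMolCellSketch {

    /** Immutable SMILES and/or SDF plus toolkit-specific values keyed by toolkit name. */
    static final class KnimeMolCell {
        private final String smiles; // null if only SDF was supplied
        private final String sdf;    // null if only SMILES was supplied
        private final Map<String, Object> customValues = new ConcurrentHashMap<>();

        KnimeMolCell(String smiles, String sdf) {
            if (smiles == null && sdf == null) {
                throw new IllegalArgumentException("need at least a SMILES or an SDF value");
            }
            this.smiles = smiles;
            this.sdf = sdf;
        }

        boolean hasSmilesValue() { return smiles != null; }
        boolean hasSdfValue() { return sdf != null; }
        String getSmilesValue() { return smiles; }
        String getSdfValue() { return sdf; }

        boolean hasCustomValue(String toolkit) { return customValues.containsKey(toolkit); }
        Object getCustomValue(String toolkit) { return customValues.get(toolkit); }
        void setCustomValue(String toolkit, Object value) { customValues.put(toolkit, value); }
    }

    /** Reuse the stored toolkit value, or build it from SDF/SMILES and store it on the cell. */
    static Object getOrCreate(KnimeMolCell cell, String toolkit, Function<String, Object> parser) {
        if (!cell.hasCustomValue(toolkit)) {
            // Assumption: SDF is preferred as the source when both are present.
            String source = cell.hasSdfValue() ? cell.getSdfValue() : cell.getSmilesValue();
            cell.setCustomValue(toolkit, parser.apply(source));
        }
        return cell.getCustomValue(toolkit);
    }

    public static void main(String[] args) {
        KnimeMolCell cell = new KnimeMolCell("c1ccccc1", null);
        // Stand-in "parsers": a real node would call its toolkit here.
        Object first = getOrCreate(cell, "rdkit", s -> new StringBuilder("parsed:" + s));
        Object second = getOrCreate(cell, "rdkit", s -> new StringBuilder("parsed again:" + s));
        System.out.println(first == second); // prints true: the second call reuses the stored value
    }
}
```

Note that the SMILES and SDF fields are final and have no setters, which is one simple way to enforce the "never modify the SMILES or SDF" constraint.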

comments? thoughts?

-greg

This is an excellent idea and we should use the opportunity to develop a 'cross-toolkit' data type.

The plan proposed is very good. Having a data type that behaves as a molecule container would remove the need to convert molecules back and forth. However, in this case we need to strictly agree on how a new RDKit/Indigo/CDK molecule should be configured to start with, e.g. with respect to aromaticity perception. Otherwise it will get confusing for the user.

The obvious, but perhaps least desirable, case is to have three different entities stored in the container data type, all with their own configuration. E.g., a CDK molecule with explicit hydrogens and no aromaticity perceived, an Indigo molecule with aromaticity perceived but no hydrogens attached, etc.

In this case the user would need to monitor the state of the molecules and take the properties set into account when building a workflow.

Another point is visualization. Depending on the underlying libraries' input classes for SDF / SMILES conversion, the molecules might be slightly different in some cases. How can we render each molecule accurately without too much overhead? Some ideas: a global option in the preferences to set CDK molecules to be rendered by default; a grid visualization node for quick comparison of all defined molecule types; feature-specific visualisation nodes complementing the ChemAxon SDF renderer.

Stephan

I think it's unlikely that we'd ever reach agreement on a true cross-project molecule representation. I'm proposing that we keep it simple and use different entities stored in one container. 

The user is going to observe different behavior when using nodes from different packages, but that's no different from what they see right now.

With respect to renderers: I agree that there should be a global preference, like there is now for SDF and SMILES, to set the default renderer type for these molecules.

-greg

Hi,

I think this would be great if it becomes possible. Sooner rather than later. I would be very much prepared to feel some initial pain of having to reconfigure the nodes in my workflows when such a transfer to a standard happens, as in the end the benefit would be significant.

As time moves on, workflows become ever bigger and ever more complex, and the number of nodes being developed for RDKit/Indigo/CDK becomes larger and larger. Delaying a uniform molecule type is only likely to lead to more work for the developers to change over their nodes, more work for the users to modify their workflows, and likely a larger base of affected users too.

I really hope this can progress on to deliver a standard this year.

Many thanks for everyone's hard work with the community nodes.

Simon.

Hi all,

I've been asking for the molecule type for some years now. But maybe we have reached critical mass now. As discussed with some of you in Zuerich, I would like to get the CDK and Indigo nodes to implement toSdfValue() for their molecule types. This would enable subsequent nodes to work on their molecule cells. I don't really like the toSmilesValue idea, because many calculations need information about the molecule that is not contained in SMILES and would therefore give wrong results. However, in SDF format we cannot store partial charges, aromaticity and some other information. I think it would be worth thinking about a molecule supertype that can read the information from the existing cell types and has some flags to indicate which kinds of pre-treatment (aromaticity perception, hydrogens, ...) have been performed.

But in order to implement support for such a type, it has to be defined in the Base Chemistry Types Extension. For the renderer we should talk to ChemAxon or get a basic implementation included in the ChemTypes. The second option has the advantage that we can support extensions to our MoleculeCell type quickly.
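To make the idea of pre-treatment flags on a supertype more concrete, here is a small Java sketch. Everything in it is an assumption for illustration: the enum Pretreatment, its two constants, and the class FlaggedMolecule are hypothetical names, and the sketch only records which steps were performed without touching the structure itself.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch: a molecule value that records which pre-treatments were applied. */
final class PretreatmentFlagsSketch {

    /** Assumed flag names; the real list would have to be agreed between the toolkits. */
    enum Pretreatment { AROMATICITY_PERCEIVED, HYDROGENS_ADDED }

    static final class FlaggedMolecule {
        private final String sdf;
        private final EnumSet<Pretreatment> performed;

        FlaggedMolecule(String sdf, Set<Pretreatment> performed) {
            this.sdf = sdf;
            // Defensive copy so later changes to the caller's set do not leak in.
            this.performed = EnumSet.noneOf(Pretreatment.class);
            this.performed.addAll(performed);
        }

        String getSdf() { return sdf; }

        boolean has(Pretreatment p) { return performed.contains(p); }

        /** Returns a new value with one more flag set; the structure text is left as-is here. */
        FlaggedMolecule with(Pretreatment p) {
            EnumSet<Pretreatment> next = EnumSet.copyOf(performed);
            next.add(p);
            return new FlaggedMolecule(sdf, next);
        }
    }

    public static void main(String[] args) {
        FlaggedMolecule raw = new FlaggedMolecule("<SDF block>", Collections.emptySet());
        FlaggedMolecule aromatized = raw.with(Pretreatment.AROMATICITY_PERCEIVED);
        // A downstream node can check the flag instead of guessing the molecule's state.
        System.out.println(raw.has(Pretreatment.AROMATICITY_PERCEIVED));        // prints false
        System.out.println(aromatized.has(Pretreatment.AROMATICITY_PERCEIVED)); // prints true
    }
}
```

A node that needs perceived aromaticity could then test the flag first and only run its own perception step when the flag is missing.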

Guido

I fully agree that it would be great if the nodes from different toolkits could be directly connected. But I don't think that a new datatype is necessary, because everything that would be needed is already there. The datatype we use internally implements the SmilesValue, StringValue (returns the SMILES string), SdfValue and CmlValue interfaces that are already available in KNIME and is therefore able to provide the required formats. For instance, we can use our cells directly as an input for the "Molecule to RDKit" node, and apparently it does not recognize that it is a much more complex cell type than a simple string format.

Thus, I think it should be enough to implement at least the SmilesValue, StringValue and SdfValue interfaces and ensure that toolkit nodes can use a string representation as a fallback if the toolkit object type is not available. This approach would also avoid the need to think about rendering solutions, because these are already present for the different string value types. However, if the community agrees on a new datatype, I will try to adapt our nodes as best as possible to be fully compatible with it.
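The interface-based approach described above can be sketched in a few lines of Java. To keep the example compilable on its own, StringValue, SmilesValue and SdfValue are declared here as local stand-ins modelled on the KNIME interfaces of the same names; ToolkitMolCell and pickInput are hypothetical names, and the fallback order (SDF, then SMILES, then plain string) is an assumption.

```java
/** Hypothetical sketch of a toolkit cell exposing several string formats via interfaces. */
final class MultiValueCellSketch {

    // Local stand-ins so the sketch is self-contained; they are not the KNIME classes.
    interface StringValue { String getStringValue(); }
    interface SmilesValue { String getSmilesValue(); }
    interface SdfValue { String getSdfValue(); }

    /** A toolkit-internal cell that can also be read as SMILES, SDF or a plain string. */
    static final class ToolkitMolCell implements StringValue, SmilesValue, SdfValue {
        private final String smiles;
        private final String sdf;

        ToolkitMolCell(String smiles, String sdf) {
            this.smiles = smiles;
            this.sdf = sdf;
        }

        @Override public String getStringValue() { return smiles; } // string view returns the SMILES
        @Override public String getSmilesValue() { return smiles; }
        @Override public String getSdfValue() { return sdf; }
    }

    /** A consuming node takes the richest format it understands and falls back to plainer ones. */
    static String pickInput(Object cell) {
        if (cell instanceof SdfValue) {
            return "SDF:" + ((SdfValue) cell).getSdfValue();
        }
        if (cell instanceof SmilesValue) {
            return "SMILES:" + ((SmilesValue) cell).getSmilesValue();
        }
        if (cell instanceof StringValue) {
            return "STRING:" + ((StringValue) cell).getStringValue();
        }
        throw new IllegalArgumentException("unsupported cell type: " + cell.getClass().getName());
    }

    public static void main(String[] args) {
        Object toolkitCell = new ToolkitMolCell("CCO", "<SDF block for ethanol>");
        StringValue plain = () -> "CCO"; // a cell that only offers a plain string
        System.out.println(pickInput(toolkitCell)); // prints SDF:<SDF block for ethanol>
        System.out.println(pickInput(plain));       // prints STRING:CCO
    }
}
```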

Nikolas

I really like the idea of avoiding the many explicit type conversions that are needed today. On the other hand, I am also very happy about the fact that KNIME gives the user full control of what happens to his molecule and is one of the few tools on the market that are able to write an SDFile exactly as it was read originally. This can be critical in some applications and is possible only because KNIME does not parse and interpret a molecule by default. I would thus support Greg’s proposal and suggest staying away from trying to represent a molecule in some common form.

Assuming that also some of the commercial node vendors might use the new cell type, I could end up with (say) five or six different representations in my container if I mix nodes from many sources. For large datasets, this could lead to memory issues. What about a user-defined variable that limits the number of different representations the container would accept? If the limit is reached, the addition of a new representation would remove the oldest “specialized” entry in the container (FIFO). This way, each user could choose an optimal memory-speed tradeoff.
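One way to picture the FIFO limit suggested above is a map that drops its oldest entry once a configured size is exceeded. The sketch below uses the standard java.util.LinkedHashMap hook removeEldestEntry for this; the class name RepresentationCache, the string keys, and the idea of passing the limit to the constructor are assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: a per-cell store of toolkit representations with a FIFO size limit. */
final class BoundedRepresentationsSketch {

    static final class RepresentationCache extends LinkedHashMap<String, Object> {
        private static final long serialVersionUID = 1L;
        private final int maxEntries;

        RepresentationCache(int maxEntries) {
            super(16, 0.75f, false); // false = insertion order, so the eldest entry is the first added
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
            // Called by put() after insertion; returning true evicts the oldest representation.
            return size() > maxEntries;
        }
    }

    public static void main(String[] args) {
        RepresentationCache cache = new RepresentationCache(2); // user-defined limit
        cache.put("rdkit", "rdkit-object");
        cache.put("indigo", "indigo-object");
        cache.put("cdk", "cdk-object"); // exceeds the limit, so "rdkit" is dropped
        System.out.println(cache.keySet()); // prints [indigo, cdk]
    }
}
```

A node that later needs the evicted representation would simply rebuild it from the SMILES or SDF, which is the memory-speed tradeoff the limit is meant to expose.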

Nils

Nikolas, your approach is quite similar to the one Greg is suggesting. The difference is that you need to do the conversion from SDF (e.g.) to the internal format again and again if a chain of nodes from the same package is used, because the internal format is not preserved. With the new molecule type, nodes can (silently) "append" their internal representation to an existing cell and then use it directly in subsequent nodes. Of course, you really need to take care that the different representations in one cell do not get out of sync.

Nils,

As long as the nodes are implemented correctly, the amount of memory used wouldn't be any more than what it is now.

A simple illustration of why this is true: imagine a workflow like this in the current KNIME:

         Mol2RDKit ---> RDKitSubstructureSearch
      /
input
      \
         Mol2Indigo ---> IndigoSubstructureSearch

versus this with the molecule type (yes, it's a silly example):

input ---> RDKitSubstructureSearch ---> IndigoSubstructureSearch

In each example, you have one SDF, one RDKit molecule, and one Indigo molecule for each row of the input table, so the total memory usage is the same. There is a possible downside that the individual objects will be bigger, but I believe that Bernd said there's a way to still make this work efficiently based on the way KNIME deals with BLOB cells (we're getting into implementation details now).

-greg

Agreed. But in the first example, the user has full control and explicitly sees what is happening. For example, it is possible to remove a column again if I don't need it any more. In the second example, the table gets bigger and bigger somehow implicitly in the background. For example, if I use a Table Writer, the generated file will be huge and there is nothing I can do about it.

I see your approach as a caching mechanism that introduces a memory-time tradeoff and would like to be able to tune this as I need it. One could also think about a "Remove Representations"-node that works like a Column Filter.
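The "Remove Representations" idea could look roughly like the following Java sketch, which filters a cell's stored representations the way a Column Filter keeps selected columns. The method keepOnly and the map-of-strings model of a cell's representations are assumptions made for the example, not an existing node or API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of what a "Remove Representations" node would do per cell. */
final class RemoveRepresentationsSketch {

    /** Returns a copy of the representations that keeps only the named toolkits. */
    static Map<String, Object> keepOnly(Map<String, Object> representations, Set<String> keep) {
        Map<String, Object> filtered = new LinkedHashMap<>(representations);
        filtered.keySet().retainAll(keep); // drops every entry whose key is not in 'keep'
        return filtered;
    }

    public static void main(String[] args) {
        Map<String, Object> reps = new LinkedHashMap<>();
        reps.put("rdkit", "rdkit-object");
        reps.put("indigo", "indigo-object");
        reps.put("cdk", "cdk-object");

        Set<String> keep = new HashSet<>(Arrays.asList("rdkit"));
        System.out.println(keepOnly(reps, keep).keySet()); // prints [rdkit]
        System.out.println(reps.size());                   // prints 3: the input map is untouched
    }
}
```

Placed before a Table Writer, such a step would let the user decide explicitly which cached representations end up in the written file.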

Nils

Hi,

I think the advantage of explicit type conversions is that you get access to the exact input that was used. This helps a lot when debugging a node. But for the normal user it is very confusing. Even comp chemists are not always aware of all the limitations the file formats have.

This problem can only be solved by a common molecule supertype.  If the original input needs to be conserved it could be placed into a StringCell and then parsed into a molecule from there. But defining such a type will take quite a lot of time.

In the short term it would help if the molecule formats not defined in the base chemistry types could implement SdfValue and/or Mol2Value. Wherever possible, the nodes should allow at least one of these types as input besides their native format. For the RDKit it already works nicely, including using the SDF renderers. CDK and Indigo will hopefully follow soon.

Cheers

Guido

  

Hi

My $0.02 from a user perspective. I really do encourage the communities to identify the minimum set of basic chemistry types, chosen from existing ones. No new ones for KNIME. I say this because we need practical solutions, and the sooner the better. CDK's effort to set themselves up as the standard in open source has long since finished, and I would not expect them to quickly follow new solutions from Indigo, RDKit and other forums; on the other hand, commercial vendors could be very sensitive on the topic. And help. It is time to get it done. Kudos to Greg and to whoever raised the topic, and to Nikolas for useful practical advice. I am excited about it and would like to hear from GGAsoftware, Schrödinger and ChemAxon what they think.

Cheers

Andrea