Chemistry Data types

We've built some standard chemical readers and writers for common chemistry file types. We'd like to use the chemistry types to provide interoperability with other KNIME nodes.
At present these types are supported:
* SDF
* MOL2
* SMI
* SLN

Our parser separates the data from the molecule during the read, outputting a table with a single molecule column and separate data columns.

Can we have additional types for the PDB and MOL (as in Molfile V2000 and V3000) formats?
This would greatly increase interoperability if others standardise on the same.
MOL is just the molecule, whereas SDF is the molecule plus data fields.

These nodes are part of the nodes4knime effort on sourceforge which we'll be posting a beta of shortly.

PS The chemistry types are a great enhancement, by the way!

alemon wrote:
Can we have additional types for the PDB and MOL (as in Molfile V2000 and V3000) formats?
This would greatly increase interoperability if others standardise on the same.
MOL is just the molecule, whereas SDF is the molecule plus data fields.

A type for PDB already exists in the org.knime.bio.types-Plugin. As for MOL, I'm not sure there is a real necessity for it. As you said, MOL is SDF without the data block, so it is very easy to make an SDF out of a MOL: you only need to add the separating $$$$ line at the end. Or are there special cases where you really need a MOL and not an SDF?
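To make the MOL-to-SDF step concrete, here is a minimal Python sketch. The function name `mol_to_sd_record` is invented for this illustration and is not part of KNIME or any toolkit; it assumes the input is a single Molfile block whose last non-blank line is `M  END`.

```python
def mol_to_sd_record(molblock: str) -> str:
    """Turn one Molfile block into an SD record that has no data fields."""
    body = molblock.rstrip()  # drop trailing newlines and spaces only
    if not body.endswith("M  END"):
        raise ValueError("expected a Molfile block ending in 'M  END'")
    # An SD record is the Molfile block followed by the '$$$$' delimiter line.
    return body + "\n$$$$\n"
```

The leading lines of the block are left untouched, because the first line of a Molfile header may legitimately be blank.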

Regards,

Thorsten

Thanks I'd not spotted the PDB type!

As for MOL, yes, it's like an SDF with no data fields, but SMILES, MOL and SLN are molecular structure types, whereas SDF is a record type containing fields.
It would be a better design to separate the two, especially as we may process the molecule separately from the data fields.
At present we'd need to check whether a cell of type SDF contains data fields or not.

I can't help feeling that the SDF type is an artifact of the method of reading the files, i.e. reading everything between the $$$$ and then passing this block around. In most common software this is usually done by converting that block into a record.
For example

Marvin 02040810313D

19 20 0 0 1 0 999 V2000
-2.6357 -2.4110 1.3269 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.2416 -2.0853 -0.0164 O 0 0 0 0 0 0 0 0 0 0 0 0
-1.5107 -0.9279 -0.1272 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.0869 -0.9974 -0.0869 C 0 0 0 0 0 0 0 0 0 0 0 0
0.6930 0.1866 -0.2848 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0139 1.4223 -0.5492 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.3952 1.5248 -0.6190 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.1464 0.3314 -0.3919 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.5173 0.3529 -0.5141 O 0 0 0 0 0 0 0 0 0 0 0 0
-4.2506 1.0378 0.5211 C 0 0 1 0 0 0 0 0 0 0 0 0
-5.5608 1.1496 0.1332 F 0 0 0 0 0 0 0 0 0 0 0 0
-4.2056 0.3754 1.7229 F 0 0 0 0 0 0 0 0 0 0 0 0
2.1059 0.1262 -0.2280 C 0 0 0 0 0 0 0 0 0 0 0 0
2.7504 -1.0072 -0.2915 N 0 0 0 0 0 0 0 0 0 0 0 0
4.0088 -1.1332 -0.2448 N 0 0 0 0 0 0 0 0 0 0 0 0
4.8045 -0.0969 -0.1235 C 0 0 0 0 0 0 0 0 0 0 0 0
6.0152 -0.2427 -0.0803 O 0 0 0 0 0 0 0 0 0 0 0 0
4.2507 1.1444 -0.0470 C 0 0 0 0 0 0 0 0 0 0 0 0
2.9085 1.2502 -0.0995 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
3 4 4 0 0 0 0
4 5 4 0 0 0 0
5 6 4 0 0 0 0
6 7 4 0 0 0 0
7 8 4 0 0 0 0
3 8 4 0 0 0 0
8 9 1 0 0 0 0
9 10 1 0 0 0 0
10 11 1 0 0 0 0
10 12 1 0 0 0 0
5 13 1 0 0 0 0
13 14 2 0 0 0 0
14 15 1 0 0 0 0
15 16 1 0 0 0 0
16 17 2 0 0 0 0
16 18 1 0 0 0 0
18 19 2 0 0 0 0
13 19 1 0 0 0 0
M END
>
23431

>
2344

>
C12H10F2N2O3

>
268.22

>
COc1cc(ccc1OC(F)F)C2=NNC(=O)C=C2

>
zardaverine

>
No

$$$$

Converted to a record with fields or, in KNIME, a table with columns:
Names: | Molecular Structure | Row# | ID | FORMULA | WEIGHT | SMILES | NAME | Drug |
Types: | MOL | Int | Text | Text | float | SMILES | Text | Text |

Currently the standard SDF Reader generates
| Structure |
| SDF |

Any following nodes that can process a SMILES column can pick up the fields directly.
Similarly, I can use a JFreeChart node to plot the molecular weights.

I think this would generate improved interoperability and reduce the complexity of workflows.
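As a rough illustration of the record-to-row conversion argued for here, the Python sketch below splits one SD record into a structure value plus typed data values. It is only a sketch under assumptions: the names `sd_record_to_row` and `infer` are invented, data headers are assumed to look like `> <NAME>`, a blank line is assumed to end each value, and the type guessing (int, then float, else text) is a simplification that would, for example, drop leading zeros from an ID.

```python
import re

# Data header line, e.g. "> <WEIGHT>" or ">  <WEIGHT> (12)".
HEADER = re.compile(r"^>.*<([^>]+)>")

def infer(value: str):
    """Guess a column type: try int, then float, otherwise keep the text."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

def sd_record_to_row(record: str) -> dict:
    """Split one SD record into a structure column plus typed data columns."""
    molblock, end, data = record.partition("M  END")
    if not end:
        raise ValueError("no 'M  END' line found")
    row = {"Molecular Structure": molblock + end}
    name, lines = None, []
    for line in data.splitlines():
        header = HEADER.match(line)
        ends_item = header or not line.strip() or line.startswith("$$$$")
        if name is not None and ends_item:
            row[name] = infer("\n".join(lines))
            name = None
        if header:
            name, lines = header.group(1), []
        elif name is not None:
            lines.append(line)
    if name is not None:  # record ended without a blank line or '$$$$'
        row[name] = infer("\n".join(lines))
    return row
```

A downstream node would then see an ordinary numeric WEIGHT column and a text NAME column without knowing anything about the SD format.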

Especially regarding representations such as MOL and SDF, I think it important that we collectively adhere to a set of conventions in our use of established KNIME data types to help ensure consistency across nodes developed by different groups. Such conventions help fill in the gaps when working with data cells instead of flat files (i.e. SDfiles.)

As alemon points out, the SDfile format consists of a Molfile (MOL) block, zero or more data header + data blocks, a blank line, and the '$$$$' line to separate one structure record from the next. It is worth further pointing out that both the MOL2 and SLN formats support data fields as well (the fields are more constrained in MOL2 but SLN permits user-invented data fields in a manner similar to SDfiles.) Compared to the formats of SDfile, MOL2, SMILES, and even SLN, the MOL format is unique in its requirement that there be only one structure represented per file. To assist users with reading in multiple structures more easily, some software packages have historically offered the ability to read in entire directories of MOL files at a time.

When originally proposed, it was imagined that the SDF type could be used for storing information from either MOL or SDfiles. It still strikes me that, generally speaking, requiring one data type for every unique file format would impose an awkward constraint without any clear, universal benefit. An Occam's razor approach to establishing new data types is probably warranted.

The convention that we can now observe in nodes authored by Infocom, Schrödinger, Tripos, etc. appears to follow this pattern
* use the SDF type for storing data from either MOL or SDfiles
* one structure representation per data cell (i.e. do not read in a file of multiple structures and merely dump it into a single cell)
* if associated data is found alongside the structure's connection table information, offer a capability to extract that data into distinct data columns (i.e. just what alemon said and this applies to more than just SDfiles)
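The "one structure per data cell" convention can be sketched as follows in Python. This is an illustration only: `split_sd_records` is an invented name, and the sketch assumes that any line starting with `$$$$` is a record delimiter. A lone Molfile with no delimiter is kept as a single record, in line with using one type for data from either MOL or SDfiles.

```python
def split_sd_records(text: str) -> list:
    """Return one string per structure record, each ending with its '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.startswith("$$$$"):
            records.append("\n".join(current) + "\n")
            current = []
    # A trailing Molfile without a terminator is kept as a final record.
    if any(line.strip() for line in current):
        records.append("\n".join(current) + "\n")
    return records
```

Each returned string would populate exactly one cell, so a file with N structures yields N rows rather than one large cell.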

An area where there does not appear to be a clear convention at present is the matter of what to do with those data fields in the SD record if their values have been extracted into other data columns. Strip the data header + data blocks from the SD record, effectively leaving only a Molfile block? Leave those data blocks in the structure's record to ensure that their original values can be extracted at will by later nodes in a workflow? What if the order of certain data fields was significant to a 3rd party application even though the ctfile specification is silent on this topic? The most general solution would be to expose options to the KNIME user (via a node's dialog) on what is to be done here. If there were a "best" solution, we might yet see a convention adopted around it.

Discussion around establishing (or tweaking) conventions can hopefully guide us all towards a better way of doing things. It can also lead to identifying needs for creating additional data types (such as the topic of V2000 and V3000 giving rise to distinct data types.) If there are shortcomings with the current conventions or established data types, let's draw those out (or give examples) so that we can work through them as a community.

Hi Davin,
Thanks for your well considered reply.
I agree completely that we need standards to enable interoperability. Interoperability is what will multiply the utility of KNIME: when all our nodes play together, the community can make more use of it.

I'm less familiar with the MOL2 and SLN file format standards, having only used them briefly in my career. In general I expect that any cheminformatics application will need to deal specifically with the chemical structure, and therefore that the structure will need to be presented in a single cell type. I agree that sometimes you may want to keep the data along with the structure record, but I would suggest the table construct enables this without keeping all the data and fields in one cell type.

You have slightly misinterpreted my request.
I'm not asking for a MOLFILE type but for a MOL type (as in the MOLFILE block in the CTFile format standard specification). Having types based on file formats would be a bad idea; they are actually all MOLECULE types, but SMILES or SLN indicates the format of the data to facilitate interoperability.
If you recall the discussion we had at the meeting in the UK at Lilly, we contrasted KNIME with Pipeline Pilot, which offers an internal XML format for molecular structures: when importing chemical information into the PP environment, it translates the molecular information presented in a specific file format into an internal representation. This MOLECULAR type is then used everywhere to provide interoperability.
This has the advantage that any node that can deal with this type can process data from any source, but the disadvantage that it relies on a strong interpretation and translation of each format into a common representation used internally. This is a BIG task and leads to many subtle issues of interpretation which are not necessarily covered in each commercial vendor's specification.
We agreed that this would be a difficult task for the KNIME team to achieve, and that we would instead aim to standardise on a list of specific molecular types to allow interoperability between nodes that understand the same types. For example, a SMILES type processed by a node to deliver Daylight fingerprints, or a CTFile format processed to a Daylight fingerprint.

I would prefer that we standardise on chemical representation standards rather than on file formats. It's a subtle but important difference.

This means our standards would be MOL (CTFile MOLFILE block format), SMILES, etc.; I'll let you answer with the Tripos equivalents.
The reader nodes then read file formats such as SDF, RDF, MOL2, SLN etc. and deliver tables whose cells contain the extracted base types for the molecular data and the additional fields, ideally interpreted as STRING, DOUBLE, BOOLEAN etc.

We then remove the need to create extractors. The present argument for reading SDF and using field extractors does not seem to solve many issues: the argument was that we'd otherwise end up with lots of file readers from different sources, but instead we end up with lots of field extractors from different sources and more complicated workflows.

I'd argue strongly that SDF is not a structure representation type but a record representation type. It should therefore not be a cell type.
The reason a standard has emerged is that one has been defined in the chemistry types, and we all want to use it because we don't want to be in a standard with one adopter (us). However, any standard, like a theory, must be challengeable and changeable, so I guess I'm doing just that.

I propose we define some molecule types
MOL_CT
MOL_SLN
MOL_SMILES
MOL_MOL2

All are types which ONLY contain molecular information and are either
- read directly from SDF, MOL2, SLN etc.
or
- extracted from SDF, MOL2, SLN etc. file types

This would allow implementers to choose to create a reading node and extractor or read and extract in one step. It would avoid confusion with the file based types.

At the cost of a little more complexity.
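One way to picture the proposal is as a small tagged value: the cell carries structure text plus a tag saying which representation it is, and a node simply declares which tags it understands. The Python sketch below is purely illustrative; `MolType`, `MoleculeCell` and `accepts` are invented names and do not correspond to any KNIME class.

```python
from dataclasses import dataclass
from enum import Enum

class MolType(Enum):
    """The four molecule types proposed in this post."""
    MOL_CT = "CTfile Molfile block"
    MOL_SLN = "SLN string"
    MOL_SMILES = "SMILES string"
    MOL_MOL2 = "MOL2 molecule"

@dataclass(frozen=True)
class MoleculeCell:
    mol_type: MolType
    text: str  # structure only, no data fields

def accepts(node_inputs: set, cell: MoleculeCell) -> bool:
    """A node lists the types it understands; compatibility is a tag check."""
    return cell.mol_type in node_inputs
```

A fingerprint node that understands SMILES and CTfile input would declare `{MolType.MOL_SMILES, MolType.MOL_CT}` and could be wired to any reader that emits either.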

Great having the debate anyhow and I look forward to feedback.

Best regards
Andrew
(aka alemon)

Hi Andrew --

As a quick aside, even though I knew you were alemon, for some bizarre reason my brain still assumed that "alemon" must be derived from a literary figure that I was unable to recognize. I had planned to look up in what book "alemon" appears until I saw your last note which helped me realize what should have been obvious.

alemon wrote:

You have slightly misinterpreted my request.
I'm not asking for a MOLFILE type but for a MOL type (as in the MOLFILE block in the CTFile format standard specification). Having types based on file formats would be a bad idea; they are actually all MOLECULE types, but SMILES or SLN indicates the format of the data to facilitate interoperability.

Sorry, I may have been sloppy in my wording. I did interpret your suggestion as being the creation of a cell type, named MOL, to contain a Molfile block. If I have missed something subtle in saying this, please stop me.

If I have understood the rest of your post correctly, I believe we are very much talking along the same lines and I believe I understand your reasoning for advocating the use of a MOL type instead of an SDF type within KNIME. There are a few details (or maybe headaches) we need to think through still:

alemon wrote:

I'd argue strongly that SDF is not a structure representation type but a record representation type. It should therefore not be a cell type.

Agreed, let's call the current SDF type a record representation type. But what to do about the other types (i.e. SLN and MOL2) which are also properly termed record representation types? Any Molfile block can be contained within an SDfile record, but there are not analogous subsets for all other noteworthy structure representation formats.

Another ugly subtlety to consider involves the header block (or more specifically the title line) of a Molfile block. (It is worth noting that a non-trivial number of applications choose to store structure names in data fields in SDfile records as opposed to storing that information in the header block of the Molfile block.) Arguably, a name is not part of a structure representation and so to be pure about it, we might consider instead having a true MOL_CTAB type which stores only the structure representation via the Ctab block (a Molfile block consists of a header block and a Ctab block.) If we take this ultra-pure route with MOL_CTAB, then we should consider being consistent and expecting that SMILES records read from a file should have their RegNames separated from the structure representation. If we did not take the purist route and permitted a full Molfile block to be stored in our MOL type, then we would do so recognizing that the actual structure name might not be retained (because it was in a data field from an SDfile record instead.) We almost need KNIME to construct a dendrogram for us to clearly see all possibilities here, but I think two important questions to consider are (1) how much effort is required of a developer of a new chemistry-oriented KNIME node, and (2) how clear is it for a non-CADD chemist to understand the basic implications of a chemical data type when using KNIME?
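To show what the "purist" split would involve, here is a small Python sketch. Both function names are invented for illustration. It assumes a V2000-style Molfile whose header block is exactly the first three lines (title, program/timestamp, comment), and a SMILES file line in which the name is whatever follows the first run of whitespace.

```python
def split_molfile(molblock: str):
    """Return (header_lines, ctab_text) for one Molfile block."""
    lines = molblock.splitlines()
    if len(lines) < 4:
        raise ValueError("expected three header lines plus a counts line")
    header, ctab = lines[:3], lines[3:]
    return header, "\n".join(ctab) + "\n"

def split_smiles_line(line: str):
    """Return (smiles, name); the name may be empty."""
    parts = line.strip().split(None, 1)
    smiles = parts[0] if parts else ""
    name = parts[1] if len(parts) > 1 else ""
    return smiles, name
```

In both cases the name ends up as a separate value that a reader node could place in its own column. Note that the Molfile split breaks if an upstream tool has already stripped a blank title line.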

alemon wrote:

This would allow implementers to choose to create a reading node and extractor or read and extract in one step. It would avoid confusion with the file based types.

Actually, if I understand your comment here, this is precisely what the Tripos reader nodes do already. I think the same is true of the others but that should be double-checked.

alemon wrote:

The reader nodes then read file formats such as SDF, RDF, MOL2, SLN etc. and deliver tables whose cells contain the extracted base types for the molecular data and the additional fields, ideally interpreted as STRING, DOUBLE, BOOLEAN etc.

As a side issue triggered by the above: At the KNIME workshop in November, I believe Joe (of Symyx) was raising an issue around multiline values in data fields that employed continuation characters. I do not remember this clearly but it might be good if Joe could jump in here.

alemon wrote:

However, any standard, like a theory, must be challengeable and changeable, so I guess I'm doing just that.

Hooray! I hope my comments above are not interpreted as defending any one particular way of doing things -- these are issues that we all have to navigate and if we can find a better way of solving or getting around them, all the better.

Enjoy,

Davin

One other thing that came to my mind: we should not forget about libraries or external tools that currently need a full SDF record. CDK or OpenBabel, for example, rely on the input cells containing complete SDF records including header, data fields, etc. Of course this issue is solvable by re-creating the SDF record out of the MOL_CTAB (or whatever there might be in the future) and property columns, but that is not very nice, as it requires some additional work in these nodes and would also make the dialog more complex.
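For a sense of how much extra work the re-creation step is, here is a minimal Python sketch. `build_sd_record` is an invented name; it assumes the structure cell holds a full Molfile block (header included) and that each property value fits on lines that contain no blank line.

```python
def build_sd_record(molblock: str, properties: dict) -> str:
    """Reassemble an SD record from a structure cell and property columns."""
    out = [molblock.rstrip()]
    for name, value in properties.items():
        out.append(f">  <{name}>")  # data header
        out.append(str(value))      # data value
        out.append("")              # blank line ends the data item
    out.append("$$$$")
    return "\n".join(out) + "\n"
```

The code itself is short; the cost is mostly in the node dialog, where the user has to choose which columns are written back as data fields.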

Thorsten

Quote:
As a side issue triggered by the above: At the KNIME workshop in November, I believe Joe (of Symyx) was raising an issue around multiline values in data fields that employed continuation characters. I do not remember this clearly but it might be good if Joe could jump in here.

I've been lurking on this thread, but I'll rise to the invitation.
I've been looking at our published File Format document (http://www.mdl.com/downloads/public/ctfile/ctfile.pdf) and what existing readers/writers and Symyx databases are doing. As a number of you are aware, the sdfile section of the File Format document is none too clear about what is and is not allowed.

Relative to the contents of the data block I've concluded (with examples in parentheses):
1) line lengths are not constrained (PubChem sdfiles)
2) multiline data is used (supplier information in ACD)
3) blank lines embedded in data are supported (Symyx Direct sdfreader)
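A reader that wants to tolerate all three observations cannot simply stop a value at the first blank line. The Python sketch below shows one possible tolerant approach, offered as an assumption rather than a description of any existing reader: a value runs until the next data header or the `$$$$` line, and only trailing blank lines are dropped. The name `parse_data_block` and the sample field names in the usage note are invented.

```python
import re

# Data header line, e.g. "> <SUPPLIERS>".
HEADER = re.compile(r"^>.*<([^>]+)>")

def parse_data_block(lines):
    """Collect each value up to the next header or '$$$$', keeping inner blanks."""
    fields, name, buf = {}, None, []

    def flush():
        if name is not None:
            # Drop trailing blank lines only; embedded ones are preserved.
            while buf and not buf[-1].strip():
                buf.pop()
            fields[name] = "\n".join(buf)

    for line in lines:
        header = HEADER.match(line)
        if header or line.startswith("$$$$"):
            flush()
            name, buf = (header.group(1) if header else None), []
        elif name is not None:
            buf.append(line)
    flush()
    return fields
```

With the lines `> <SUPPLIERS>`, `Acme`, a blank line, `Beta Corp`, a blank line, then `$$$$`, the SUPPLIERS value keeps both supplier lines and the blank line between them. The trade-off is that a value line which itself looks like a data header would be misread as the start of a new field.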

And what about the structure block? First, it should be noted that the structure block is not equivalent to a CTAB (connection table)... the clearest instance of this is encountered when storing generic structures, where the structure block contains a number of CTABs.

We typically put molecule names in the datablock so they are available to the sdfile consumer without requiring that they parse the structure block. Many of our sdfile consumers handle the structure block as opaque, and don't need to understand if it is a specific or generic, v2000 or v3000 molfile.

Our tools assume that structure and data are separated, and our nodes (under development) use MOL and RXN datatypes, with the option of casting a MOL to an SDF (by adding a terminal '$$$$'). We are following this discussion with interest, since it is in everyone's best interest if nodes can interoperate.

Hi Joe,
Good to hear from you.
Sounds like we're on the same page with respect to the CTFile formats. My post is petitioning KNIME to extend their chemistry types to include MOL (and RXN) types along with SDF etc.
As you say, we can cast MOL to SDF by adding $$$$; job done.

The point is that for interoperability we should extend the base KNIME package with these types, not create our own. Perhaps this is already planned?

We want to be able to create nodes that read standard file formats and create a table with correctly typed columns for the base types, which we agree is MOL for the structure in an SDF record.

So KNIME, please can we have MOL and RXN types added to the chemical file types to make them universal, or at least start a discussion to define a standard set of chemistry types? This should include the other types such as MOL2 and SLN etc., which will need some thought, as Davin points out.

Looks like we're all close in opinion on this.

Thor wrote:

Quote:
One other thing that came to my mind: we should not forget about libraries or external tools that currently need a full SDF record. CDK or OpenBabel, for example, rely on the input cells containing complete SDF records including header, data fields, etc.

If some software needs the full SDF record as a type, then there may be some nodes that read SDF files into the SDF type, or else we'd need a tablerow2SDF node to put them back together, but in my experience this would be the exception, not the rule.

Potts wrote:

Quote:
two important questions to consider are (1) how much effort is required of a developer of a new chemistry-oriented KNIME node, and (2) how clear is it for a non-CADD chemist to understand the basic implications of a chemical data type when using KNIME?

Agreed: it must be easy to use in KNIME and be a well-defined standard for development.
The chemistry types are a start for the latter. Our single reader node is designed to make reading as easy as possible, i.e. one node with typed output into columns.
Interoperability is achieved with the types, not through internal knowledge of the formats.

As for the name in the header of a Molfile block, if it's in there it's a property of the molecule in my book and therefore subject to whatever software reads the structure block. In the past we did this by having a molecule attribute called name, in the same way that a molecule has an attribute called mass.

We should consider the MOL2, SLN and SMILES in more detail to come to a common definition and standard like above with CTFile formats.

Best regards
Andrew
(aka alemon, also referred to as jif - but that's a beer conversation)

I'm converting a file of molecular formulas from SLN to SMI. How do I do this?

Try the Open Babel node.

simon