Hi Davin,
Thanks for your well-considered reply.
I agree completely that we need standards to enable interoperability. Interoperability is what will multiply the utility of Knime: when all our nodes play together, the community can make more use of it.
I'm less familiar with the MOL2 and SLN file format standards, having used them only briefly in my career. In general I expect that any cheminformatics application will need to deal specifically with the chemical structure, and I therefore expect that the structure will need to be presented in a single cell type. I agree that sometimes you may want to keep the data along with the structure record, but I would suggest that the table construct enables this without keeping all the data and fields in one cell type.
You have slightly misinterpreted my request.
I'm not asking for a MOLFILE type but for a MOL type (as in the MOLFILE block of the CTFile format standard specification). Having types based on file formats would be a bad idea. They are actually all MOLECULE types; SMILES or SLN just indicates the format of the data, to facilitate interoperability.
You may recall the discussion we had at the meeting in the UK at Lilly. If we contrast Knime with Pipeline Pilot, PP offers an internal XML format for molecular structures, and when importing chemical information into the PP environment it translates the molecular information presented in a specific file format into that internal representation. This MOLECULAR type is then used everywhere to provide interoperability.
This has the advantage that any node that can deal with this type can process data from any source, but the disadvantage that it relies on a strong interpretation and translation of each format into a common internal representation. This is a BIG task and leads to many subtle issues of interpretation, which are not necessarily covered by each commercial vendor's specification.
We agreed that this would be a difficult task for the Knime team to achieve, and that instead we would aim to standardise on a list of specific molecular types, allowing interoperability between nodes which understand the same types. For example, a SMILES type processed by a node to deliver Daylight fingerprints, or a CTFile format processed to a Daylight fingerprint.
I would prefer that we standardise on chemical representation standards rather than on file formats. It's a subtle but important difference.
This means our standards would be MOL (CTFile MOLFILE block format), SMILES, etc. I'll let you answer with the Tripos equivalents.
The reader nodes then read file formats such as SDF, RDF, MOL2, SLN etc. and deliver tables whose cells contain the extracted base types for the molecular data, plus the additional fields, ideally interpreted as STRING, DOUBLE, BOOLEAN etc.
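To make the reader idea concrete, here is a rough sketch in plain Python (not Knime's API) of what such a reader might do with one SDF record: keep the MOLFILE block as the molecule cell and turn the data items into typed columns. The function names `interpret` and `split_sdf_record` are hypothetical, the sample record and its field values are made up, and the type guessing is deliberately naive. It also assumes every data header carries its field name in angle brackets.

```python
from typing import Dict, Tuple, Union

Value = Union[bool, float, str]


def interpret(text: str) -> Value:
    """Naively guess a column type: BOOLEAN, then DOUBLE, else STRING."""
    lowered = text.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return float(text)
    except ValueError:
        return text


def split_sdf_record(record: str) -> Tuple[str, Dict[str, Value]]:
    """Split one SDF record into (MOLFILE block, typed data fields)."""
    lines = record.splitlines()
    # The MOLFILE block runs up to and including the "M  END" line.
    end = next(i for i, line in enumerate(lines) if line.startswith("M  END"))
    mol_block = "\n".join(lines[: end + 1])

    fields: Dict[str, Value] = {}
    name, values = None, []
    for line in lines[end + 1:]:
        if line.strip() == "$$$$":  # record delimiter
            break
        if line.startswith(">") and "<" in line:
            # Data header such as "> <Score>": the name sits in angle brackets.
            name = line[line.index("<") + 1 : line.rindex(">")]
            values = []
        elif name is not None and line.strip() == "":
            # A blank line terminates the current data item.
            fields[name] = interpret("\n".join(values))
            name = None
        elif name is not None:
            values.append(line)
    if name is not None:  # last item had no trailing blank line
        fields[name] = interpret("\n".join(values))
    return mol_block, fields


# Made-up single-atom record, for illustration only.
record = "\n".join([
    "methane",
    "  sketch",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <Score>",
    "1.5",
    "",
    "> <Active>",
    "true",
    "",
    "> <Name>",
    "methane",
    "",
    "$$$$",
])
mol_block, fields = split_sdf_record(record)
```

In table terms, `mol_block` would go into the single molecule cell and each entry of `fields` into its own typed column, so no separate field extractor is needed downstream.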
We then remove the need to create extractors. The present argument for reading SDF and using field extractors does not seem to solve many issues: the argument was that we'd end up with lots of file readers from different sources, but instead we end up with lots of field extractors from different sources and a more complicated workflow.
I'd argue strongly that SDF is not a structure representation type but a record representation type. It should therefore not be a cell type.
The reason a standard has emerged is that one has been defined in the chemistry types, and we all want to use it because we don't want to be in a standard with one adopter (us). However, any standard, like a theory, must be challengeable and changeable, so I guess I'm doing just that.
I propose we define some molecule types:
MOL_CT
MOL_SLN
MOL_SMILES
MOL_MOL2
These are all types which contain ONLY molecular information and are either
- read directly from SDF, MOL2, SLN etc.
or
- extracted from SDF, MOL2, SLN etc. file types
This would allow implementers to choose either to create a reading node plus an extractor or to read and extract in one step. It would avoid confusion with the file-based types, at the cost of a little more complexity.
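As a rough illustration of the proposal (again plain Python, not Knime's API), the four types could be tags on a cell that holds only the molecular text, with each node declaring which tags it understands. The names `MolType`, `MolCell`, `accepts` and `FINGERPRINT_INPUTS` are hypothetical, and the payload strings are only examples.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class MolType(Enum):
    """The proposed molecule types, named after representations, not files."""
    MOL_CT = "CTFile MOLFILE block"
    MOL_SLN = "SLN"
    MOL_SMILES = "SMILES"
    MOL_MOL2 = "MOL2"


@dataclass(frozen=True)
class MolCell:
    """A cell holding ONLY molecular information, tagged with its format."""
    mol_type: MolType
    text: str


def accepts(node_inputs: FrozenSet[MolType], cell: MolCell) -> bool:
    """True if a node that understands `node_inputs` can process `cell`."""
    return cell.mol_type in node_inputs


# Assumption: a fingerprint node that understands SMILES and CTFile input.
FINGERPRINT_INPUTS = frozenset({MolType.MOL_SMILES, MolType.MOL_CT})

smiles_cell = MolCell(MolType.MOL_SMILES, "CCO")
sln_cell = MolCell(MolType.MOL_SLN, "CH3CH2OH")

can_use_smiles = accepts(FINGERPRINT_INPUTS, smiles_cell)
can_use_sln = accepts(FINGERPRINT_INPUTS, sln_cell)
```

Two nodes interoperate whenever the producer's cell type is in the consumer's accepted set. Nothing here translates between formats, which is the point of avoiding a single internal representation.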
It's great to have the debate anyhow, and I look forward to feedback.
Best regards
Andrew
(aka alemon)