Has anyone else experienced the following problem?
I've been processing ChEMBL structures first through Indigo and then through RDKit nodes. A small percentage of the Indigo-generated SMILES strings are not parsed by RDKit. They seem to be exclusively strings that carry cis or trans labels appended after the main SMILES string (e.g. | c:5, t:11|). The molecules themselves are usually, but not exclusively, large macrocycles (file attached).
I think (but am not 100% sure...) what is happening here is that Indigo is outputting the SMILES strings with support for ChemAxon's SMILES extensions enabled. This tends to be a good thing, as it allows a compact view of the molecule (SMILES) to retain more complex information (e.g. enhanced stereochemistry) that may be present in the incoming representation (e.g. SDF).
As far as I am aware, RDKit does not support these extensions, and rather than just ignoring them it raises a SMILES parsing error. I would suggest passing the Indigo-modified structures to RDKit as MOL or SDF rather than SMILES. This would retain as much information as possible, but still allow the RDKit nodes to work with the structures.
James is correct here: the RDKit does not support the ChemAxon SMILES extensions, and those are leading to the parse errors. You are likely to have fewer problems by going by way of MOL or SDF.
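If it helps to find the affected rows before re-routing them, here is a minimal Python sketch that separates the main SMILES string from a trailing |...| block. It uses only the standard library and does not call Indigo or RDKit. The function name split_extension and the example string are made up for illustration, and the pattern assumes the extension block is the last thing on the line, as in the example above.

```python
import re

# Assumption: the extension is a single trailing block of the form " |...|"
# placed after the main SMILES string, as in "| c:5, t:11|".
_EXT_BLOCK = re.compile(r"^(?P<smiles>\S+)\s+\|(?P<ext>[^|]*)\|\s*$")


def split_extension(line):
    """Return (core_smiles, extension) where extension is None if absent.

    Hypothetical helper for illustration; it only splits text and does not
    check that the SMILES itself is chemically valid.
    """
    text = line.strip()
    match = _EXT_BLOCK.match(text)
    if match is None:
        # No trailing |...| block found: hand the line back unchanged.
        return text, None
    return match.group("smiles"), match.group("ext").strip()


if __name__ == "__main__":
    # Illustrative input only, not taken from the attached file.
    core, ext = split_extension("C1CCCC=CCCCCCC=C1 | c:5, t:11|")
    print(core)
    print(ext)
```

Rows where the second value is not None are the ones carrying the extra labels. Parsing only the core string would throw away the cis/trans information in the block, so this is probably more useful for flagging structures to send through MOL or SDF than as a fix in itself.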
That sounds like good advice.