Has anyone come up with a strategy for handling organometallic compounds input as SMILES?
With "unusual" apparent valencies on N, O etc, and the various co-ordination numbers found on transition metals, RDKit typically throws a tantrum... as do CDK and Indigo!
BTW, I can probably do without molecules like ferrocene.
Also, as the data I'm looking at comes from a crystallographic DB, many compounds have multiple small ligand/solvation molecules as well as the "compound of interest" - is there an easy way to remove these?
There's really no good way to express organometallics as standard SMILES: the types of bonds that show up are just too varied and different from what you get in organic molecules. There are some extensions to SMILES supported by the ChemAxon tools that may work, but I haven't tried them for organometallics and they wouldn't help with the RDKit anyway (though adding support for at least some of those extensions is on our ToDo list). The best I'm aware of for dealing with organometallics in standard SMILES, and it's pretty poor, is to break all the ligand-metal bonds and just express things as dot-separated structures.
For removing solvents, etc.: the RDKit salt-stripper node can be used for this, but you would need to provide it with the set of species that should be removed.
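Outside of KNIME, the same idea (remove only the species you have listed) can be sketched in plain Python. This is not the salt-stripper node's implementation: the function name and the solvent list are made up for illustration, and fragments are compared as exact strings, so a solvent written differently from the listed form (e.g. "OC" rather than "CO") would not be caught. It also assumes no ring-closure digits are shared across a dot. A structure-aware tool would compare parsed structures instead.

```python
# Hypothetical list of species to discard, written exactly as they
# appear in the input data (no canonicalization is attempted here).
UNWANTED = {"O", "CO", "CCO", "ClCCl", "CC#N", "[Cl-]"}


def strip_listed_fragments(smiles, unwanted=UNWANTED):
    """Drop dot-separated fragments whose text is in `unwanted`."""
    kept = [frag for frag in smiles.split(".") if frag and frag not in unwanted]
    return ".".join(kept)


if __name__ == "__main__":
    # Water and ethanol are removed; the copper complex is kept.
    print(strip_listed_fragments("O.CC(=O)O[Cu]OC(C)=O.CCO"))
```

Anything not in the list is left alone, so counter-ions or co-ligands you want to keep survive, at the cost of having to maintain the list yourself.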
Would InChI keys work (better)?
InChI may well work better, but the data I have only has SMILES strings.
Perhaps the CIR node, under Community Nodes, Talete, can help convert?
If you simply want to rip out the ligand and solvent stuff, you can cut up the SMILES yourself: split it on the dot (.) character and keep only the fragment with the longest string.
This worked for me in over 99% of cases; only very rarely did a solvent have a longer notation than the compound.
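A minimal Python sketch of that split-and-keep-longest approach, in case it is useful. The function name is made up for illustration, and it assumes the dot only ever separates disconnected fragments (i.e. no ring-closure digits are shared across a dot). String length is only a rough proxy for molecule size, so the rare failure mentioned above still applies.

```python
def largest_fragment(smiles):
    """Return the dot-separated fragment with the longest SMILES string.

    On a tie, the first of the longest fragments is returned.
    """
    fragments = [frag for frag in smiles.split(".") if frag]
    if not fragments:
        return ""
    return max(fragments, key=len)


if __name__ == "__main__":
    # Two waters plus a copper acetate-like fragment: only the latter is kept.
    print(largest_fragment("O.CC(=O)O[Cu]OC(C)=O.O"))
```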