Is it possible to get CDK structures to retain the atom numbering present in the input? I have an SDF which has sites of metabolism listed, but CDK seems to renumber the structure, so this information can no longer be linked to the atoms.
You are right: the CDKCell stores a SMILES representation of the input. The atom numbering is determined solely by the CDK when an AtomContainer is created, not by the initial SDFile.
The only way to keep the initial atom numbering is to store it as auxiliary information. Perhaps you can also override or decorate the SMILES generator and parser to accept an initial set of atom numbers.
I had the advantage of having written the node in which I hit the problem, so I avoided CDK Cells and took the Mol or Sdf value directly. It's the CDK Cell that causes the renumbering, not the structure -> IAtomContainer conversion.
It's not the CDKCell as such: the SMILES output order is a depth-first traversal, even when the SMILES is not canonical. Using the API directly you can account for this (i.e. get the output order). Once you do that, the ordering between cells is fixed.
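To make the "get the output order" idea concrete, here is a minimal sketch in plain Java. It assumes you already have an int[] order in which order[i] is the position in the SMILES string of input atom i (as far as I know, this is what the CDK SmilesGenerator fills in when you pass it an order array, but please check against your CDK version). The class and method names are hypothetical and not part of CDK or the KNIME nodes.

```java
/** Hypothetical helper, not part of CDK or the KNIME CDK nodes. */
final class AtomOrderRemap {

    /**
     * Translates atom indices from the input numbering to the SMILES output numbering.
     * Assumption: order[i] is the position in the SMILES string of input atom i.
     */
    static int[] toOutputIndices(int[] inputIndices, int[] order) {
        int[] result = new int[inputIndices.length];
        for (int k = 0; k < inputIndices.length; k++) {
            int idx = inputIndices[k];
            if (idx < 0 || idx >= order.length) {
                throw new IllegalArgumentException("No atom with input index " + idx);
            }
            result[k] = order[idx];
        }
        return result;
    }

    /** Inverts the permutation: inverse[p] is the input index of the atom written at position p. */
    static int[] invert(int[] order) {
        int[] inverse = new int[order.length];
        for (int i = 0; i < order.length; i++) {
            inverse[order[i]] = i;
        }
        return inverse;
    }
}
```

Sites of metabolism listed against the SDF numbering could be passed through toOutputIndices before they are attached to a structure rebuilt from the SMILES, and invert goes the other way.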
As Stephan points out, transmitting this information implicitly (via the atom index) is poor. A more correct way is to use the atom value field of the molfile. You can also read and write this in CXSMILES.
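As a rough illustration of what the atom-value layer looks like on the write side, here is a minimal plain-Java sketch. It assumes the layer has the form |$_AV:v0;v1;...$| with one field per atom and an empty field for atoms without a value. It does not handle escaping or any other CXSMILES fields, so in real code the CDK generator is the better choice. The class name is hypothetical.

```java
import java.util.List;

/** Hypothetical helper showing the shape of a CXSMILES atom-value layer. */
final class AtomValueLayer {

    /**
     * Appends an atom-value layer to a SMILES string.
     * A null entry means "no value for this atom" and is written as an empty field.
     * No escaping is done, so values containing ';', '$' or '|' are rejected.
     */
    static String append(String smiles, List<String> values) {
        StringBuilder sb = new StringBuilder(smiles).append(" |$_AV:");
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                sb.append(';');
            }
            String v = values.get(i);
            if (v == null) {
                continue;
            }
            if (v.indexOf(';') >= 0 || v.indexOf('$') >= 0 || v.indexOf('|') >= 0) {
                throw new IllegalArgumentException("Value needs escaping: " + v);
            }
            sb.append(v);
        }
        return sb.append("$|").toString();
    }
}
```

For ethanol with a value only on the second atom, append("CCO", Arrays.asList(null, "1", null)) gives CCO |$_AV:;1;$|.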
Added the AV property from the SMILES in CDKNodeUtils:
/**
 * Add the atom value properties to the container.
 * Only handles a CXSMILES suffix of the form |$_AV:...$|.
 *
 * @param mol the built molecule
 * @param smiles the SMILES string the molecule was built from (with AV values)
 */
private static void addAvProperty(IAtomContainer mol, String smiles)
{
    // needs java.util.regex.Pattern and java.util.regex.Matcher
    Pattern pat = Pattern.compile("\\|\\$_AV:(.*?)\\$\\|");
    Matcher matcher = pat.matcher(smiles);
    if (matcher.find())
    {
        // limit -1 keeps trailing empty fields so positions stay aligned with atoms
        String[] values = matcher.group(1).split(";", -1);
        int n = Math.min(values.length, mol.getAtomCount());
        for (int i = 0; i < n; i++)
        {
            // atoms without a value have an empty field; parseInt would throw on it
            if (!values[i].isEmpty())
            {
                mol.getAtom(i).setProperty(CDKValue.ATOM_VALUE_PROPERTY_KEY, Integer.parseInt(values[i]));
            }
        }
    }
}
Changed the adapter cell to produce CDKCell4
Added a new atom numbering option: ATOM_VALUE
Extended the CDKValueRenderer to handle the new numbering option. It displays the atom value property if present, or a ? if not
Updated the Depiction node to look up the atom whose atom_value property matches the given value, rather than the atom at that index
[Screenshots not shown: Options; Atom values; Sequential]
Is there a better way of storing this value in the AtomContainer? Is the atom ID suitable for this purpose? The canonical numbering option seems to use the ID and I'm not sure where this is first set.
Cheers
Sam
P.S. It would appear that whenever atom numbering is used the preference should be identified and the right number returned / used. The Smarts matcher doesn't do this and I therefore should update this. There needs to be a default behaviour for "None" having been selected in the preferences.
Sorry for the delay; apparently I don't get alerts on replies.
You're on the right track, but I wasn't suggesting adding all atom indexes as CXSMILES. A more efficient way to do that is to grab the output order (int[] seq) and pass that between cells with the SMILES. What I was actually suggesting was that your specific input shouldn't describe metabolism points using arbitrary indexes.
Just to clarify, I do NOT think you should maintain the atom order between nodes. There are much more robust ways to transmit information. In some toolkits (e.g. OEChem) indexes aren't even sequential! If you really want to do this, store it in a custom property (e.g. KNIME_ORIGINAL_ATOM_IDX) and send that between nodes.
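A minimal sketch of the custom-property idea, in plain Java so it stands alone: each atom is modelled as a property map, which is only a stand-in for the property accessors on a CDK IAtom. The key name comes from the suggestion above; the class and methods are hypothetical.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: carry the original index as a property instead of relying on atom order. */
final class OriginalIndexTag {

    static final String KEY = "KNIME_ORIGINAL_ATOM_IDX";

    /** Records each atom's current position under KEY. Call this once, right after reading the input. */
    static void tag(List<Map<String, Object>> atoms) {
        for (int i = 0; i < atoms.size(); i++) {
            atoms.get(i).put(KEY, i);
        }
    }

    /** Returns the current position of the atom originally at originalIdx, or -1 if it is not present. */
    static int find(List<Map<String, Object>> atoms, int originalIdx) {
        for (int i = 0; i < atoms.size(); i++) {
            Object v = atoms.get(i).get(KEY);
            if (v instanceof Integer && (Integer) v == originalIdx) {
                return i;
            }
        }
        return -1;
    }
}
```

Because the lookup goes through the stored property, it still finds the right atom after the atoms have been reordered, which is the point of not depending on indexes.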
Anyways,
CXSMILES is natively supported in the SmilesParser (as of 1.5.13) and the SmilesGenerator (as of 1.5.14). See https://github.com/cdk/cdk/wiki/1.5.14-Release-Notes#cxsmiles-generation. CXSMILES is actually quite nasty and in general you can't parse it with a regex; that's okay for _AV: alone, but in general you need a full-blown parser. Fortunately this should all be handled by the CDK core library.
ATOM_VALUE, and the general properties as a whole, are a bit nasty, but existing code depends on them being there (e.g. Molfile). Where does the ID get used in canonical numbering? That sounds like a bug. Or do you mean the ID is assigned when you compute a canonical order? That's okay.
> P.S. It would appear that whenever atom numbering is used the preference should be identified and the right number returned / used. The Smarts matcher doesn't do this and I therefore should update this. There needs to be a default behaviour for "None" having been selected in the preferences.
I don't understand. You can get the IAtom reference from the substructure match. Again, don't depend on indexes.