CDK Atom numbering

Hi

Is it possible to get CDK structures to retain the atom numbering present in the input? I have an SDF which has sites of metabolism listed, but CDK seems to renumber the structure, so this information can no longer be linked to the atoms.

Not sure if there's a way around this?

Cheers

Sam

Update: it seems to be the CDK Cell doing this not the initial generation of the IAtomContainer

Update2:

Ah, you don't store the AtomContainer itself but a decomposition of it?

This scenario doesn't work:

1) Create an IAtomContainer

2) Add some highlight properties to the atoms

3) Create a CDK Cell

The creation of the CDK Cell jumbles the link between the properties and the atom. 

 

Hi Sam,

you are right, the CDKCell stores a SMILES representation of the input. The atom numbering depends solely on the CDK when an AtomContainer is created, not the initial SDFile.

The only way to keep the initial atom numbering is to store it as auxiliary information. Perhaps you can also override or decorate the SMILES generator and parser to accept an initial set of atom numbers.

Cheers,

Stephan

Thanks Stephan. 

I may look into a non-SMILES-based CDK Cell at some point, but I suspect that is a rabbit hole I don't wish to head down.

I've implemented a workaround in my node to avoid the CDK Adapter cell if the input type is Mol or SDF.

Cheers

Sam

sweebb, this is off-topic, but could you tell me how you added the numbering of carbon atoms to your structures?

Cheers

Casper

I had the advantage that I wrote the node I had the problem in, so I avoided CDK Cells and took the Mol or SDF value directly. It's the CDK Cell that causes the renumbering, not the structure -> IAtomContainer conversion.

It's not the CDKCell as such: the SMILES output order is a depth-first traversal, even when the SMILES is not canonical. Using the API directly you can account for this (i.e. get the output order). Ordering between cells is fixed once that's the case.
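
A minimal sketch of what "accounting for the output order" could look like, in plain Java with no CDK dependency. It assumes an array `order` where `order[i]` is the position at which input atom `i` was written to the SMILES; that convention should be checked against the CDK version in use. The class and method names are hypothetical.

```java
import java.util.Arrays;

final class OutputOrderRemap {

    /**
     * Translate atom indices from the input structure into indices that are
     * valid after the SMILES has been written and parsed again.
     * Assumption: order[i] is the output position of input atom i.
     */
    static int[] remap(int[] inputIndices, int[] order) {
        int[] out = new int[inputIndices.length];
        for (int i = 0; i < inputIndices.length; i++) {
            out[i] = order[inputIndices[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] order = {2, 0, 1, 3}; // made-up output order for a 4-atom structure
        int[] sites = {0, 3};       // made-up site-of-metabolism indices from the input
        System.out.println(Arrays.toString(remap(sites, order)));
    }
}
```

With the made-up arrays above, input atoms 0 and 3 end up at positions 2 and 3 of the round-tripped structure.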

As Stephan points out, transmitting this information implicitly (via atom index) is poor. A more correct way is to use the atom value field of the molfile. You can read/write this in CXSMILES:

CCCO |$_AV:1;2$|
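
For illustration only, a small plain-Java sketch that writes a layer of the shape shown above. It assumes one semicolon-separated entry per atom in SMILES order, that an atom without a value is written as an empty entry, and that trailing empty entries can be dropped, as in the example where only the first two of four atoms carry a value. Escaping of special characters is not handled, and the class name is hypothetical.

```java
import java.util.List;

final class AtomValueLayer {

    /**
     * Append an atom-value layer to a SMILES string.
     * values.get(i) is the value for atom i; null means "no value".
     */
    static String append(String smiles, List<String> values) {
        int last = values.size() - 1;
        while (last >= 0 && values.get(last) == null) {
            last--; // drop trailing atoms that have no value
        }
        if (last < 0) {
            return smiles; // nothing to write, keep the plain SMILES
        }
        StringBuilder sb = new StringBuilder(smiles).append(" |$_AV:");
        for (int i = 0; i <= last; i++) {
            if (i > 0) {
                sb.append(';');
            }
            if (values.get(i) != null) {
                sb.append(values.get(i));
            }
        }
        return sb.append("$|").toString();
    }
}
```

Calling `AtomValueLayer.append("CCCO", Arrays.asList("1", "2", null, null))` reproduces the string shown above.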

Hi John (I assume) 

Thanks for the response. 

I decided to stop dodging the issue and have had a crack at the extended smiles that Stephan and you have suggested. 

Changes:

  1. Created new CDK Cell: CDKCell4
    1. Calculate extended SMILES instead of normal SMILES:

      smiles = CDKNodeUtils.calculateExtendedSmiles(atomContainer, seq);

      public static String calculateExtendedSmiles(final IAtomContainer molecule, final int[] sequence) {
          return addAvString(calculateSmiles(molecule, sequence, true), sequence);
      }

      // Appends the 1-based sequence values as an atom-value (_AV) layer.
      private static String addAvString(String smiles, int[] seq) {
          if (seq.length == 0) {
              return smiles; // no atoms, so no layer to write
          }
          StringBuilder av = new StringBuilder(smiles);
          av.append(" |$_AV:");
          for (int i = 0; i < seq.length; i++) {
              if (i > 0) {
                  av.append(';');
              }
              av.append(seq[i] + 1);
          }
          return av.append("$|").toString();
      }

    2. Add the AV property from the SMILES in the CDKNodeUtils:

      /**
       * Add the atom value properties to the container
       *
       * @param mol           the built molecule
       * @param smiles        the smiles string the molecule was built from (with AV values)
       */
      private static void addAvProperty(IAtomContainer mol, String smiles) {
          // Needs java.util.regex.Pattern and java.util.regex.Matcher.
          Pattern pat = Pattern.compile("\\|\\$_AV:(.*)\\$\\|");
          Matcher matcher = pat.matcher(smiles);

          if (matcher.find()) {
              // Group 1 holds the semicolon-separated values, one per atom.
              String[] values = matcher.group(1).split(";");
              for (int i = 0; i < values.length; i++) {
                  mol.getAtom(i).setProperty(CDKValue.ATOM_VALUE_PROPERTY_KEY, Integer.parseInt(values[i]));
              }
          }
      }

  2. Changed the adapter cell to produce CDKCell4
  3. Added a new atom numbering option: ATOM_VALUE
  4. Extended the CDKValueRenderer to handle the new numbering option. It displays the atom value property if present or a ? if not
  5. Updated the Depiction node to get the atom with the given value in the atom_value property as opposed to by that index

[Screenshots not preserved: Options; Atom values; Sequential]

Is there a better way of storing this value in the AtomContainer? Is the atom ID suitable for this purpose? The canonical numbering option seems to use the ID and I'm not sure where this is first set.

Cheers

Sam

 

P.S. It would appear that whenever atom numbering is used the preference should be identified and the right number returned / used. The Smarts matcher doesn't do this and I therefore should update this. There needs to be a default behaviour for "None" having been selected in the preferences.

 

 

 

Sorry for the delay, apparently I don't get alerts on replies.

You're on the right track, but I wasn't suggesting adding all atom indexes as CXSMILES; a more efficient way to do that is to grab the output order (int[] seq) and pass that between cells with the SMILES. What I was actually suggesting was that your specific input shouldn't describe metabolism points using the arbitrary indexes.

Just to clarify, I do NOT think you should maintain the atom order between nodes. There are much more robust ways to transmit information. In some toolkits (e.g. OEChem) indexes aren't even sequential! If you really want to do this, store it in a custom property (e.g. KNIME_ORIGINAL_ATOM_IDX) and send that between nodes.
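
A rough sketch of the custom-property idea in plain Java. The `Atom` class here is a hypothetical stand-in for anything that carries a property map; with CDK the same pattern would go through `setProperty` / `getProperty` on the atom. The point is that the tag travels with the atom, so a lookup does not care where the atom sits after reordering.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OriginalIndexTag {

    static final String KEY = "KNIME_ORIGINAL_ATOM_IDX";

    /** Hypothetical stand-in for an atom with a property map. */
    static final class Atom {
        final String symbol;
        final Map<String, Object> props = new HashMap<>();

        Atom(String symbol) {
            this.symbol = symbol;
        }
    }

    /** Record each atom's current position as its original index. */
    static void tag(List<Atom> atoms) {
        for (int i = 0; i < atoms.size(); i++) {
            atoms.get(i).props.put(KEY, i);
        }
    }

    /** Find the atom tagged with the given original index, or null if none. */
    static Atom byOriginalIndex(List<Atom> atoms, int original) {
        for (Atom a : atoms) {
            if (Integer.valueOf(original).equals(a.props.get(KEY))) {
                return a;
            }
        }
        return null;
    }
}
```

Tag once, right after reading the input; any later node then asks for "the atom that was originally number N" instead of assuming the atom at position N is still the same one.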

Anyways,

CXSMILES is natively supported in the SmilesParser (as of 1.5.13) and the SmilesGenerator (as of 1.5.14). See https://github.com/cdk/cdk/wiki/1.5.14-Release-Notes#cxsmiles-generation. CXSMILES is actually quite nasty and in general you can't parse it with a regex; that's okay for _AV: on its own, but otherwise you need a full-blown parser. Fortunately this should all be handled by the CDK core library.

The ATOM_VALUE and properties in general are a bit nasty, but existing code depends on it being there (e.g. Molfile). Where does the ID get used in canonical numbering? That sounds like a bug. Or do you mean the ID is assigned when you compute a canonical order? That's okay.

P.S. It would appear that whenever atom numbering is used the preference should be identified and the right number returned / used. The Smarts matcher doesn't do this and I therefore should update this. There needs to be a default behaviour for "None" having been selected in the preferences.

I don't understand. You can get the IAtom reference from the substructure match. Again, don't depend on indexes.