Hi CDK lovers,
While working with the Largest Chain descriptor I encountered some inconsistencies that I would like to clarify.
As you can see in the attached workflow, the most important problem is that aliphatic ring atoms seem to be counted as part of the largest chain, while aromatic ring atoms are (correctly) not counted. By definition, the largest chain should not include atoms in any kind of ring.
Using CDK outside KNIME, I verified that the problem does not occur if I set checkRingSystem = true before running the largest chain descriptor calculation.
Apart from this, I noticed some smaller inconsistencies that I would also like to clarify. These are shown in the attached workflow as well.
In some cases terminal atoms seem not to be counted as part of the largest chain LC (LC Row0 = 0; LC Row2 = 1), while in other cases they are (LC Row5 = 2; LC Row6 = 2).
In some cases iso terminal groups are counted as 2 (Row4), while in other cases they are counted as 3 (Row8).
Atoms between rings seem to be counted correctly in some cases (LC Row7 = 2), while in other cases they are not (LC Row1 = 0).
Could anybody please clarify these apparent inconsistencies for me?
I had a look at the CDK documentation to see which options are available for the descriptor. According to the documentation, the LargestChainDescriptor "returns the number of atoms in the largest chain" but it doesn't specify anything further.
Setting the "checkRingSystem" option to true does indeed solve the problem of non-aromatic ring carbons being counted. If you could provide me with a reference to the definition, I can change the option in the molecular properties node and add the reference to the node description.
Coincidentally, while going through the code, I also found a "LongestAliphaticChainDescriptor". Have you had a look at that? How does the longest aliphatic chain descriptor compare to the largest chain descriptor with the "checkRingSystem" option set to true?
Regarding the inconsistencies you found, I would recommend contacting the CDK mailing list directly. Looking at your examples, I would assume that those are real deficiencies in the actual algorithm (which was written way back in 2006).
Hope that helps.
Thank you for your quick reply. Unfortunately I don't have a formal reference definition of the “Largest Chain Descriptor” to give you. I was assuming (maybe wrongly) that the word “chain” stands in contrast to “ring”. This is why I found it odd that ring atoms are counted by the “LargestChainDescriptor”.
I was not aware of the "LongestAliphaticChainDescriptor"; I had assumed that its job was done by the “LargestChainDescriptor”. At this point I don't know what the difference between the two is supposed to be. Maybe the original authors can provide more details.
I will contact the CDK mailing list as you suggested and point them directly to this forum thread, since it will be easier to explain the supposed inconsistencies with the examples at hand.
Thanks again for your feedback.
Hi again Stephan,
I saw that the "LongestAliphaticChainDescriptor" counts only carbon atoms in chains. This makes sense, as this descriptor was probably designed to provide information on the length of aliphatic chains (a characteristic of fatty acid compounds, for example). This would make it different from a “LargestChainDescriptor” with checkRingSystem = true, where the chain length would count all atoms (both carbons and heteroatoms) that are in chains and not in rings. Anyway, as I said, this should be clarified by the authors.
I submitted the inconsistencies I noticed to the CDK mailing list. They confirmed that it was a bug and that the fix will be available from CDK version 1.5.13:
Stephan, as for the LargestChainDescriptor, they confirmed that its purpose is to provide the longest path that contains only non-aromatic, non-ring atoms. I think they now achieve that simply by checking that an atom is not in a ring (with the checkRingSystem parameter).
Thanks for following this up. The patch is already part of the latest CDK 1.5.13 snapshot.
I have updated the nightly build with the latest CDK version and changed the relevant parameter of the LargestChainDescriptor.
Please double check whether the patch and parameter change fix all the inconsistencies you found.
Thank you for your effort on this. I tested the new version and it seems that all the inconsistencies were solved.
For future reference about this descriptor, I want to report here that, according to the Javadoc (http://cdk.github.io/cdk/1.5/docs/api/index.html):
- The LargestChainDescriptor counts the atoms in the largest non-ring chain
- A chain exists if there are 2 or more atoms. Thus single atom molecules will return 0
The second point can be debated, because terminal atoms and single atoms between rings will not be counted by this descriptor. It also means that the value of this descriptor can never be 1: it is either 0 or a value >= 2. Anyway, the CDK people seem to prefer sticking to the above definition.
According to this, the corrected values of the LargestChainDescriptor for the molecules in the uploaded example workflow are:
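To make the "never 1" consequence concrete, here is a minimal sketch of the two-or-more-atoms rule on its own. The function name `chain_descriptor_value` is hypothetical and not part of CDK; it only illustrates how the rule maps a raw count of non-ring path atoms to the reported value.

```python
def chain_descriptor_value(path_atoms: int) -> int:
    """Map the atom count of the longest non-ring path to the reported value.

    Assumption (taken from the Javadoc wording quoted above): a chain
    needs at least 2 atoms, so anything shorter is reported as 0.
    """
    return path_atoms if path_atoms >= 2 else 0


# Raw path sizes 0..4 atoms and the value each one would be reported as.
reported = [chain_descriptor_value(n) for n in range(5)]
print(reported)  # [0, 0, 2, 3, 4]
```

A lone substituent atom on a ring (a raw count of 1) is therefore indistinguishable from having no chain at all.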
Excellent Gio, thanks for the summary and references.
Sorry to resurrect / hijack, but I have a question relating to this thread, in particular chain counting.
I am looking for a node that counts individual chains within a molecule, which the latest paper from Arup Ghose uses in his TEMPO:
There he uses Biovia Pipeline Pilot's node to calculate the number of non-branched side chains (where propyl is one, but isopropyl is two). Is it possible to get the CDK properties node to calculate a similar metric, as it already calculates chain length?
The CDK properties node does not support this calculation out of the box.
The underlying LargestChainDescriptor class uses a two-step approach: it first determines the atoms and bonds that are part of a ring system, and then computes pairwise shortest paths between all pairs of atoms that are not part of any ring system (see code on GitHub).
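As a rough illustration of that two-step idea, here is a plain-Python sketch on a hand-built heavy-atom graph. It is not the CDK code: the names `make_graph`, `ring_atoms` and `largest_chain` are hypothetical, atoms are just integer indices, ring membership is found with a simple "is the bond on a cycle" test rather than CDK's ring search, and whether CDK treats every edge case the same way is an assumption.

```python
from collections import deque


def make_graph(bonds):
    """Build an undirected adjacency map from (atom, atom) index pairs."""
    adj = {}
    for u, v in bonds:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def _reachable_without_bond(adj, start, goal):
    """True if goal can be reached from start without using the bond start-goal."""
    seen = {start}
    queue = deque([start])
    while queue:
        a = queue.popleft()
        for b in adj[a]:
            if a == start and b == goal:
                continue  # this is the bond we pretend to have removed
            if b == goal:
                return True
            if b not in seen:
                seen.add(b)
                queue.append(b)
    return False


def ring_atoms(adj):
    """Step 1: collect atoms that sit on at least one cycle."""
    in_ring = set()
    for u in adj:
        for v in adj[u]:
            if u < v and _reachable_without_bond(adj, u, v):
                in_ring.update((u, v))
    return in_ring


def largest_chain(adj):
    """Step 2: longest shortest path (counted in atoms) among non-ring atoms."""
    rings = ring_atoms(adj)
    best = 0
    for src in (a for a in adj if a not in rings):
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search restricted to non-ring atoms
            a = queue.popleft()
            for b in adj[a]:
                if b not in rings and b not in dist:
                    dist[b] = dist[a] + 1
                    queue.append(b)
        best = max(best, max(dist.values()) + 1)  # atoms = bonds + 1
    return best if best >= 2 else 0  # a chain needs 2 or more atoms


ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(largest_chain(make_graph(ring + [(0, 6)])))                  # methyl on a ring: 0
print(largest_chain(make_graph(ring + [(0, 6), (6, 7), (7, 8)])))  # propyl on a ring: 3
print(largest_chain(make_graph(ring + [(0, 6), (6, 7), (6, 8)])))  # isopropyl on a ring: 3
```

Because the non-ring part of a molecule contains no cycles, the shortest path between two chain atoms is also the only path between them, which is why a shortest-path search is enough to find the longest chain.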
I imagine you can use a similar approach to exhaustively enumerate all chains. Alternatively, a depth-first algorithm might be more intuitive: start at a terminal atom and explore the tree from there.
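One way to read the "start at a terminal atom" suggestion is sketched below. It is a simplified terminal-to-branch walk, not a full depth-first enumeration, and it rests on an assumption: that a "non-branched side chain" is a run of non-ring atoms from a terminal atom up to the first branch point or ring atom. Whether this matches the Pipeline Pilot definition is not confirmed here. The function name `unbranched_side_chains` is hypothetical, and the set of ring atoms is passed in by the caller (for example, from a separate ring search).

```python
def unbranched_side_chains(adj, ring_atoms):
    """Collect unbranched runs of non-ring atoms, one per terminal atom.

    adj: dict mapping atom index -> set of neighbour indices (heavy atoms only).
    ring_atoms: set of atom indices known to be in a ring (supplied by caller).
    """
    segments = []
    seen = set()
    for start in adj:
        if start in ring_atoms or len(adj[start]) != 1:
            continue  # only begin at terminal, non-ring atoms
        path = [start]
        prev, cur = None, start
        while True:
            nxt = [n for n in adj[cur] if n != prev]
            if len(nxt) != 1 or nxt[0] in ring_atoms:
                break  # dead end, or the next atom belongs to a ring
            prev, cur = cur, nxt[0]
            if len(adj[cur]) > 2:
                break  # branch point: the unbranched run ends before it
            path.append(cur)
        key = frozenset(path)
        if key not in seen:  # an acyclic molecule is reached from both ends
            seen.add(key)
            segments.append(path)
    return segments


def graph(bonds):
    """Small helper: adjacency map from (atom, atom) pairs."""
    adj = {}
    for u, v in bonds:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


ring_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
ring = {0, 1, 2, 3, 4, 5}
propyl = graph(ring_bonds + [(0, 6), (6, 7), (7, 8)])
isopropyl = graph(ring_bonds + [(0, 6), (6, 7), (6, 8)])
print(len(unbranched_side_chains(propyl, ring)))     # 1
print(len(unbranched_side_chains(isopropyl, ring)))  # 2
```

Under this reading propyl gives one segment and isopropyl gives two, which agrees with the example in the question, but other substituent patterns may need a more careful definition.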
Thanks for your speedy response, I'll take a look and see if I can work out a way of doing it as per your suggestion.