Modification of Sum Formula node

Dear all,

I would like to know whether the following modifications could be implemented in the Sum Formula node. They would be really helpful for evaluating high-resolution MS data:

1) The user should be able to define a set of elements (comparable to the Element Filter node) to be considered for formula generation. This should include more or less all known elements, because I think that so far only C, H, N and O are considered.

2) The user should be able to define the mass tolerance limit (maybe in Dalton or ppm).
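
For reference, a tolerance in ppm is relative to the measured mass, while a tolerance in Dalton is absolute, so either one can be converted to the other. A minimal Python sketch of the conversion (the function names are hypothetical and not part of the node):

```python
def ppm_to_da(mass, ppm):
    """Convert a relative tolerance in ppm to an absolute tolerance in Da."""
    return mass * ppm / 1e6


def da_to_ppm(mass, delta_da):
    """Convert an absolute tolerance in Da to a relative tolerance in ppm."""
    return delta_da / mass * 1e6


if __name__ == "__main__":
    # 10 ppm at m = 500 Da corresponds to 0.005 Da.
    print(ppm_to_da(500.0, 10.0))
    # 0.01 Da at m = 564.5593 Da corresponds to roughly 17.7 ppm.
    print(round(da_to_ppm(564.5593, 0.01), 1))
```

A ppm tolerance scales with the mass, so it tends to be the more natural setting when one workflow covers a wide mass range.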

Best regards,
Sascha

Hi Sascha,

that is a good idea.

1) You can now define a custom set of elements that should be used for the calculation. Initially I also added options to use all elements by default (or all elements minus a custom set) but had to disable these for now. The underlying matrix used for the calculation is of size 2^(#elements), so it grows exponentially and using many elements very quickly exceeds the memory limit.

2) The user can now define a mass tolerance limit in amu.
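
To give a feeling for how fast 2^(#elements) grows, here is a small Python sketch. The 8 bytes per matrix entry is purely an assumption for illustration; the actual entry size in the node is not stated here.

```python
def matrix_entries(n_elements):
    """Number of entries in a matrix of size 2^(#elements)."""
    return 2 ** n_elements


def matrix_bytes(n_elements, bytes_per_entry=8):
    """Memory estimate; bytes_per_entry=8 is an assumed value."""
    return matrix_entries(n_elements) * bytes_per_entry


if __name__ == "__main__":
    for n in (4, 10, 20, 30, 40):
        gib = matrix_bytes(n) / 2 ** 30
        print(f"{n:2d} elements: {matrix_entries(n):>15d} entries, {gib:.6f} GiB")
```

Under that assumption, 30 elements would already need 8 GiB, which is why a small custom element set is the practical choice.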

I would be grateful if you could test the node and see if it behaves as expected.

Best regards,

Stephan

Hi Stephan,

sorry that it took me so long to respond to your post. Thanks again for the quick implementation of the suggested modifications. I have already tested the new node and it behaves as expected. However, as you already mentioned, the generation of molecular formulas suffers from combinatorial explosion. Furthermore, the suggested formulas don't necessarily make sense. Would it therefore be possible to include further (e.g. user-defined) filters, for example thresholds for the C/H ratio?

Best regards,
Sascha

Hi Sascha,

yes, that's possible. Do you happen to have a citation at hand that I can use as guideline to implement sensible filters?

Have you tried the "exclude filtered" option? It invokes a molecular formula checker loosely based on the 7 Golden Rules by Kind et al., 2007. Currently the following rules are implemented. If any of them is violated, the calculated molecular formula is excluded.

  • MM Element rule (Wiley 500, limited to CHNOPS+Hal)
  • Nitrogen rule
  • RDBE rule
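
As a rough illustration of the last two rules, here is a Python sketch of the textbook forms of the RDBE and nitrogen checks for neutral molecules. This is not the CDK implementation; the function names and the simplified valence treatment are assumptions made for this example.

```python
# Nominal (integer) masses of the most abundant isotopes.
NOMINAL = {"C": 12, "H": 1, "N": 14, "O": 16, "P": 31, "S": 32,
           "F": 19, "Cl": 35, "Br": 79, "I": 127}


def rdbe(counts):
    """Ring and double bond equivalents: C - (H + Hal)/2 + (N + P)/2 + 1."""
    monovalent = sum(counts.get(e, 0) for e in ("H", "F", "Cl", "Br", "I"))
    trivalent = counts.get("N", 0) + counts.get("P", 0)
    return counts.get("C", 0) - monovalent / 2 + trivalent / 2 + 1


def passes_rdbe(counts):
    """Neutral molecules need a non-negative, integer RDBE."""
    value = rdbe(counts)
    return value >= 0 and value == int(value)


def passes_nitrogen_rule(counts):
    """Odd nominal mass goes with an odd nitrogen count, even with even."""
    nominal = sum(NOMINAL[e] * n for e, n in counts.items())
    return nominal % 2 == counts.get("N", 0) % 2


if __name__ == "__main__":
    candidate = {"C": 36, "H": 72, "N": 2, "O": 2}
    print(rdbe(candidate), passes_rdbe(candidate), passes_nitrogen_rule(candidate))
```

A fragment such as NH2 (nominal mass 16, one nitrogen) fails both checks, so it would be excluded.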

Best regards,

Stephan

Hi Stephan,

the idea of element ratios, and some thresholds for these parameters, are also defined in the publication you mentioned (http://www.biomedcentral.com/content/pdf/1471-2105-8-105.pdf). The table on page 8 actually gives some ranges for these element ratios, which might be helpful to include as filters.

Best regards,

Sascha

Hi Sascha,

thanks for pointing me to this rule. I have added the C/H and heteroatom/C ratio checks and refactored the node dialog slightly. I haven't had time yet to test the implementation extensively.

I would be grateful if you could take a look and see if it behaves properly.

You can now filter by the nitrogen rule, element ratios, and element occurrences based on the pre-defined sets provided in the article that were derived from the two spectral databases they used.
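
For anyone wondering what an element ratio check amounts to, here is a minimal Python sketch. The numeric ranges are illustrative placeholders only; the real values should be taken from the table in the cited article. The function name and the handling of carbon-free formulas are assumptions, not the node's actual behaviour.

```python
# Illustrative (low, high) limits for element/C ratios.
# Placeholder values: consult the table in the cited article for real ranges.
RATIO_LIMITS = {
    "H": (0.2, 3.1),
    "N": (0.0, 1.3),
    "O": (0.0, 1.2),
    "S": (0.0, 0.8),
}


def passes_element_ratios(counts, limits=RATIO_LIMITS):
    """Return True if every element/C ratio lies inside its allowed range."""
    carbon = counts.get("C", 0)
    if carbon == 0:
        # Ratios are undefined without carbon; rejected here by assumption.
        return False
    for element, (low, high) in limits.items():
        ratio = counts.get(element, 0) / carbon
        if not (low <= ratio <= high):
            return False
    return True


if __name__ == "__main__":
    print(passes_element_ratios({"C": 36, "H": 72, "N": 2, "O": 2}))
    print(passes_element_ratios({"C": 1, "H": 4}))
```

With these placeholder limits, C36H72N2O2 (H/C = 2.0) passes while CH4 (H/C = 4.0) does not, which shows that overly tight ranges can reject small but valid molecules.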

Best regards,

Stephan

Hi Stephan,

thanks again for your fast implementation of my suggestions. The node still behaves properly as far as I have checked, and I think the results already look much better now.

However, there is still one disadvantage when I want to apply this node for my purpose, the evaluation of mass spectrometry data of organic molecules. All compounds I am interested in have a carbon backbone, but the "Sum Formula" node also generates formulas without a single carbon atom. Therefore, I would like to know if it is possible to include a table in the node GUI where the user can define the minimum number of atoms considered per element. I don't know if these constraints would also decrease the computation time required to generate the formulas for a given mass (which can be in the range of several minutes for a mass of about 1000), as the general formula space is then bounded by certain thresholds.
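
To illustrate the idea, here is a brute-force Python sketch of formula enumeration with per-element minimum and maximum counts. It is not the algorithm the node or CDK uses; the function name and the bounds format are assumptions for this example.

```python
# Monoisotopic masses (Da) of the most abundant isotopes.
MONO = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.99491461956}


def enumerate_formulas(target, tol, bounds):
    """List element counts whose mass lies within tol of target.

    bounds maps element -> (min_count, max_count).
    """
    elements = sorted(bounds, key=lambda e: -MONO[e])  # heaviest first
    results = []

    def recurse(idx, remaining, counts):
        if idx == len(elements):
            if abs(remaining) <= tol:
                results.append(dict(counts))
            return
        el = elements[idx]
        low, high = bounds[el]
        # Never try more atoms than the remaining mass can hold.
        high = min(high, int((remaining + tol) // MONO[el]))
        for n in range(low, high + 1):
            counts[el] = n
            recurse(idx + 1, remaining - n * MONO[el], counts)
        counts.pop(el, None)

    recurse(0, target, {})
    return results


if __name__ == "__main__":
    free = {"C": (0, 10), "H": (0, 20), "N": (0, 5), "O": (0, 5)}
    with_carbon = dict(free, C=(1, 10))
    print(enumerate_formulas(16.0313, 0.02, free))
    print(enumerate_formulas(16.0313, 0.02, with_carbon))
```

In this toy case a minimum of one carbon removes the carbon-free candidate NH2 and leaves only CH4. A lower bound also prunes whole branches of the search early, so it can reduce run time, although how much it helps in the real node would need to be measured.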

Best regards,

Sascha

Hi Sascha,

that's doable. A comprehensive table where element occurrence bounds could be set would be ideal but would also require more time than I currently have. I assume restricting the number of carbon atoms is the most common use case, so I have added a text field to do exactly that.

However, I have noticed that some calculated sum formulas violate the minimum carbon number limit. I believe this to be a bug in the CDK class and have filed a bug report on the project's SourceForge website: https://sourceforge.net/p/cdk/bugs/1347/

I would be grateful if you could update this thread (or the SourceForge ticket) with any performance gains/losses you experience and how well the carbon number limit actually works.

Cheers,

Stephan

Hi Stephan,

it has been some time since I last used the Sum Formula node, but now I have got back to it. The first thing I noticed is the following error message (only when I apply the element ratio rule):

Configure failed (NoClassDefFoundError): org/openscience/cdk/formula/rules/ElementRatioRule$RatioRange

The second observation is about the predictions themselves. I wanted to predict sum formulas for the mass 564.5593 with a mass tolerance of 0.01, the Wiley-2000 element restrictions and C, H, O, N, S as included elements. However, no sum formula was predicted at all with the CDK node, while another formula calculator from a mass spectrometer vendor gives me C36H72N2O2 (plus 4 other sum formulas). My CDK version is 1.5.3.2015…

Cheers,

Sascha

Hi Sascha,

thanks for your message. The ElementRatioRule is working again. The class is not part of the CDK core library and was left out by accident during the last update.

I could also verify that a mass of 564.5593 yields no sum formulas. The default settings restricted the search space to a maximum number of atoms that was below the 72 hydrogen atoms required. I have modified the node settings dialog to allow you to change the default settings.

If you update your build (nightly), you can now get C36H72N2O2 for a mass of 564.5593 as well as several other molecular formulas.
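
The effect can be reproduced with a few lines of Python. The sketch below checks the monoisotopic mass of C36H72N2O2 against the target and shows how an upper bound on hydrogen excludes it. The helper names and the example limit of 50 hydrogen atoms are hypothetical; the node's actual former default is not stated here.

```python
# Monoisotopic masses (Da) of the most abundant isotopes.
MONO = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048,
        "O": 15.99491461956, "S": 31.97207100}


def mono_mass(counts):
    """Monoisotopic mass of a formula given as element -> count."""
    return sum(MONO[e] * n for e, n in counts.items())


def within_max_counts(counts, max_counts):
    """True if no element exceeds its upper bound (missing bound means 0)."""
    return all(n <= max_counts.get(e, 0) for e, n in counts.items())


if __name__ == "__main__":
    formula = {"C": 36, "H": 72, "N": 2, "O": 2}
    target, tol = 564.5593, 0.01
    print(abs(mono_mass(formula) - target) <= tol)

    tight = {"C": 50, "H": 50, "N": 10, "O": 10, "S": 10}  # hypothetical cap
    relaxed = dict(tight, H=100)
    print(within_max_counts(formula, tight), within_max_counts(formula, relaxed))
```

The formula matches the target mass well within 0.01, so it is only the hydrogen cap that decides whether it shows up in the results.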

Please let me know if the update works for you.

Kind regards,

Stephan