BUG: CDK property values depend on calculated property subset

Dear CDK lovers,

I recently found a behavior of the CDK Molecular Properties node that, if confirmed, could have major consequences. As the attached example workflow shows, there are properties (I found 6: TPSA, XLogP, HBA, Largest Chain, Largest Pi Chain, Lipinski RO5) whose calculated values change depending on which other properties are selected for calculation in the node.

This means that if one of these properties is calculated on its own (“in isolation”) by the Molecular Properties node, it gives a certain value (presumably the correct one), while if several properties are calculated at once, the node gives a different value. The differences are significant and cannot be attributed to rounding.

The problem is also widespread: in a test on 100,000 molecules, 95.6% of them differ in at least one property from the values calculated “in isolation”.

The percentage of molecules that differ, per descriptor, is as follows:

Descriptor          Differences (%)
HBA                 12.1
Largest Chain       94.9
Lipinski RO5        0.8
TPSA                12.1
XLogP               9.5
Largest Pi Chain    0.7
Any descriptor      95.6
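
For readers who want to reproduce this kind of comparison outside KNIME, the following is a minimal sketch of how such percentages could be computed. It assumes the two result tables have been exported as dictionaries keyed by molecule ID; the function name `percent_differing`, the tolerance, and all values are hypothetical and are not taken from the attached workflow.

```python
def percent_differing(isolated, combined, tol=1e-6):
    """Return {descriptor: % of molecules whose two values differ by more than tol}."""
    descriptors = sorted({name for row in isolated.values() for name in row})
    n = len(isolated)
    counts = {name: 0 for name in descriptors}
    any_count = 0
    for mol_id, iso_row in isolated.items():
        comb_row = combined[mol_id]
        differs = [name for name in descriptors
                   if abs(iso_row[name] - comb_row[name]) > tol]
        for name in differs:
            counts[name] += 1
        if differs:
            any_count += 1  # molecule differs in at least one descriptor
    result = {name: 100.0 * count / n for name, count in counts.items()}
    result["Any descriptor"] = 100.0 * any_count / n
    return result


# Made-up toy values: each descriptor calculated on its own ...
isolated = {
    "m1": {"TPSA": 20.23, "HBA": 1},
    "m2": {"TPSA": 40.46, "HBA": 2},
    "m3": {"TPSA": 0.0, "HBA": 0},
    "m4": {"TPSA": 63.32, "HBA": 3},
}
# ... and the same descriptors calculated together in one run.
combined = {
    "m1": {"TPSA": 20.23, "HBA": 1},
    "m2": {"TPSA": 37.30, "HBA": 2},
    "m3": {"TPSA": 0.0, "HBA": 1},
    "m4": {"TPSA": 63.32, "HBA": 3},
}

report = percent_differing(isolated, combined)
for name, pct in report.items():
    print(f"{name:<16}{pct:.1f}")
```

The tolerance keeps pure floating-point noise from being counted as a difference, which matches the point above that the observed deviations are larger than rounding effects.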

Can anybody confirm this bug or comment on it?

Thanks a lot,

Gio

Hi Gio,

which version are you using?

I tried your workflow and cannot confirm this bug. After re-execution of the nodes, I get no differences whatsoever. I had to change the column selection in the Atom Manipulator first though.

Cheers,

Stephan

Actually, I found the culprit. The ALogP and VABC descriptors modified the molecule that was then passed to the other descriptors. If you re-run your workflow without these two descriptors, everything is in order.

I changed the ALogP descriptor and removed the VABC descriptor, because VABC relies on a deprecated method for aromaticity perception.
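
The mechanism described here can be illustrated with a toy model. This is only a sketch under stated assumptions: the molecule is a plain list of atom dictionaries, and both descriptor functions are hypothetical stand-ins, not the CDK implementations. One descriptor clears the aromaticity flags as a side effect, so a descriptor that runs afterwards on the same object sees a different molecule.

```python
def make_benzene_like():
    """Toy molecule: six aromatic carbon atoms (not a CDK object)."""
    return [{"symbol": "C", "aromatic": True} for _ in range(6)]


def mutating_descriptor(mol):
    """Hypothetical descriptor that modifies its input as a side effect."""
    for atom in mol:
        atom["aromatic"] = False  # the caller's molecule is changed here
    return len(mol)


def aromatic_atom_count(mol):
    """Hypothetical descriptor that depends on the aromaticity flags."""
    return sum(1 for atom in mol if atom["aromatic"])


def calculate(mol, descriptors):
    """Run the descriptors one after another on the SAME molecule object."""
    return {d.__name__: d(mol) for d in descriptors}


isolated = calculate(make_benzene_like(), [aromatic_atom_count])
combined = calculate(make_benzene_like(),
                     [mutating_descriptor, aromatic_atom_count])

print(isolated["aromatic_atom_count"])  # value when calculated on its own
print(combined["aromatic_atom_count"])  # value after the mutating descriptor ran
```

In this toy setup the result of `aromatic_atom_count` depends on whether `mutating_descriptor` was selected in the same run, which is the same pattern as the property-subset dependence reported at the top of the thread.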

Hi Stephan,

Thanks for your reply. I saw you already made some changes and published the new (nightly) build. That's great! So effective.

Nevertheless, I have a couple of questions I would like to discuss with you:

1) I saw you commented out the KNIME smartALogP descriptor that was in use until yesterday and replaced it with the “pure CDK” version. What is the difference between them? And more generally, what is the difference between the smart versions of the descriptors introduced in CDK-KNIME and their original “pure CDK” versions? I never understood that, nor the need for these “smart” descriptors.

2) I saw you removed the VABC descriptor. Do you plan to fix it, or will it simply be removed? I'm also asking on behalf of other people who may have based workflows on it.

Cheers,

Gio

Hi Gio,

to answer your questions:

1) The 'smart' descriptors -- forgive the horrible naming -- are copies of the original CDK descriptors that were put in place because, at the time, the original descriptors were buggy or inefficient, e.g. redundant copying of objects, not thread-safe, etc. With the CDK evolving at the speed it does, the custom ALogP class has become redundant (and detrimental). The key message is, the calculated properties shouldn't be different between the pure CDK descriptors and the custom ones (unless there's a bug or a missing feature of course).

2) I try to use the CDK library 'as is' as much as possible in order to reduce work involved in maintaining the KNIME plug-in but also to ensure consistency between the plug-in and the CDK library. CDK doesn't need another fork. I'll submit a patch to the CDK repository for the VABC descriptor to change its behavior. Until that is done, the descriptor won't be accessible in the nightly build quite simply because it does 'mess up' subsequent calculations. So hopefully the descriptor will reappear in the nightly build fairly soon. The stable build (2.x) won't change until the next stable release of CDK. The 3.x branch is mirroring the nightly at the moment.

I hope that clarifies things a little. Let me know if you have further questions. Also, feel free to grab the code and add improvements where you see a need.

Cheers

Stephan

Hi Stephan,

Thank you for your answers, they clarified things a lot for me.

Thanks also for taking care of the VABC descriptor. I hope it won't be too difficult to fix the problem and that the corrected version can be included in the next KNIME stable release (3.x).

A cheap workaround to get correct values for all descriptors with the current stable KNIME-CDK version would be to calculate ALogP and VABC in two separate CDK Molecular Properties nodes, all the other descriptors in a third one, and finally join the results. I have yet to test this solution, but I think it would work.

Thanks again for the clarification and for constantly improving the CDK and KNIME-CDK nodes.
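
The join step of this workaround can be sketched outside KNIME as follows. This is a minimal illustration, assuming each of the three nodes produces a table keyed by row ID; the helper `join_tables`, the row IDs, and all descriptor values are hypothetical.

```python
def join_tables(*tables):
    """Inner-join tables of the form {row_id: {column: value}} on row_id."""
    common_ids = set(tables[0])
    for table in tables[1:]:
        common_ids &= set(table)
    joined = {}
    for row_id in sorted(common_ids):
        row = {}
        for table in tables:
            row.update(table[row_id])  # columns are assumed not to clash
        joined[row_id] = row
    return joined


# Made-up outputs of three separate calculation runs, one table per node.
alogp_only = {"Row0": {"ALogP": 1.2}, "Row1": {"ALogP": -0.4}}
vabc_only = {"Row0": {"VABC": 88.5}, "Row1": {"VABC": 120.1}}
all_others = {"Row0": {"TPSA": 20.23, "HBA": 1},
              "Row1": {"TPSA": 40.46, "HBA": 2}}

joined = join_tables(alogp_only, vabc_only, all_others)
print(joined["Row0"])
```

Because each run starts from its own copy of the input, a descriptor that alters the molecule in one run cannot influence the values produced by the other runs.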

Cheers,

Gio

Hi Gio,

I have updated the nightly for KNIME 3.x with the latest CDK version. The VABC (and ALogP) descriptors are working again.

Stephan

Hi Stephan,

This is great news! Congrats!

I'm still using KNIME version 2.12.1, but I see that in the nightly repo I can update the KNIME-CDK package to version 1.5.400.201511282037. Is this the version you referred to in your previous post? I can also use it in my current KNIME version, can't I?

Gio

Hi Gio,

I just tried to update the nightly build in my KNIME version 2.x and it didn't work. The CDK plug-in requires KNIME 3.x. I don't actually think the code is backwards compatible with the changes made to the KNIME API but I might be wrong. I'll check.

Stephan

Hi Stephan,

Now I see. In that case I will test them on a KNIME 3.x installation (so far I have not been using it for production purposes).

Thanks, I will let you know.

Dear Stephan,

I tested this issue using the latest version of KNIME 3.1 that St. Nicholas left in my shoe two nights ago ;-)

This KNIME version has KNIME-CDK package v 1.5.400.201511291629. I didn't replace it with the nightly build version, as the nightly seems older (the current KNIME-CDK nightly build version is 1.5.400.201511282037).

I tested the issue using 6 descriptors (HBA, Lipinski RO5 violations, TPSA, XLogP, Largest Chain and Largest Pi Chain). The issue seems to be solved for 5 of these 6 descriptors. This is good news, as previously all 6 descriptors were inconsistent. However, the Largest Chain descriptor still seems to give different results depending on whether it is calculated in isolation or together with other descriptors (I obtained different results for 95% of a 1000-molecule test set).

I think this is still a problem and it would be good to solve it. Do you think it's worth extending the test to all the descriptors beyond these 6? I'm concerned this problem could also affect other molecular properties.

Please also let me know if I can help in other ways.

Cheers,

Gio

Hi Gio,

just a quick thought: if you run the tests on the workflow you attached in an earlier post, the largest pi chain descriptor is mapped to the largest chain descriptor (or vice versa) somewhere in the workflow when you compile the results. That tripped me up at first before I corrected the workflow.

In any case, I'll check in detail tomorrow.

Stephan

Hi Gio,

I couldn't reproduce the error with my test set. Can you please share your workflow?

Thanks,

Stephan

Hi Stephan,

I'm sorry, you're completely right: the Largest Chain descriptor was wrongly mapped in my workflow. I have updated it in the first message of this thread, and it is now correct.

So I can now confirm that the issue is solved in KNIME-CDK package v 1.5.400.201511291629, at least for the 6 descriptors in which I identified the problem. This is very good news. Still, I wanted to ask whether you think the same problem could also affect other descriptors. If so, and if you think it's worthwhile, I could extend the test to check all the KNIME-CDK descriptors. If instead you think the problem was limited to the descriptors I tested, I would skip preparing an additional test and consider the bug closed.

Thanks for your help,

Gio

Hi Gio,

I have already extended the test workflow to take most single value descriptors into account and couldn't find any further differences. None of the CDK descriptors should modify the original molecules that are passed as arguments. I would consider this issue resolved.

All best,

Stephan

Huge! That's great news!

Thank you for all the effort you put into this.

Best,

Gio

Hi again Stephan,

I'm sorry to come back to this old issue, but I have a related question. If I use the CDK-KNIME node to calculate the molecular properties, I find no problems. However, if I use CDK (version 1.5.12 from the central Maven repository) outside KNIME, I obtain descriptor values that differ from those of the KNIME version. In particular, as we said earlier, the VABC descriptor removes the aromaticity flags on the molecules, and this causes problems in the calculation of other descriptors.

Could you tell me where I can find the patch for this descriptor that you submitted to CDK and mentioned earlier in this thread?

Hi Gio,

no worries. The VABC descriptor patch hasn't gone through yet. If you use the CDK library outside KNIME, you could just clone your molecule before passing it to the VABC descriptor. That's in effect what the patch does.

I've attached the (temporary) patch. Ideally the actual algorithm should be changed so that it doesn't change the original molecule.
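
The clone-before-calculate idea can be sketched in a language-neutral way as below. This is not the attached patch and not CDK code: the molecule is a toy list of atom dictionaries, `volume_like_descriptor` is a hypothetical stand-in for a descriptor with a side effect, and `calculate_on_clone` is an illustrative helper name. With CDK itself, the corresponding step would presumably be cloning the atom container before handing it to the descriptor.

```python
import copy


def volume_like_descriptor(mol):
    """Hypothetical descriptor that clears aromaticity flags as a side effect."""
    for atom in mol:
        atom["aromatic"] = False
    return float(len(mol))


def calculate_on_clone(descriptor, mol):
    """Run the descriptor on a deep copy so the caller's molecule stays untouched."""
    return descriptor(copy.deepcopy(mol))


mol = [{"symbol": "C", "aromatic": True} for _ in range(6)]
before = copy.deepcopy(mol)  # snapshot used only to check for changes

value = calculate_on_clone(volume_like_descriptor, mol)
unchanged = (mol == before)

print(value, unchanged)
```

The copy costs some time and memory per molecule, which is presumably why changing the algorithm itself, so that it never alters its input, is the preferable long-term fix.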

Stephan

OK Stephan,

Thank you so much for the explanation (and the patch).

Best,

Gio