First of all, I guess this is probably not a specific RDKit in KNIME issue, but that's where I was using it - so here I am!
I have been setting up some 'light' calculation workflows with the intention of calling these via KNIME Server web services. A quick SlogP-calculating service was the first thing I tackled as a demo - and I was pretty pleased with the performance (particularly as the RDKit nodes parallelise so nicely!): using the first_5K.smi data file that is part of the full RDKit distribution I was getting a turnaround of ~4 seconds (input upload, workflow call, checking for workflow completion, result retrieval, tidying up temp files and deleting the workflow).
Then I thought that before rollout I should check how the numbers compared with the MOE-calculated equivalents... My understanding is that both RDKit SlogP and MOE SlogP are based on the 1999 Wildman and Crippen 'atom contribution' paper(?). Overall the numbers are identical most of the time - but if I keep only those that differ by more than +/- 0.05 units, 1661 examples remain - and some of these are out by a lot (up to almost 5 units)!
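For reference, the RDKit numbers I'm comparing come from the standard Crippen implementation - a minimal sketch outside KNIME (the RDKit nodes wrap the same calculation), assuming the RDKit Python bindings are installed:

```python
from rdkit import Chem
from rdkit.Chem import Crippen

# Wildman-Crippen SLogP as exposed by the RDKit Python bindings;
# the SMILES here are just illustrative examples, not from the dataset
for smi in ["c1ccccc1", "CS(=O)(=O)N", "CC(=O)O"]:
    mol = Chem.MolFromSmiles(smi)
    print(smi, round(Crippen.MolLogP(mol), 3))
```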
Examining this a bit further shows up some obvious candidates for the differences:
S(VI) -SO3H: 3 examples where the MOE value is 4.94 units lower than RDKit's; 2 where MOE is 3.29 units lower; 1 where MOE is 1.65 units lower
S(VI) RSO2X: lots of sulfonamide, sulfone and sulfonic ester examples where MOE is 1.081 units lower than RDKit
S(IV) examples - e.g. sulfoxides - show a consistent -0.866 difference
P(V) examples show a -1.07 difference
Lots of mono-alkenes give +0.109
Lots of mono-nitros give +0.141; bis-nitros give +0.282
There are quite a few more consistent differences, and of course some of these are likely to be down to aromaticity perception, etc. I should also say that to generate the MOE SlogP values I first converted the SMILES into MOL files using the Marvin MolConverter node - this could have some impact.
So I just wanted to raise this to get some advice on (a) whether the two calculations are both intended to implement the same thing - i.e. based on the Wildman and Crippen paper - and (b) if so, whether there are systematic 'errors' in either implementation leading to the differences, or whether these are purely due to differences in toolkit molecular interpretation?
I have attached .table and .csv (renamed .txt because the file attacher refused the csv...) copies of the data, including the original SMILES strings.
I checked your dataset and the original publication. Wildman and Crippen used MOE SMARTS to identify the atom types, but also modified those SMARTS using the "!" (negation) operator, which was only introduced in a later version of MOE.
As you know, MOE SMARTS uses charge-separated notation for S, N and P in positive oxidation states, so I believe the differences mainly come from differences in the SMARTS-matching implementations. But there is another issue with your dataset: the original descriptors were trained on "washed" molecules. The protonation states are crucial, and I would recommend using the "Wash" node of the MOE Extensions before the Descriptor node.
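There is no Wash equivalent in the RDKit itself, but to illustrate the idea of normalising protonation states before computing descriptors, one could at least neutralize charges with the standardization module - a crude sketch only, not a substitute for MOE's Wash:

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# neutralize charges - only a rough stand-in for a proper 'wash',
# shown purely to illustrate why protonation states matter
uncharger = rdMolStandardize.Uncharger()
mol = Chem.MolFromSmiles("CC(=O)[O-]")  # acetate
print(Chem.MolToSmiles(uncharger.uncharge(mol)))
```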
In general I think that to evaluate QSAR models the same descriptor implementation should be used as was used to train the model. Thus even if there is a systematic error, it is systematic across the training and test sets.
Thanks for the clarification. I agree that for the purposes of model building the dataset is not ideal, and the molecules should be 'washed'. However, all I was doing here was looking at differences between the SlogP implementations - which should not depend on this step.
I must confess I have never really had a need to use SMARTS in MOE, so I was actually unaware (or had forgotten!) that S, N and P are viewed as charge-separated in positive oxidation states - this would certainly explain the bulk of the differences, and I'm sure the patterns can be modified accordingly in the RDKit implementation to match the intended behaviour.
On this point, however, are the MOE SMARTS rules fully documented anywhere? Having done a bit more investigation (using the MOE nodes in KNIME) I have found that O=S=O will match [S+1], [O-0] and [O-1] queries - so it is presumably represented as [O-]-[S+]=O - whereas CS(=O)(=O)Cl matches [S+2] and [O-1], i.e. C-[S+2](-[O-])(-[O-])-Cl. I could guess at the rules governing this, but would rather not guess!
As far as I know, the MOE documentation is the only documentation available for MOE SMARTS. This makes it almost impossible to re-implement the matching, as reverse engineering is not allowed under the license agreement. Further to my previous note, one of my colleagues told me that in some cases (the work was done for another project) it's really tricky to get the SMARTS right. So I believe we have to live with the differences.
At least some of the problem is due to a mistake in (at least) one of the SMARTS definitions used in the RDKit.
It's going to take a bit of research to track this down and fix it, but it'll be in the next release.