We have updated the nightly and 4.7 stable builds to v1.36.3. This fixes a minor bug in the MMP fragmentation nodes brought to our attention via email.
The bug was that the MMP Molecule Fragment node (MMP Molecule Fragment (RDKit) – KNIME Community Hub) mis-estimated the number of possible fragmentations for the Limit by Complexity / Maximum Fragmentations options when the number of cuts being made was not 1.
As this bug was discovered while answering a question about what the setting is meant to do, here is the explanation given to the user by email, in case anyone else finds it useful:
I added this feature to prevent unexpectedly complex molecules (i.e. molecules which despite fitting within a prefilter by HAC still could be fragmented in a large number of ways) from causing the fragmentation of a set of compounds to grind to a complete halt whilst one or two ‘pathological’ molecules caused the node to wait for a long time to complete. I implemented this by a crude approach as follows:
- Identify all the bonds which it is possible to cut for the specified number of cuts (you can see which bonds those are using the ‘MMP Show Cuttable Bonds’ node)
- As each fragmentation is a combination of those bonds, we can use the combination formula nCr = n! / ((n-r)! r!), where n is the number of cuttable bonds and r is the number of cuts to make. There are 2 additional things to consider:
- When r = 1, the number needs to reflect that each bond can be cut in 2 directions, e.g. A-B can result in a ‘key’ A-* and ‘value’ B-*, and also a ‘key’ B-* with ‘value’ A-*. This is why 23 matching bonds for r = 1 fails when the threshold is 45, but passes when the threshold is 46 (= 2 × 23). I agree this is slightly unintuitive (I had to think quite hard about this when I was looking back at the source code just now), but I think it is correct.
- When r = 2, we need to add in the possibility of each bond being cut twice if that option is selected, i.e. so that A-B becomes ‘key’ A-*.B-* with ‘value’ *-*. So in addition to the nCr combinations from above, we need to add a further n combinations (one per cuttable bond). The example has 8 cuttable bonds, which give for 2 cuts 8 + 8!/(6!2!) = 8 + 28 = 36, which is not the behaviour the node is showing – see below!
The caveat is that this approach will sometimes over-estimate molecular complexity when there is symmetry. Suppose an example SMILES string
Fc1ccc(C(F)(F)F)cc1 and the situation where we are only making cuts at bonds to F – there are 4 such bonds (i.e. n = 4), and so this method estimates for 2 cuts that there are 4 + 4!/(2!2!) = 4 + 6 = 10 combinations. However, in reality, three of the C-F bonds are symmetrically identical, and so the actual fragmentations would be one ‘double cut’ to the Ar-F bond, one ‘double cut’ to the ArC(F)(F)F group (not 3 cuts to each F separately), one with a cut to Ar-F and a cut to a CF3 C-F bond, and one with 2 cuts to 2 separate CF3 group C-F bonds – i.e. 4 fragmentations. Clearly this is not ideal, but the filter was intended to be a rough filtering step to avoid pathological failures in a workflow.
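For anyone who wants to check the arithmetic, the estimate and the symmetry caveat can be sketched in a few lines of Python. This is only an illustration of the description above, not the node's actual source code: the function names, the allow_double_cut flag, the hand-assigned symmetry-class labels, and the treatment of more than 2 cuts as plain nCr are all assumptions made for the example.

```python
from itertools import combinations_with_replacement
from math import comb
from typing import Sequence


def estimate_fragmentations(n_bonds: int, n_cuts: int,
                            allow_double_cut: bool = True) -> int:
    """Crude estimate of the number of fragmentations (hypothetical helper)."""
    if n_cuts == 1:
        # Each bond can be cut in 2 directions (key and value swapped).
        return 2 * n_bonds
    estimate = comb(n_bonds, n_cuts)  # nCr = n! / ((n-r)! r!)
    if n_cuts == 2 and allow_double_cut:
        # Each bond may also be cut twice, giving a '*-*' value.
        estimate += n_bonds
    # Assumption: for more than 2 cuts, plain nCr is used.
    return estimate


def count_with_symmetry(bond_classes: Sequence[str]) -> int:
    """Count distinct 2-cut fragmentations when symmetry-equivalent bonds
    share a label. Labels are assigned by hand here (an assumption); this
    simple counting is only meant for small examples like the one above."""
    distinct = set()
    indices = range(len(bond_classes))
    # Pairs (i, j) with i <= j; i == j is a double cut of a single bond.
    for i, j in combinations_with_replacement(indices, 2):
        classes = tuple(sorted((bond_classes[i], bond_classes[j])))
        distinct.add((classes, i == j))
    return len(distinct)


# The three cases discussed above
print(estimate_fragmentations(23, 1))  # 23 bonds, 1 cut
print(estimate_fragmentations(8, 2))   # 8 bonds, 2 cuts, double cuts allowed
print(estimate_fragmentations(4, 2))   # 4 C-F bonds in Fc1ccc(C(F)(F)F)cc1
print(count_with_symmetry(["ArF", "CF3", "CF3", "CF3"]))
```

The first three calls reproduce the 46, 36 and 10 worked through above, and the last gives the 4 symmetry-distinct fragmentations for the fluorinated example, which shows how far the crude estimate can overshoot.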
Hopefully the above clarifies.
Steve