Simple Matched Molecular Pairs (MMP) Example

This workflow provides a simple example of generating matched molecular pairs (MMPs) from a set of compounds and using them to predict molecules with improved properties, in this case CYP3A4 inhibition, using ChEMBL data.

The MMP Molecule Fragment node is configured to make 1 cut, using the original Hussain/Rea schema. As we have not pre-filtered the incoming molecule table, we limit complexity to 5000 cut combinations, and also filter the fragmentations by the ratio of, and minimum number of, unchanging atoms. We do not calculate graph distance fingerprints in this example (a single cut will always return an empty fingerprint), but we do calculate attachment point fingerprints in case we want to restrict the MMP by its molecular context later. We have passed through all the data columns and, for illustrative purposes, elected to render the fragmentation so we can see what is happening. We are using the ChEMBL Parent ID as the ID (note, therefore, that this column appears in the output as 'ID' and not under its incoming name, even though we select it in the pass-through table).

With the fragmentation performed, MMPs are generated. We defined a number of ratios (R/L) and differences (R-L; for log properties) and a few pass-through properties, including the attachment point fingerprint. NB: we also decided to restrict transforms by the change in heavy atom count. The first stage of the node execution is sorting the input table by the 'Key' column, after which pair generation is parallelised.

Some filtering is then performed. We could apply a simplistic filter, keeping any transform which has a negative value in the 'PCHEMBL_VALUE (R-L)' column (we want less active compounds against CYP3A4!), but that would give us transforms which sometimes improve matters but generally don't. Instead, we use groupby to give the mean and standard deviation of the change for each transform.
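As a rough illustration of the grouping step, and of the thresholds described in the next paragraph, the logic can be sketched in plain Python. This is not the KNIME implementation: the function names, the transform strings and the numbers are hypothetical, and the sketch assumes the sample standard deviation (treated as 0.0 for a single example).

```python
from collections import defaultdict
from statistics import mean, stdev


def summarise_transforms(rows):
    """Group (transform, delta) rows into {transform: (count, mean, std)}."""
    grouped = defaultdict(list)
    for transform, delta in rows:
        grouped[transform].append(delta)
    summary = {}
    for transform, deltas in grouped.items():
        # Sample standard deviation needs two or more values (assumption: 0.0 otherwise).
        spread = stdev(deltas) if len(deltas) > 1 else 0.0
        summary[transform] = (len(deltas), mean(deltas), spread)
    return summary


def select_transforms(summary, min_count=3, n_std=1.0):
    """Keep transforms with at least min_count examples whose mean change
    lies at least n_std standard deviations below zero, biggest decrease first."""
    kept = [
        (transform, count, avg, spread)
        for transform, (count, avg, spread) in summary.items()
        if count >= min_count and avg + n_std * spread <= 0.0
    ]
    return sorted(kept, key=lambda item: item[2])


# Hypothetical 'PCHEMBL_VALUE (R-L)' values; the transform strings are placeholders.
rows = [
    ("[*:1]Cl>>[*:1]F", -0.8), ("[*:1]Cl>>[*:1]F", -1.0), ("[*:1]Cl>>[*:1]F", -0.9),
    ("[*:1]C>>[*:1]O", -1.5), ("[*:1]C>>[*:1]O", 0.9), ("[*:1]C>>[*:1]O", -0.3),
    ("[*:1]Br>>[*:1]I", -2.0),
]

for row in select_transforms(summarise_transforms(rows)):
    print(row)
```

In this made-up data the second transform has a negative mean but a large spread, so it is rejected, and the third has only one example; a simple "negative value" filter would have kept both.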
We only want transforms where there are at least 3 examples (the final column in the grouping table), and the mean pCHEMBL change is at least 1 standard deviation below 0.0. Sorting the table gives the biggest changes first, and we can also look at the effect of the transform on ALOGP and PSA.

Finally, we apply the transforms, with and without filtering. The Apply Transforms node pre-sorts the transforms table, and then applies each transform to the entire molecules table. Two node views show progress, either in a simple form or in a more informative format showing the current 'active' transforms.

If we try to use 'Filter by attachment point fingerprint', we will get a warning at this point, as the group/ungroup sequence has lost the Fingerprint column properties which tell the node how to generate the fingerprint. The lower part of the workflow shows a workaround for fingerprint similarity filtering: use a joiner to attach the properties and fingerprints back to the required set of transforms. Note that in this case a low Tanimoto similarity threshold is required to get any matching transforms.

Notes
1 - The transform will be applied if any of the rows containing it pass the similarity threshold. Although the transform is the same, the environments from the molecules it was created from could (will?) be different.
2 - If there are multiple matching sites in a molecule, only those which match the environment similarity threshold will be reacted.
3 - If there are multiple matching sites, each site will be reacted in turn, with only products resulting from a single transformation returned.
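The "any row passes" rule in Note 1 can be sketched as follows. This is only an illustration under stated assumptions: fingerprints are modelled as sets of on-bit indices, which may not match how the node stores attachment point fingerprints, and the function names and bit values are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    # Assumed convention: two empty fingerprints count as identical.
    return 1.0 if union == 0 else len(a & b) / union


def transform_passes(site_fp, example_fps, threshold):
    """True if any example environment of the transform is similar enough to the site."""
    return any(tanimoto(site_fp, fp) >= threshold for fp in example_fps)


# Hypothetical environment of a matching site in the molecule to be transformed.
site = {1, 4, 7, 9}
# Hypothetical environments from the rows in which the transform was observed.
examples = [{1, 4, 8}, {2, 3, 5}]

print(transform_passes(site, examples, threshold=0.3))
print(transform_passes(site, examples, threshold=0.5))
```

Here the best-matching example shares 2 of 5 distinct bits with the site (similarity 0.4), so the transform passes at a threshold of 0.3 but not at 0.5, which is consistent with needing a low threshold to get any matches.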


This is a companion discussion topic for the original entry at https://kni.me/w/8UU32N_Xgi-fV9kc