Speed of transferring RDKit molecules into Python nodes

kienerj · February 2, 2023, 6:57am

I want to take this up in a new thread. Here is my quote from the announcement thread:

I want to come back to this observation, which already existed in the legacy Python nodes and is still present in the new ones.

Transferring rdkit molecules from KNIME to Python is very slow. See my quote above. It’s always a lot faster to use sdf or SMILES and do the conversion in the Python code.

The “RDKit” case uses the RDKit From Molecule node before the Python snippet; the “SDF” case just uses MolFromMolBlock in the Python code.

This is for about 3000 molecules. I put the python code at the end of the post.

I tend to go “out of my way” to input SMILES or SDF into Python nodes rather than RDKit mols, for speed. It can really start to matter with a higher number of molecules.

Anyway, I was wondering why this large difference exists. I thought there was no more copying of memory with the new nodes, but it looks like some form of moving or copying of data is still happening?

Code with rdkit mols:

import knime.scripting.io as knio
from rdkit import Chem
from rdkit.Chem import Descriptors
from joblib import Parallel, delayed
import pandas as pd

def getMolDescriptors(mol):
    '''
    calculate the full list of descriptors for a molecule
    '''
    res = {}
    for nm, fn in Descriptors._descList:
        # some of the descriptor functions can throw errors if they fail, catch those here:
        try:
            val = fn(mol)
        except Exception:
            val = None
        res[nm] = val
    return res
        
        

df = knio.input_tables[0].to_pandas()
descriptors = Parallel(n_jobs=6)(delayed(getMolDescriptors)(x) for x in df['RDKit Mol'])
desc_df = pd.DataFrame(descriptors, index=df.index)
output_table = df.join(desc_df)
knio.output_tables[0] = knio.Table.from_pandas(output_table)

Code with sdf/smiles:

import knime.scripting.io as knio
from rdkit import Chem
from rdkit.Chem import Descriptors
from joblib import Parallel, delayed
import pandas as pd


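def getMolDescriptors(mol, mol_format):
    '''
    calculate the full list of descriptors for a molecule given as text
    '''
    # parse the text once, before the loop: re-parsing inside the loop fails
    # from the second descriptor on, because mol is then already a Mol object
    try:
        if mol_format == "Smiles":
            mol = Chem.MolFromSmiles(mol)
        elif mol_format == "SDF":
            mol = Chem.MolFromMolBlock(mol)
    except Exception:
        mol = None
    res = {}
    for nm, fn in Descriptors._descList:
        # some of the descriptor functions can throw errors if they fail, catch those here:
        try:
            val = fn(mol)
        except Exception:
            val = None
        res[nm] = val
    return res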
        
        
mol_format = knio.flow_variables['Column Type']
df = knio.input_tables[0].to_pandas()
descriptors = Parallel(n_jobs=6)(delayed(getMolDescriptors)(x,mol_format) for x in df['Molecule'])
desc_df = pd.DataFrame(descriptors, index=df.index)
output_table = df.join(desc_df)
knio.output_tables[0] = knio.Table.from_pandas(output_table)

steffen_KNIME · February 2, 2023, 8:31am

Hi kienerj,
that is interesting; thanks for specifying. Sorry, I need to ask the following question: are you using the Columnar Backend extension? Probably yes, but we have to rule that out
(to check: right-click on the open workflow of your test, Configure... -> Select Table Backend -> Columnar Backend)

Thanks
Steffen

kienerj · February 2, 2023, 9:37am

Yes, I’m using the Columnar Backend.

kienerj · February 2, 2023, 10:02am

The difference can also be seen when doing nothing at all in the python script:

import knime.scripting.io as knio

knio.output_tables[0] = knio.input_tables[0]

In fact, the relative difference is much bigger here, which is not surprising: the more intensive your computation, the less the transfer matters for total runtime. Still, passing through the RDKit mols is 7.5 times slower than sdf/test.

Also checked SMILES which is a bit faster than sdf (makes sense, shorter strings).
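
To separate transfer time from computation time inside a script, a small timing helper can be wrapped around each step. This is only a sketch using the standard library; the `timed` helper, the labels, and the placeholder workloads are made up for illustration and are not part of the KNIME API.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent in the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Placeholder for the step that pulls the table into Python.
with timed("transfer (placeholder)", timings):
    data = [str(i) for i in range(100_000)]

# Placeholder for the actual computation on the data.
with timed("computation (placeholder)", timings):
    total = sum(len(s) for s in data)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

In a Python Script node, the first block would wrap `knio.input_tables[0].to_pandas()` and the second the descriptor calculation, so the two contributions show up separately in the node's console output.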

Is the RDKit mol passed around as pickle/binary? Shouldn’t that take less space than SDF? Or does it get instantiated upon loading into Python?
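
One way to probe the size question outside KNIME is to compare RDKit's own binary format with SMILES and mol blocks for the same molecules. The sketch below does that for a handful of example molecules (chosen arbitrarily) and also times the conversion back to mol objects. It says nothing about how KNIME actually transfers the column, which this thread does not establish; it only compares the three representations within RDKit.

```python
import time
from rdkit import Chem

# Arbitrary example molecules, repeated to get measurable timings.
smiles_list = [
    "CCO",
    "c1ccccc1",
    "CC(=O)Oc1ccccc1C(=O)O",
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
] * 250
mols = [Chem.MolFromSmiles(s) for s in smiles_list]

# How each representation is written and read back.
encoders = {
    "binary": lambda m: m.ToBinary(),
    "smiles": Chem.MolToSmiles,
    "molblock": Chem.MolToMolBlock,
}
decoders = {
    "binary": Chem.Mol,
    "smiles": Chem.MolFromSmiles,
    "molblock": Chem.MolFromMolBlock,
}

report = {}
for name, encode in encoders.items():
    payloads = [encode(m) for m in mols]
    start = time.perf_counter()
    decoded = [decoders[name](p) for p in payloads]
    elapsed = time.perf_counter() - start
    report[name] = {
        # bytes for the binary format, characters (ASCII) for the text formats
        "total_size": sum(len(p) for p in payloads),
        "decode_seconds": elapsed,
        "atoms_ok": all(
            d.GetNumAtoms() == m.GetNumAtoms() for d, m in zip(decoded, mols)
        ),
    }

for name, row in report.items():
    print(name, row)
```

If the binary payloads turn out to be no larger than the mol blocks and decode quickly here, the slowdown would have to come from somewhere else in the transfer path rather than from the molecule format itself.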

steffen_KNIME · February 2, 2023, 11:05am

Possibly an issue of serialization and not one of size. I opened a ticket (AP-20109) and we will come back to you when we have a fix.

Best regards
Steffen
