Announcing the interoperability between RDKit and Python in KNIME 4.7

carstenhaubold · January 26, 2023, 10:02am

Dear all,

We are glad to share more details about the RDKit 4.7 feature release, which is compatible with KNIME Analytics Platform 4.7.

The most notable features are:

native read/write support for RDKit types in KNIME’s columnar backend (see KNIME Columnar Table Backend Boosts Performance)
support for RDKit types in the new Python Script nodes (which have been “out of labs” since KNIME AP 4.7), including RXN reactions, which were not supported by the previous Python Script nodes (now Python Script (legacy))
support for RDKit types in pure-Python KNIME Extensions
support for Apple Silicon added by @ptosco, @manuelschwarze and @greglandrum: you can use RDKit with the Apple Silicon version of KNIME Analytics Platform

Thanks to @steffen_KNIME for support with the implementation, @manuelschwarze and @greglandrum for reviewing, and @Alice_Krebs and @greglandrum for testing!

Below are a few examples to help you get started with RDKit and the other chemistry types in KNIME’s new Python nodes.

Note: These examples require the Python Script node and do not work with the Python Script (legacy) nodes in KNIME 4.7.

1. Converting between molecule types in Python

If you want to convert between RDKit molecules and SMILES or SMARTS in a Python Script node, you can proceed as follows:

import knime.scripting.io as knio
import knime.types.chemistry as ktchem
import rdkit.Chem

df = knio.input_tables[0].to_pandas()

# Convert SMILES to RDKit molecules, assuming there is a "Smiles" column
mols = [rdkit.Chem.MolFromSmiles(s) for s in df["Smiles"]]
df["RDKit Molecules"] = mols

# Convert RDKit molecules to Smarts.
# Note that SMILES, SMARTS, SDF and many other chemistry types are
# represented as strings in Python. To let KNIME know which type it should be,
# we create SmartsValue (or SmilesValue, SdfValue, ...) objects from the strings.
df["Smarts"] = [ktchem.SmartsValue(rdkit.Chem.MolToSmarts(m)) for m in mols]

knio.output_tables[0] = knio.Table.from_pandas(df)
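
One caveat for the conversion above: rdkit.Chem.MolFromSmiles returns None when a SMILES string cannot be parsed, so a later MolToSmarts call would fail for that row. A minimal sketch of one way to guard against this follows. The helper name parse_all is hypothetical (it is not part of KNIME or RDKit), and the parser is passed in as an argument so that the helper itself has no RDKit dependency.

```python
def parse_all(smiles_list, parse):
    """Parse each entry with `parse`; return (results, failed_positions).

    `parse` is any callable that returns None on failure, which is how
    rdkit.Chem.MolFromSmiles reports an unparsable SMILES string.
    """
    results, failed = [], []
    for i, s in enumerate(smiles_list):
        # Non-string entries (e.g. missing values) are treated as failures.
        mol = parse(s) if isinstance(s, str) else None
        if mol is None:
            failed.append(i)
        results.append(mol)
    return results, failed


# Hypothetical usage inside the Python Script node from the example above:
#   mols, failed = parse_all(df["Smiles"], rdkit.Chem.MolFromSmiles)
#   print(f"{len(failed)} rows could not be parsed: {failed}")
```

The positions in failed can then be used to inspect or drop the affected rows before converting the remaining molecules.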

The same works in the execute method of a pure-Python node, as we see in the next example.

2. Using RDKit in Pure-Python nodes

When writing KNIME Python Extensions, you can now work with RDKit types as well. One thing you have to do in a pure-Python node is to specify the types of all output columns in the configure method. To use RDKit or chemistry types when creating KNIME table schemas and columns, simply provide the molecule types or SMILES, SMARTS or SDF values as in the example below.

import knime.extension as knext
import knime.types.chemistry as ktchem
from rdkit import Chem

def is_smiles(col):
    """
    Check if the provided knext.Column contains smiles values. Due to
    technical reasons, the type of the column can be either SmilesValue or 
    SmilesAdapterValue, so here we check for both.
    
    The function knext.logical turns the value types into KNIME column types,
    and is called "logical" as opposed to "primitive" like plain numbers.
    """
    return (
        col.ktype == knext.logical(ktchem.SmilesValue)
        or col.ktype == knext.logical(ktchem.SmilesAdapterValue)
    )


@knext.node(
    name="Smiles To RDKit Molecule",
    node_type=knext.NodeType.MANIPULATOR,
    icon_path="icon.png",
    category="/",
)
@knext.input_table(name="Input Data", description="Input table containing SMILES")
@knext.output_table(name="Output Data", description="Table plus a 'Molecule' column")
class SmilesToRDKitMol:
    """Smiles to RDKit molecule conversion
    
    This node converts Smiles to RDKit molecules
    """

    smiles_column = knext.ColumnParameter(
        label="SMILES column",
        description="Input column containing SMILES data",
        column_filter=is_smiles,
    )

    mol_column_name = knext.StringParameter(
        label="Molecule column name",
        description="Name of the newly created column",
        default_value="Molecule",
    )

    def configure(self, configure_context, input_schema):
        new_col = knext.Column(Chem.Mol, self.mol_column_name)
        return input_schema.append(new_col)

    def execute(self, exec_context, input_table):
        df = input_table.to_pandas()
        
        mols = [Chem.MolFromSmiles(s) for s in df[self.smiles_column]]
        df[self.mol_column_name] = mols

        return knext.Table.from_pandas(df)

Note how in the configure method a column is created with the type Chem.Mol. This suffices to tell KNIME that the new column contains RDKit molecules and you will be able to work with this column in KNIME with other RDKit nodes.

The input column selection parameter filters the columns such that only those containing SMILES values are available for selection. In this example, the column filter is_smiles is provided when defining the parameter smiles_column.
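
Because the column filter is just a function that takes a column and returns a boolean (as is_smiles does above), filters for other types can be built the same way. The following is a sketch only: ktype_filter is a hypothetical helper name, and the stand-in column objects merely mimic the ktype attribute used in the example so that the snippet runs outside KNIME.

```python
from types import SimpleNamespace


def ktype_filter(*accepted_ktypes):
    """Build a column filter that accepts any of the given column types."""
    def _filter(col):
        # Same comparison as in is_smiles: col.ktype against each accepted type.
        return any(col.ktype == k for k in accepted_ktypes)
    return _filter


# Inside a KNIME extension, one would presumably write something like:
#   is_smiles = ktype_filter(
#       knext.logical(ktchem.SmilesValue),
#       knext.logical(ktchem.SmilesAdapterValue),
#   )

# Stand-in demonstration with plain strings in place of KNIME column types.
is_smiles_like = ktype_filter("SmilesValue", "SmilesAdapterValue")
columns = [
    SimpleNamespace(name="Structure", ktype="SmilesValue"),
    SimpleNamespace(name="Weight", ktype="double"),
]
selectable = [c.name for c in columns if is_smiles_like(c)]
print(selectable)
```

Here only the column whose type matches one of the accepted types remains selectable, which is the behavior the smiles_column parameter relies on.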

See Contents — KNIME Python API documentation and Create a New Python based KNIME Extension for details of how to write Python nodes.

Note: When working with RDKit data inside pure-Python nodes, the extension needs to declare a dependency on RDKit, as explained in Create a New Python based KNIME Extension. The ID of the RDKit KNIME Integration is org.rdkit.knime.feature.

Note: To find out which types are available, you can, for example, run the following in a Python Script node:
import knime.api.schema
print(knime.api.schema.LogicalType.supported_value_types())

Sending fingerprints from Python to KNIME

To compute molecule fingerprints in a Python Script node and send them back to KNIME, you can proceed as follows:

import knime.scripting.io as knio
import rdkit.Chem
from rdkit.Chem import rdFingerprintGenerator

df = knio.input_tables[0].to_pandas()

# Generate fingerprints for all molecules
fingerprint_generator = rdFingerprintGenerator.GetMorganGenerator(fpSize=1024)
fingerprints = [fingerprint_generator.GetFingerprint(m) for m in df["RDKit Molecule"]]

# Add fingerprint column to our DataFrame
df["Fingerprint"] = fingerprints

knio.output_tables[0] = knio.Table.from_pandas(df)

Working with KNIME-provided fingerprints in Python

Fingerprints coming from KNIME are initially represented in Python as the KNIME type DenseBitVector (for normal fingerprints) or DenseByteVector (for count fingerprints). You can cast these to RDKit fingerprints directly, but for count fingerprints you need to know the specific underlying type:

import knime.scripting.io as knio
import knime.api.schema as ks
import knime.types.builtin as ktypes
import rdkit.DataStructs  # a bare "import rdkit" does not load the DataStructs submodule

df = knio.input_tables[0].to_pandas()

fingerprint_type = ks.logical(rdkit.DataStructs.cDataStructs.ExplicitBitVect).to_pandas()
fingerprint_column = df["Fingerprint"].astype(fingerprint_type)

# could be UIntSparseIntVect, IntSparseIntVect, LongSparseIntVect or ULongSparseIntVect
count_fingerprint_type = ks.logical(rdkit.DataStructs.cDataStructs.UIntSparseIntVect).to_pandas()
count_fingerprint_column = df["CountFingerprint"].astype(count_fingerprint_type)

Working with RXN reactions

You can also send reactions back and forth between KNIME and Python:

import knime.scripting.io as knio
import knime.types.chemistry as ktchem
from rdkit.Chem import rdChemReactions

df = knio.input_tables[0].to_pandas()

rxn_column = [rdChemReactions.ReactionFromRxnBlock(str(r)) for r in df['Rxn Reaction']]

df['rxn'] = rxn_column
df['smiles'] = [ktchem.SmilesValue(rdChemReactions.ReactionToSmiles(rxn)) for rxn in rxn_column]
df['molecule'] = [rdChemReactions.ReactionToMolecule(rxn) for rxn in rxn_column]

knio.output_tables[0] = knio.Table.from_pandas(df)

Best, Carsten

greglandrum · January 27, 2023, 10:08am

Thanks @carstenhaubold. I’m really looking forward to being able to create new RDKit nodes using Python!

kienerj · January 27, 2023, 12:57pm

Is there a plan to make it work with SMILES? It seems they need to be “cast” to string beforehand. The same goes for, say, SDF or molfiles.

I’m asking because I did some quick checks and see behavior similar to the legacy Python nodes: having RDKit mols in the KNIME table makes things a lot slower than passing SMILES/SDF and converting them to RDKit mols inside the Python script.

A quick check using the descriptor calculation function from the blog shows that, for about 3000 molecules, passing SDF into the Python node is about 32% faster than passing RDKit molecules directly: roughly 8s vs 11s.

carstenhaubold · January 27, 2023, 1:02pm

What do you mean by “make it work with SMILES”? You can already use SMILES and SDF, as seen e.g. in example 1 above.

SMILES and SDF are passed between KNIME and Python as strings, but in Python they have a lightweight type wrapper (knime.types.chemistry.SmilesValue or knime.types.chemistry.SdfValue respectively). This type wrapper directly extends string in Python, so as long as you pass those values to methods that expect the type to be any subclass of string, they should work immediately.
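
To illustrate what “directly extends string” means in practice, here is a minimal stand-in. The class SmilesLike below is hypothetical and is not the real knime.types.chemistry.SmilesValue; it only mimics the pattern of subclassing str.

```python
class SmilesLike(str):
    """Stand-in for a lightweight type wrapper that subclasses str."""


def count_letters(text: str) -> int:
    """Toy function that expects a plain string."""
    return sum(ch.isalpha() for ch in text)


ethanol = SmilesLike("CCO")

# The wrapper passes isinstance checks for str and works with str-based code.
print(isinstance(ethanol, str))   # True
print(count_letters(ethanol))     # 3
print(ethanol == "CCO")           # True

# str methods return a plain str, so the wrapper type is lost ...
print(type(ethanol.strip()).__name__)              # str
# ... unless the result is wrapped again.
print(type(SmilesLike(ethanol.strip())).__name__)  # SmilesLike
```

This mirrors why example 1 wraps the output of MolToSmarts in a SmartsValue: a plain string alone does not tell KNIME which chemistry type is meant.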

kienerj · January 27, 2023, 1:45pm

Indeed. I was getting a kernel error (see below), but a KNIME restart fixed it and it now works.
Because of the error message I assumed the column types were not supported, but it must have been some “glitch”.

2023-01-27 13:44:39,879 : ERROR : KNIME-Worker-20 : : PythonSourceCodePanel : Python Script : 3:9:0:2 : An exception occured while running the Python kernel. See log for details.
org.knime.python2.kernel.PythonIOException: An exception occured while running the Python kernel. See log for details.
at org.knime.python3.scripting.Python3KernelBackend.putDataTable(Python3KernelBackend.java:447)
at org.knime.python3.scripting.Python3KernelBackend.putDataTable(Python3KernelBackend.java:427)
at org.knime.python2.kernel.PythonKernel.putDataTable(PythonKernel.java:284)
at org.knime.python2.kernel.PythonKernelManager$PutDataRunnable.run(PythonKernelManager.java:494)
at org.knime.core.util.ThreadUtils$RunnableWithContextImpl.runWithContext(ThreadUtils.java:367)
at org.knime.core.util.ThreadUtils$RunnableWithContext.run(ThreadUtils.java:221)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at org.knime.core.util.ThreadPool$MyFuture.run(ThreadPool.java:123)
at org.knime.core.util.ThreadPool$Worker.run(ThreadPool.java:246)
Caused by: java.lang.ClassCastException: class org.knime.chem.types.AbstractStringBasedValueFactory$AbstractCellReadValue cannot be cast to class org.knime.chem.types.SmilesValue (org.knime.chem.types.AbstractStringBasedValueFactory$AbstractCellReadValue and org.knime.chem.types.SmilesValue are in unnamed module of loader org.eclipse.osgi.internal.loader.EquinoxClassLoader @2162e4a)
at org.knime.chem.types.SmilesCellValueFactory.getValueAsString(SmilesCellValueFactory.java:1)
at org.knime.chem.types.AbstractStringBasedValueFactory$AbstractCellWriteValue.setValue(AbstractStringBasedValueFactory.java:82)
at org.knime.core.data.columnar.table.virtual.WriteAccessRowWrite.setFrom(WriteAccessRowWrite.java:120)
at org.knime.core.data.columnar.table.virtual.WriteAccessRowWrite.setFrom(WriteAccessRowWrite.java:1)
at org.knime.python3.arrow.PythonArrowDataSourceFactory.copyTable(PythonArrowDataSourceFactory.java:196)
at org.knime.python3.arrow.PythonArrowDataSourceFactory.copyTableToArrowStore(PythonArrowDataSourceFactory.java:180)
at org.knime.python3.arrow.PythonArrowDataSourceFactory.extractStoreCopyTableIfNecessary(PythonArrowDataSourceFactory.java:171)
at org.knime.python3.arrow.PythonArrowDataSourceFactory.createSource(PythonArrowDataSourceFactory.java:121)
at org.knime.python3.scripting.Python3KernelBackend$PutDataTableTask.call(Python3KernelBackend.java:740)
at org.knime.python3.scripting.Python3KernelBackend$PutDataTableTask.call(Python3KernelBackend.java:1)
at org.knime.core.util.ThreadUtils$CallableWithContextImpl.callWithContext(ThreadUtils.java:383)
at org.knime.core.util.ThreadUtils$CallableWithContext.call(ThreadUtils.java:269)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)

carstenhaubold · January 27, 2023, 2:03pm

Hm, yes, the error looks as if the Java side did not notice the new type support. Did you restart after installing the KNIME RDKit and Chemistry extensions?

I’m glad that it works now

kienerj · January 30, 2023, 8:58am

Actually, the extensions have been installed for a very long time, but I did upgrade from 4.6.3 to 4.7 just a couple of days prior. Still, I do not think it is related, as I have had similar issues before when “playing” with the Python snippet: at some point it is just corrupt and only a restart fixes it.

daria.goldmann · January 31, 2023, 7:56am

Thank you KNIME Team! So looking forward to implementing it!

gcincilla · March 16, 2023, 1:56pm

Thank you @carstenhaubold, this is really great news!!
Is it possible to run the examples you provided with the Python Script (Labs) node of KNIME Analytics Platform 4.6 as well? I’m asking because I’ve tried it and it gives me the following error:

ModuleNotFoundError: No module named 'knime'

To run the node I’m using the default Conda environment generated with:

KNIME -> Preferences -> Python (Labs) -> New environment...

Am I missing something, or is this not possible in KNIME 4.6?

carstenhaubold · March 16, 2023, 3:41pm

@gcincilla: unfortunately, the RDKit integration does not work with the Python Script (Labs) node in KNIME AP 4.6. Changes were needed on both sides, and these are only available since KNIME AP 4.7.

One of these changes was cleaning up the structure of the knime Python modules. Those were still called e.g. knime_io (4.6) instead of knime.scripting.io (4.7). All code snippets above are tailored to KNIME >= 4.7.
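
If a script needs to find out which of the two module layouts is present, one option is to try the names in order. This is only a sketch: the helper import_first_available is hypothetical, and finding knime_io does not make the RDKit examples above work on 4.6, since the required changes only exist from 4.7 on.

```python
import importlib


def import_first_available(module_names):
    """Return the first importable module from module_names, or None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # ModuleNotFoundError is a subclass of ImportError.
            continue
    return None


# Prefer the KNIME 4.7 layout, then fall back to the 4.6 name.
knio = import_first_available(["knime.scripting.io", "knime_io"])
if knio is None:
    print("No KNIME scripting module found (not running inside a KNIME Python node?)")
else:
    print(f"Using {knio.__name__}")
```

Outside of a KNIME Python node neither module exists, so the helper returns None instead of raising the ModuleNotFoundError shown above.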

gcincilla · March 16, 2023, 4:27pm

OK @carstenhaubold, thank you very much for your quick reply!

system · June 14, 2023, 4:28pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.