Crashing Linux with RDKit

richards99 · August 13, 2024, 1:22pm

This seems to have been brought up about 5 years ago by a few people, but the issue still remains. @greglandrum
Using RDKit with KNIME on Linux crashes the whole Linux environment and forces a reboot. No errors are reported.
I find it always happens when the dataset is large (>1 million rows), and it is with the RDKit One Component and Two Component Reaction nodes.
I have tried running these in loops of 100k rows at a time, and the problem persists.
The Linux machine has 120GB of memory and 2TB of hard drive space. KNIME is set to use 80GB of memory in the knime.ini file.
KNIME never appears to run out of memory: the heap space never gets stuck at its limit or anything like that. The same workflow works perfectly on Mac.
The workflows on Linux always start off well, and then at some point after many loop iterations the crash suddenly occurs.
I have had around 20-30 crashes in the last week, so it happens easily. Unfortunately I cannot share any data.
As an example of the table structure, the input has 2.3 million rows and 38 columns. All columns are SMILES, String, Double, or Integer. In addition, 60 SMARTS are fed into the node via a flow variable and looped over.
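
For reference, the heap limit mentioned above is normally the -Xmx line that follows -vmargs in knime.ini. Below is a minimal Python sketch for reading that value back out of the file. The sample text and the helper name parse_xmx_bytes are made up for illustration and are not taken from an actual installation.

```python
import re

# Hypothetical knime.ini excerpt (illustration only); a real file has more lines.
SAMPLE_KNIME_INI = """\
-vmargs
-Xmx80g
"""

# JVM size suffixes are binary multiples; no suffix means plain bytes.
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_xmx_bytes(ini_text):
    """Return the heap limit in bytes from the last -Xmx line, or None if absent."""
    value = None
    for line in ini_text.splitlines():
        match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", line.strip())
        if match:
            number, unit = match.groups()
            value = int(number) * _UNITS.get(unit.lower(), 1)
    return value


heap_bytes = parse_xmx_bytes(SAMPLE_KNIME_INI)
print(heap_bytes / 1024**3)  # heap limit in GiB; prints 80.0 for the sample
```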

Any advice would be greatly appreciated.

Simon.

richards99 · September 2, 2024, 5:58am

Just to follow up on this: after lots of experimenting, I found that the heap space setting makes a big difference to overall stability.
The Linux machine has 120GB of RAM.
Setting the heap space to 80GB results in crashes after running big jobs (even within chunk loops) with RDKit nodes.
Setting it to 110GB results in very frequent crashes.
Setting it to 40GB results in no crashes and good stability.

Could anyone more familiar with memory management explain why a smaller heap space allowance gives better stability than a large one when using RDKit nodes?
The large jobs I am running process 280 million rows of data, transforming the structures using RDKit with either the One Component Reaction, Remove H’s, or Canon Smiles nodes.

Thanks,

Simon.

carstenhaubold · September 4, 2024, 1:04pm

Hi @richards99,

Thank you for sharing your problem and your insights! Your observations are strikingly similar to what we observed with crashing executors.

It all boils down to the fact that heap space is not the only memory that is used. Heap space is the memory that is used within Java. However, if you are e.g. running Python processes next to KNIME, they require some memory, too.
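
One way to see this on Linux is to compare the resident memory of the KNIME process with its heap limit. The following is only a rough sketch: it parses the VmRSS field of /proc/<pid>/status, and the function names and the sample numbers are invented for illustration. Because the Java heap can never exceed the -Xmx limit, anything the process holds beyond that limit must live outside the heap.

```python
from pathlib import Path


def read_status(pid):
    """Read /proc/<pid>/status for a running process (Linux only)."""
    return Path(f"/proc/{pid}/status").read_text()


def parse_vmrss_kib(status_text):
    """Extract VmRSS (resident set size, in kiB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def off_heap_lower_bound_gib(rss_kib, heap_limit_gib):
    """Lower bound on memory held outside the Java heap, in GiB."""
    return max(0.0, rss_kib / 1024**2 - heap_limit_gib)


# Made-up sample: a process with 90 GiB resident and an 80 GiB heap limit.
sample_status = "Name:\tjava\nVmRSS:\t94371840 kB\n"
rss = parse_vmrss_kib(sample_status)
print(off_heap_lower_bound_gib(rss, heap_limit_gib=80))  # at least 10.0 GiB off-heap
```

In practice one would call read_status with the process ID of the KNIME Java process and log the value periodically while the workflow runs.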

In your case I assume that RDKit is being called from Java via JNI and allocates memory in its C++ core (@greglandrum could you clarify?). The amount of memory used by RDKit is then not part of Java’s heap space, so the -Xmx setting does not limit how much memory is used by RDKit. Maybe counterintuitively, in this scenario a large heap space allows the JVM to use up a lot of memory, leaving less memory for RDKit (or other so-called off-heap memory or external processes), and hence causing crashes.
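
The arithmetic behind this can be sketched with the numbers from this thread. This is a simplification that assumes the worst case, where the JVM actually fills its whole heap; the helper name and the optional other_gb parameter (memory for the OS and other processes) are hypothetical.

```python
TOTAL_RAM_GB = 120  # RAM of the Linux machine, as reported in this thread


def native_headroom_gb(total_ram_gb, xmx_gb, other_gb=0):
    """Memory left for off-heap use (e.g. RDKit's native allocations)
    if the JVM fills its entire heap. other_gb is a hypothetical
    allowance for the OS and other processes."""
    return total_ram_gb - xmx_gb - other_gb


# The three heap settings that were tried in this thread.
for xmx in (110, 80, 40):
    left = native_headroom_gb(TOTAL_RAM_GB, xmx)
    print(f"-Xmx{xmx}g leaves at most {left} GB outside the heap")
```

With a 110 GB heap only 10 GB can remain for everything else, with 80 GB it is 40 GB, and with 40 GB it is 80 GB, which is in line with the order of stability reported above.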

I don’t know enough about the internals of KNIME’s RDKit extension and RDKit itself, but it would be nice if we could somehow limit the amount of memory it uses.

I hope that helps to clarify your findings. Sorry I can’t offer a general solution right now.

Best,
Carsten

system · December 3, 2024, 1:05pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.