Execute failed: ("GenericRDKitException"): null

Dear All,

I have just started having a few problems with knime 2.4.1 on Windows 7.  I was investigating the use of the RDKit fingerprint node in combination with the Erl Wood Fingerprints Expander node on a set of ~ 6500 compounds.

After expanding the fingerprint I wanted to see if PCA was a good way of allowing me to visualise at least some of the chemical space spread in my set...

Anyway, prior to the PCA I thought I would get rid of the bit columns of low variance across the set using the Low Variance Filter node (setting 0.1), followed by a Normalizer on the remaining columns (54).  It is at this point that I seem to be getting some strange errors - either node hanging, knime.exe runtime error, or the following error in the console window:

 

ERROR Normalizer Execute failed: ("GenericRDKitException"): null

WARN PCA Compute column "Bit 1" not found, selected default columns

ERROR SyncExecQueueDispatcher Uncaught exception while queuing events into main thread

 

I'm not quite sure why the Normalizer should be throwing an RDKit exception(?) - also, looking at the log file (see below), the whole thing seems to end with an out of memory problem (which may explain why all my node icons just turned black!).

Apologies if this isn't an RDKit issue - and if it is, I hope the log file excerpt will help track it down!  : )

Kind regards

James

 

2011-08-02 08:05:08,798 DEBUG KNIME-Worker-0 Normalizer : reset 2011-08-02 08:05:2011-08-02 08:05:08,798 DEBUG KNIME-Worker-0 Normalizer : reset

2011-08-02 08:05:08,798 DEBUG KNIME-Worker-0 Normalizer : clean output ports.

2011-08-02 08:05:08,798 ERROR KNIME-Worker-0 Normalizer : Execute failed: ("GenericRDKitException"): null

2011-08-02 08:05:08,799 DEBUG KNIME-Worker-0 Normalizer : Execute failed: ("GenericRDKitException"): null

org.RDKit.GenericRDKitException

            at org.RDKit.RDKFuncsJNI.new_Int_Vect__SWIG_1(Native Method)

            at org.RDKit.Int_Vect.<init>(Int_Vect.java:43)

            at org.rdkit.knime.types.RDKitMolCell2.toROMol(RDKitMolCell2.java:213)

            at org.rdkit.knime.types.RDKitMolCell2.readMoleculeValue(RDKitMolCell2.java:176)

            at org.rdkit.knime.types.RDKitMolValue$RDKUtilityFactory$1.compareDataValues(RDKitMolValue.java:107)

            at org.knime.core.data.DataValueComparator.compare(DataValueComparator.java:123)

            at org.knime.base.data.statistics.StatisticsTable.calculateAllMoments(StatisticsTable.java:273)

            at org.knime.base.data.statistics.StatisticsTable.calculateAllMoments(StatisticsTable.java:218)

            at org.knime.base.data.statistics.StatisticsTable.<init>(StatisticsTable.java:157)

            at org.knime.base.data.normalize.Normalizer.doZScoreNorm(Normalizer.java:285)

            at org.knime.base.node.preproc.normalize.NormalizerNodeModel.calculate(NormalizerNodeModel.java:318)

            at org.knime.base.node.preproc.normalize.NormalizerNodeModel.execute(NormalizerNodeModel.java:281)

            at org.knime.core.node.NodeModel.executeModel(NodeModel.java:524)

            at org.knime.core.node.Node.execute(Node.java:873)

            at org.knime.core.node.workflow.SingleNodeContainer.performExecuteNode(SingleNodeContainer.java:840)

            at org.knime.core.node.exec.LocalNodeExecutionJob.mainExecute(LocalNodeExecutionJob.java:100)

            at org.knime.core.node.workflow.NodeExecutionJob.run(NodeExecutionJob.java:166)

            at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)

            at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)

            at java.util.concurrent.FutureTask.run(Unknown Source)

            at org.knime.core.util.ThreadPool$MyFuture.run(ThreadPool.java:124)

            at org.knime.core.util.ThreadPool$Worker.run(ThreadPool.java:239)

2011-08-02 08:05:08,800 DEBUG KNIME-Worker-0 WorkflowManager : Normalizer 0:0:8 doBeforePostExecution

2011-08-02 08:05:08,800 DEBUG KNIME-Worker-0 NodeContainer : Normalizer 0:0:8 has new state: POSTEXECUTE

2011-08-02 08:05:08,800 DEBUG KNIME-Worker-0 KnimeResourceNavigator : Node message changed: ERROR: Error in sub flow.

2011-08-02 08:05:08,801 DEBUG KNIME-Worker-0 NodeContainer : KNIME_project3 0:0 has new state: EXECUTING

2011-08-02 08:05:08,801 DEBUG KNIME-Worker-0 WorkflowManager : Normalizer 0:0:8 doAfterExecute - failure

2011-08-02 08:05:08,801 DEBUG KNIME-Worker-0 Normalizer : reset

2011-08-02 08:05:08,801 DEBUG KNIME-Worker-0 Normalizer : clean output ports.

2011-08-02 08:05:08,801 DEBUG KNIME-Worker-0 NodeContainer : Normalizer 0:0:8 has new state: IDLE

2011-08-02 08:05:08,801 DEBUG KNIME-Worker-0 Normalizer : Configure succeeded. (Normalizer)

2011-08-02 08:05:08,801 DEBUG KNIME-Worker-0 NodeContainer : Normalizer 0:0:8 has new state: CONFIGURED

2011-08-02 08:05:08,802 WARN  KNIME-Worker-0 PCA Compute : column "Bit 1" not found, selected default columns

2011-08-02 08:05:08,802 DEBUG KNIME-Worker-0 PCA Compute : Configure succeeded. (PCA Compute)

2011-08-02 08:05:08,802 DEBUG KNIME-Worker-0 NodeContainer : PCA Compute 0:0:5 has new state: CONFIGURED

2011-08-02 08:05:08,812 DEBUG KNIME-Worker-0 PCA Apply : Configure succeeded. (PCA Apply)

2011-08-02 08:05:08,812 DEBUG KNIME-Worker-0 NodeContainer : PCA Apply 0:0:6 has new state: CONFIGURED

2011-08-02 08:05:08,815 DEBUG KNIME-Worker-0 KnimeResourceNavigator : state changed to CONFIGURED

2011-08-02 08:05:08,815 DEBUG KNIME-Worker-0 NodeContainer : KNIME_project3 0:0 has new state: CONFIGURED

2011-08-02 08:05:08,815 DEBUG KNIME-Worker-0 NodeContainer : KNIME_project3 0:0 has new state: CONFIGURED

2011-08-02 08:05:08,815 DEBUG KNIME-WFM-Parent-Notifier NodeContainer : Workflow Manager 0 has new state: IDLE

2011-08-02 08:05:08,818 ERROR KNIME Sync Exec Dispatcher-1 SyncExecQueueDispatcher : Uncaught exception while queuing events into main thread

2011-08-02 08:05:08,819 DEBUG KNIME Sync Exec Dispatcher-1 SyncExecQueueDispatcher : Uncaught exception while queuing events into main thread

java.lang.OutOfMemoryError: unable to create new native thread

            at java.lang.Thread.start0(Native Method)

            at java.lang.Thread.start(Unknown Source)

            at java.util.concurrent.ThreadPoolExecutor.tryTerminate(Unknown Source)

            at java.util.concurrent.ThreadPoolExecutor.workerDone(Unknown Source)

            at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

            at java.lang.Thread.run(Unknown Source)

Hi James,

As you've already guessed, the log message indicates that you've run out of memory. The specific error is being triggered by the code that deserializes an RDKit molecule (constructs an RDKit molecule from the binary format that is stored in the knime cells). From the rest of the traceback, it looks like the error is happening in the normalizer node. I don't understand at all why the normalizer node should be deserializing the RDKit molecule column... you certainly aren't trying to normalize that column.

I don't know if you've tried this yet or not, but it might help to run the workflow up to the variance filter node and look at knime's memory useage (look at the bar in the bottom-right corner of the knime window and what you see in the Windows task window). Then execute each of the subsequent nodes in the workflow and see if/when the memory useage climbs.

One question: which version of the rdkit nodes are you using?

-greg

Hi Greg,

Thanks for the feedback - I will keep an eye on the memory usage as you suggest [on that note, it would be really great if the RDKit nodes played nicely with 64-bit Windows, so that the 64-bit version of knime could be used and have more memory available - but I know from previous posts that there are issues - I just thought I'd mention it again as a poorly Windows user!  : )  ]

I am using version 2.0.0.0001029 of the RDKit nodes.

Kind regards

James

I did a bit more investigating here and have discovered some behavior from knime that I just don't understand. Hopefully the knime guys can help here.

The attached workflow reads 10K lines from a file, converts them to RDKit molecules, generates fingerprints, runs the Fingerprints Expander, applies the variance filter, normalizes rows, then does PCA.

Here's the output line for "ps ux" for knime after some of those steps on my linux box. I forced a garbage collection before running each ps.

The columns are: USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME

  1. startup:  glandrum 20533 78.0  4.3 1457876 354988 pts/1  Sl   08:59   0:18 
  2. execute row filter: glandrum 20533 42.1  4.7 1465948 386832 pts/1  Sl   08:59   0:21
  3. create rdkit molecules: glandrum 20533 41.0  5.0 1875464 410432 pts/1  Sl   08:59   0:45
  4. create rdkit fingerprints: glandrum 20533 31.2  5.1 1877528 422400 pts/1  Sl   08:59   0:49
  5. fingerprint expander: glandrum 20533 27.7  6.0 1877528 494956 pts/1  Sl   08:59   0:55
  6. Low variance filter: glandrum 20533 21.9 12.9 2336392 1053028 pts/1 Sl   08:59   1:04 
  7. Normalizer: glandrum 20533 21.0 18.5 2730804 1513236 pts/1 Sl   08:59   1:14

At this point it's clear that the memory useage is starting to really go up. So to do an experiement, I went back, reset the molecule to rdkit node and ran through them again:

 

  1. create rdkit molecules: glandrum 20533 21.0 15.7 2537280 1281620 pts/1 Sl   08:59   1:40 
  2. create rdkit fingerprints:  glandrum 20533 21.1 15.7 2537280 1284192 pts/1 Sl   08:59   1:43
  3. fingerprint expander: glandrum 20533 21.7 15.5 2537280 1270824 pts/1 Sl   08:59   1:48 
  4. Low variance filter:  glandrum 20533 22.2 17.5 2665876 1433780 pts/1 Sl   08:59   1:54
  5. Normalizer: glandrum 20533 22.3 18.3 2766928 1497848 pts/1 Sl   08:59   2:03

Then I restarted knime and configured the RDKit fingerprinter node to remove the source column, so everything downstream of that node no longer has an RDKit dependency:

 

  1. create rdkit molecules:  glandrum 20808  136  4.3 1950580 353356 pts/1  Sl   09:11   0:42
  2. create rdkit fingerprints:  glandrum 20808 34.1  5.2 1956160 426428 pts/1  Sl   09:11   0:47
  3. fingerprint expander: glandrum 20808 30.5  5.5 1956160 453844 pts/1  Sl   09:11   0:54
  4. Low variance filter:  glandrum 20808 30.0  5.2 1956160 428384 pts/1  Sl   09:11   0:58
  5. Normalizer: glandrum 20808 26.0  5.1 1956172 423564 pts/1  Sl   09:11   1:03 

The memory useage is stable... so now the situation is perhaps clear. There are two problems here:

  1. The low variance filter is deserializing columns that it's not actually using. There's definitely a performance penalty associated with this. Could it be that the filter is actually copying data instead of filtering it?
  2. The process of deserializing a bunch of RDKit molecules may be causing a spike in memory requirements, though it doesn't look like it's actually leaking memory. I tested this by repeatedly resetting the low-variance filter and re-running it when there was an RDKit molecule column, the memory useage did not continuously climb.

These were run with knime v2.4.0 and the release version of the rdkit nodes.

Does my analysis make sense? 

-greg

Since RDKit cells are not blob cells they are stored inline in the data table stream. Each node iterating over a table needs to deserialize all such inline cells.

Two possible solutions come to my mind: make RDKitCells blob cells (again) or do lazy initialization of the underlying object i.e. just store the byte[] array in the cell and only create the ROMol object from it when it is needed (saves about 50% of memory).

Thanks Greg and Thor for exploring this further.

@Greg - yes, your analysis makes sense to me - but in terms of suggesting performance improvements, I think it is clear that Thor's input will be much more useful than mine!  : )

Of course, the explanation suggests that in the short-term one could just not pass the RDKit mol column on to other nodes where it is not required (if memory problems are apparent).

Kind regards

James

James,

I'm going to follow up with Thorsten and Bernd about this offline, but you are definitely correct that you can save yourself substantial time by not passing the RDKit molecule column anywhere it's not required.

-greg

Dear all:

Quick update on our findings:

  • There was indeed a memory leak in the RDKit-KNIME integration (not RDKit but the cell wrapper in KNIME), which has been fixed in the nightly build of the RDKit integration. This memory leak only becomes apparent if you either sort by the RDKit column (which doesn't make sense but you could do it) or use the normalizer node on a table that contains an RDKit column.
  • The normalizer node currently calculates some statistics on all of the input columns even if most of the columns are not being normalized. This is an inefficiency that we will need to fix (it doesn't compute garbage or leak memory -- so this is not critical but "only" inefficient).

Thanks to James for reporting this and thanks to Greg for the detailed analysis.

Regards,
  Bernd

PS: To correct Thorstens comment from above: The RDKit cell is saved inline - which makes sense because they are small (~1-2kB on average) and it avoids the blob overhead - and they are only deserialized on demand (which the Normalizer currently does).