Unresponsive Death of Recursive Loop End (2-ports)

Edlueze · March 24, 2016, 3:06am

How do I resolve a Recursive Loop End (2-port) that intermittently dies an unresponsive death and kills my workflow?

I have a very large workflow for crawling website data that currently takes between 6 hours and 12 hours to run. Anything that stops the workflow quickly becomes a critical issue. I am now faced with a Recursive Loop End (2-port) node that can stop at any time - last night it stopped about 4 hours into the crawl.

My knime.log file (log level = ERROR) looks like this:

2016-03-23 20:28:47,918 : ERROR : KNIME-Worker-25 : LocalNodeExecutionJob : Row Splitter : 0:1384:1098:1176:724 : Caught "ConcurrentModificationException": null
2016-03-23 21:02:00,718 : ERROR : KNIME-Worker-27 : LocalNodeExecutionJob : Row Splitter : 0:1384:1098:1176:724 : Caught "ConcurrentModificationException": null
2016-03-23 21:38:49,514 : ERROR : KNIME-Worker-33 : LocalNodeExecutionJob : Row Splitter : 0:1384:1098:1176:724 : Caught "ConcurrentModificationException": null
2016-03-23 21:52:04,064 : ERROR : KNIME-Worker-25 : LocalNodeExecutionJob : Row Splitter : 0:1384:1098:1176:724 : Caught "ConcurrentModificationException": null
2016-03-23 21:58:20,987 : ERROR : KNIME-Worker-31 : LocalNodeExecutionJob : Row Splitter : 0:1384:1115:724 : Caught "ConcurrentModificationException": null
2016-03-23 22:01:16,792 : ERROR : KNIME-Worker-28 : LocalNodeExecutionJob : Row Splitter : 0:1384:1098:1176:724 : Caught "ConcurrentModificationException": null
2016-03-23 22:18:48,666 : ERROR : KNIME-Worker-29 : LocalNodeExecutionJob : Row Splitter : 0:1384:1098:1176:724 : Caught "ConcurrentModificationException": null
2016-03-23 22:33:34,716 : ERROR : KNIME-Worker-27 : LocalNodeExecutionJob : End IF : 0:1384:1115:1110 : Caught "ConcurrentModificationException": null

MY WORKFLOW STOPPED : NODE UNRESPONSIVE : Recursive Loop End (2-ports)

Note that I removed some nuisance errors from the middle of the log, but the last ConcurrentModificationException (from the End IF node) seemed to occur immediately before the Recursive Loop End node became unresponsive.
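To see which nodes throw the exception most often, the log lines above can be tallied by node. The following is a minimal sketch, assuming every knime.log line follows the " : "-separated layout seen in the excerpt (timestamp, level, thread, logger, node name, node ID, message). The field names and function names are made up for illustration and are not part of KNIME.

```python
from collections import Counter

# Field layout inferred from the log excerpt above (an assumption, not a documented format):
# timestamp : level : thread : logger : node name : node ID : message
FIELD_NAMES = ("timestamp", "level", "thread", "logger", "node", "node_id", "message")


def parse_log_line(line):
    """Split one knime.log line into named fields, or return None if it does not fit."""
    parts = line.rstrip("\n").split(" : ", len(FIELD_NAMES) - 1)
    if len(parts) != len(FIELD_NAMES):
        return None
    return dict(zip(FIELD_NAMES, (part.strip() for part in parts)))


def count_errors_by_node(lines, needle="ConcurrentModificationException"):
    """Count entries whose message mentions `needle`, keyed by (node name, node ID)."""
    counts = Counter()
    for line in lines:
        entry = parse_log_line(line)
        if entry and needle in entry["message"]:
            counts[(entry["node"], entry["node_id"])] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the log excerpt above.
    sample = [
        '2016-03-23 20:28:47,918 : ERROR : KNIME-Worker-25 : LocalNodeExecutionJob : Row Splitter : 0:1384:1098:1176:724 : Caught "ConcurrentModificationException": null',
        '2016-03-23 21:58:20,987 : ERROR : KNIME-Worker-31 : LocalNodeExecutionJob : Row Splitter : 0:1384:1115:724 : Caught "ConcurrentModificationException": null',
        '2016-03-23 22:33:34,716 : ERROR : KNIME-Worker-27 : LocalNodeExecutionJob : End IF : 0:1384:1115:1110 : Caught "ConcurrentModificationException": null',
    ]
    for (node, node_id), hits in count_errors_by_node(sample).most_common():
        print(node, node_id, hits)
```

Keying on the node ID as well as the node name keeps the two Row Splitter instances (0:1384:1098:1176:724 and 0:1384:1115:724) apart, since they sit in different parts of the workflow.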

I've attached some images from the final state of my workflow. You can see that the Recursive Loop was 1996 iterations in when the ConcurrentModificationException was caught in the Reset MetaNode on the left. That MetaNode appears to be still running even though all of its lights are green. In fact, all the lights are green upstream of the unresponsive Loop End, suggesting that nothing is preventing its progress. Digging into the Reset MetaNode, you can see where the ConcurrentModificationException was caught, but that apparently didn't prevent the End IF node from completing its job (green light).

Several days ago I alerted the Palladian Node folks about another ConcurrentModificationException error, but I think this new issue is a KNIME issue and not a Palladian issue: as you can see, the IF Switch here bypasses the Palladian HttpRetriever node (I've also upgraded my Palladian Nodes since then). For background, you can read about that earlier issue here:

https://tech.knime.org/forum/palladian/handling-httpretriever-concurrentmodificationexception-error

I upgraded from KNIME 2.12.1 to KNIME 3.1.1 a week ago. After spotting this Loop End issue two days ago, I moved the workflow to a much more powerful machine with 16 GB of RAM. While I don't remember seeing it earlier, it's possible that the issue was already there in KNIME 2.12.1 - it's only now that things are becoming really critical for us that I'm trying to have KNIME running 24 hours a day.

Note that I don't use OpenMS and nothing else I've found on the ConcurrentModificationException seems helpful.

PS If you have any tips on restarting the workflow from this unresponsive state then please also let me know - at least that way I wouldn't lose 4 hours of crawling data by starting again from the beginning.

Edlueze · March 24, 2016, 11:22am

An update - and some very bad news for me! I moved my workflow back to KNIME 2.12.1 and had it running on a different machine hoping that I could at least get some data. But I hit exactly the same problem at exactly the same point in my workflow. This time the crawling stopped after a bit more than 2000 iterations but that's still less than halfway.

The log file shows exactly the same as before:

ERROR End IF               0:1384:1115:1110 Caught "ConcurrentModificationException": null
ERROR Row Filter           0:1384:1098:1176:735 Configure failed (IllegalArgumentException): RowNumberFilter: range start is less than 0.
ERROR Row Filter           0:1384:1098:1176:735 Configure failed (IllegalArgumentException): RowNumberFilter: range start is less than 0.
ERROR Row Filter           0:1384:1098:1176:735 Configure failed (IllegalArgumentException): RowNumberFilter: range start is less than 0.
ERROR Row Filter           0:1384:1098:1176:735 Configure failed (IllegalArgumentException): RowNumberFilter: range start is less than 0.

This time I kept in the nuisance errors (the last four lines of the log file), but I believe these are irrelevant to the problem. They are caused by a Row Filter that gets a table of data before it gets the FlowVariable telling it which rows to filter - a natural condition that shouldn't trigger an error.
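When sharing a log like this, the known nuisance lines can be stripped automatically so that only the interesting errors remain. This is a rough sketch, assuming the console-style layout seen in the excerpt above (level, node name, colon-separated node ID, message). The `NUISANCE` list and the function names are hypothetical and only cover the one Row Filter message mentioned here.

```python
import re

# Console-style layout inferred from the excerpt above (an assumption):
# LEVEL, node name, colon-separated node ID, message.
CONSOLE_LINE = re.compile(
    r"^(?P<level>[A-Z]+)\s+(?P<node>.+?)\s+(?P<node_id>\d+(?::\d+)+)\s+(?P<message>.*)$"
)

# Hypothetical ignore list: (node name, message fragment) pairs treated as nuisance.
NUISANCE = [("Row Filter", "RowNumberFilter: range start is less than 0")]


def is_nuisance(line):
    """Return True if the line matches one of the known nuisance patterns."""
    match = CONSOLE_LINE.match(line.strip())
    if not match:
        return False
    return any(
        match.group("node") == node and fragment in match.group("message")
        for node, fragment in NUISANCE
    )


def drop_nuisance(lines):
    """Keep every line that is not a known nuisance error, preserving order."""
    return [line for line in lines if not is_nuisance(line)]
```

Lines that do not fit the assumed layout are kept rather than dropped, so nothing unexpected is silently hidden.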

The lights on the workflow are also exactly as before - with the right MetaNode ticked finished and the left MetaNode showing green lights but a running arrow.

I've saved the log file (ERROR level only) and I've taken SVG snapshots of everything as it was in the final state if anybody wishes to take a look - but all I can see are green lights and inactive branches.

The last trick up my sleeve is to replace the End IF node with a Case Switch Data (End) node. But if that doesn't work then I am finished.

Really hoping somebody has a solution to this problem!

Edlueze · March 25, 2016, 12:57am

Another update - and finally some better news. Replacing the End IF node with the Case Switch Data (End) node got me past step 2 of my crawling (12 hours later my crawler is still working on steps 3 and 4). I'm frankly surprised, as I would have thought the two nodes were built on a very similar codebase. It's also not certain whether this is a real fix or just a coincidence - or even whether the End IF node was really part of the problem.

But I would encourage the core-KNIME folks to take note of this issue as it touches on several of the fundamental principles of KNIME:

(a) KNIME aims to offer a robust multicore platform

(b) KNIME aims to provide flexible workflow control

(c) KNIME should be aiming for non-stop operation and "three nines" availability (if not five nines)

I'll continue to keep you updated as this issue evolves.

wiswedel · May 3, 2016, 10:13pm

Hi,

Sorry for the delayed response. I'm glad you could sort it out. Can you reproduce the problem and then attach the bottom part of your knime.log file? If that's really a node issue then it should be easy to fix ... we just need to know where the problem is.

- Bernd
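
For anyone who needs to grab "the bottom part" of a large knime.log for a report like this, a small script can pull out the last lines without opening the whole file in an editor. This is a minimal sketch; the function name, the default of 200 lines, and the file paths in the usage note are assumptions, so substitute the actual location of your log file.

```python
from collections import deque


def tail_lines(path, n=200, encoding="utf-8"):
    """Return the last `n` lines of the text file at `path`.

    The file is streamed line by line and only the most recent `n` lines
    are kept, so a large log is never held in memory all at once.
    """
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        return list(deque(handle, maxlen=n))
```

A possible use, with hypothetical file names: `open("knime-tail.log", "w", encoding="utf-8").writelines(tail_lines("knime.log", 500))` writes the last 500 lines to a smaller file that can be attached to a post.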