How do I resolve a Recursive Loop End (2-port) that intermittently dies an unresponsive death and kills my workflow?
I have a very large workflow for crawling website data that currently takes between 6 hours and 12 hours to run. Anything that stops the workflow quickly becomes a critical issue. I am now faced with a Recursive Loop End (2-port) node that can stop at any time - last night it stopped about 4 hours into the crawl.
My knime.log file (log level = ERROR) looks like this:
2016-03-23 20:28:47,918 : ERROR : KNIME-Worker-25 : LocalNodeExecutionJob : Row Splitter : 0:1384:1098:1176:724 : Caught "ConcurrentModificationException": null
2016-03-23 21:02:00,718 : ERROR : KNIME-Worker-27 : LocalNodeExecutionJob : Row Splitter : 0:1384:1098:1176:724 : Caught "ConcurrentModificationException": null
2016-03-23 21:38:49,514 : ERROR : KNIME-Worker-33 : LocalNodeExecutionJob : Row Splitter : 0:1384:1098:1176:724 : Caught "ConcurrentModificationException": null
2016-03-23 21:52:04,064 : ERROR : KNIME-Worker-25 : LocalNodeExecutionJob : Row Splitter : 0:1384:1098:1176:724 : Caught "ConcurrentModificationException": null
2016-03-23 21:58:20,987 : ERROR : KNIME-Worker-31 : LocalNodeExecutionJob : Row Splitter : 0:1384:1115:724 : Caught "ConcurrentModificationException": null
2016-03-23 22:01:16,792 : ERROR : KNIME-Worker-28 : LocalNodeExecutionJob : Row Splitter : 0:1384:1098:1176:724 : Caught "ConcurrentModificationException": null
2016-03-23 22:18:48,666 : ERROR : KNIME-Worker-29 : LocalNodeExecutionJob : Row Splitter : 0:1384:1098:1176:724 : Caught "ConcurrentModificationException": null
2016-03-23 22:33:34,716 : ERROR : KNIME-Worker-27 : LocalNodeExecutionJob : End IF : 0:1384:1115:1110 : Caught "ConcurrentModificationException": null
MY WORKFLOW STOPPED : NODE UNRESPONSIVE : Recursive Loop End (2-ports)
Note that I removed some nuisance errors from the middle of the log. The last ConcurrentModificationException, from the End IF node, seems to have occurred immediately before the Recursive Loop End node became unresponsive.
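For anyone less familiar with Java, here is a minimal sketch of what this exception usually means: a collection was structurally changed while something was still iterating over it. This is plain Java, not KNIME code. The class name CmeDemo and the method provoke are hypothetical, and it reproduces the exception in a single thread, whereas in my workflow I can only assume that two worker threads are touching the same structure. It also shows why the log entry ends in "null": the exception raised by the list iterator carries no message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical demo class; not part of KNIME or Palladian.
public class CmeDemo {

    // Removes an element while a for-each loop is iterating over the same list
    // and returns a log-style description of the exception that results.
    static String provoke() {
        List<String> rows = new ArrayList<>(Arrays.asList("a", "b", "c"));
        try {
            for (String row : rows) {
                if (row.equals("a")) {
                    rows.remove(row); // structural change during iteration
                }
            }
        } catch (ConcurrentModificationException e) {
            // The iterator throws the exception without a message,
            // so getMessage() returns null.
            return "Caught \"" + e.getClass().getSimpleName() + "\": " + e.getMessage();
        }
        return "no exception";
    }

    public static void main(String[] args) {
        System.out.println(provoke());
    }
}
```

If the same thing happens inside a node, the exception by itself says nothing about which collection was involved, which is why the log entry alone has not helped me narrow it down.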
I've attached some images from the final state of my workflow. You can see that the Recursive Loop was 1996 iterations in when the ConcurrentModificationException was caught in the Reset MetaNode on the left. As a result, the metanode appears to be still running even though all of its lights are green. In fact, all the lights are green upstream of the unresponsive Loop End, which suggests that nothing is preventing its progress. Digging into the Reset MetaNode, you can see where the ConcurrentModificationException was caught, but that apparently didn't prevent the End IF node from completing its job (green light).
Several days ago I alerted the Palladian Node folks about another ConcurrentModificationException error, but I think this new issue is a KNIME issue and not a Palladian issue: as you can see, the IF Switch here bypasses the Palladian HttpRetriever node (I have also upgraded my Palladian Nodes since then). For background, you can read about that earlier issue here:
https://tech.knime.org/forum/palladian/handling-httpretriever-concurrentmodificationexception-error
I upgraded from KNIME 2.12.1 to KNIME 3.1.1 a week ago. After spotting this Loop End issue two days ago, I moved the workflow to a much more powerful machine with 16 GB of RAM. While I don't remember seeing it earlier, it's possible that the issue was already there in KNIME 2.12.1 - it's only now, when things are becoming really critical for us, that I'm trying to have KNIME running 24 hours a day.
Note that I don't use OpenMS and nothing else I've found on the ConcurrentModificationException seems helpful.
PS If you have any tips on restarting the workflow from this unresponsive state then please also let me know - at least that way I wouldn't lose 4 hours of crawling data by starting again from the beginning.