I have a moderately large data set of >12 million rows and two columns. The cells can contain fairly long strings, ranging from a few hundred to several thousand characters in length. I used a simple Rule Engine to identify and overwrite changed values:
NOT MISSING $column2$ => $column2$
TRUE => $column1$
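For readers less familiar with the Rule Engine syntax, the two rules above amount to a per-row "take column2 if present, otherwise column1". A minimal Python sketch of that logic, assuming a missing cell is represented as None (the function name coalesce and the sample rows are made up for illustration and are not part of KNIME):

```python
from typing import Optional


def coalesce(column1: Optional[str], column2: Optional[str]) -> Optional[str]:
    """Mimic the two rules, evaluated top to bottom; first match wins."""
    # Rule 1: NOT MISSING $column2$ => $column2$
    if column2 is not None:
        return column2
    # Rule 2: TRUE => $column1$ (fallback for every remaining row)
    return column1


# Illustrative rows as (column1, column2); None stands for a missing cell.
rows = [("old A", None), ("old B", "new B"), (None, None)]
result = [coalesce(c1, c2) for c1, c2 in rows]
```

The work per row is trivial, so the cost on 12 million rows of long strings is more likely dominated by reading and writing the table than by the rule itself.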
Unfortunately, at around 90 % it seems to hang, even though two CPU cores are constantly at 100 %. Trying to cancel after more than 30 minutes was to no avail. I saved the workflow by accident. Note that I have auto-save turned off precisely because it interferes with workflow execution, in particular for large data sets.
Hi @mwiegand, it seems like this does not just happen with the Rule Engine, but with just about any node once it reaches that critical state. At that point, Knime seems to be stuck on the task without giving you the chance to cancel. I mean, you can click on Cancel, but it feels like Knime is trying to complete the task first before cancelling.
When dealing with a huge dataset, I make sure I save my workflow before running it. If I am going node by node, I save the workflow after each node has executed, and when this situation arises, I just kill Knime and restart. Since I saved the workflow after the latest successful run, I’m able to resume the workflow with the data that was already computed up to that point.
Not an ideal situation, but that’s been my workaround. Note though, there’s a risk of corrupting your Knime workspace by killing Knime, but I know how to “repair” my workspace. But so far, I’ve not had to kill too often.
I would definitely be interested in what others have to say about this situation.
Side note: Not that it will make any difference, and it’s not fixing your issue, you can write your rules like this:
Thanks for your feedback @bruno29a. I have altered the process, following the divide & conquer principle, to separate the rows which changed via:
Reference row filter
Deleted the apparently deprecated column
Used cell replace to insert the new data
Concatenated with formerly split rows
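The four steps above can be sketched outside KNIME roughly as follows. This is only an illustration in plain Python, not the actual nodes: it assumes a row counts as "changed" when column2 is not missing, and the id field and the function name split_replace_concat are hypothetical.

```python
def split_replace_concat(rows):
    """Split off changed rows, swap in the new value, then re-join."""
    # Step 1: separate changed rows from unchanged ones
    # (assumption: "changed" means column2 is not missing).
    changed = [r for r in rows if r["column2"] is not None]
    unchanged = [r for r in rows if r["column2"] is None]
    # Steps 2 and 3: drop the old value and insert the new one.
    replaced = [{"id": r["id"], "column1": r["column2"]} for r in changed]
    kept = [{"id": r["id"], "column1": r["column1"]} for r in unchanged]
    # Step 4: concatenate both parts again.
    return replaced + kept


rows = [
    {"id": 1, "column1": "a", "column2": None},
    {"id": 2, "column1": "b", "column2": "B"},
    {"id": 3, "column1": "c", "column2": None},
]
merged = split_replace_concat(rows)
# Concatenation changes the row order; sort by id if the order matters.
restored = sorted(merged, key=lambda r: r["id"])
```

Only the changed rows are rewritten here, which is the point of the split: the expensive replacement touches a subset instead of all 12 million rows.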
However, as you pointed out, I’d be interested in the management of critical states as well. With auto-save on in particular, this troubled me for a long time until I turned it off.
@mwiegand if the data is huge, KNIME will need all the resources it can get.
In your case you might want to check whether there is an option to give KNIME more RAM, switch to the new columnar table structure, or
split your task in chunks
run the rule engine only on this specific column using a cache node before
see if streaming can help (not doing all rows at a time)
splitting your workflow into a main and sub part where the large rule task is mainly just the large rule node
Or use a combination of the above. You might also check these hints. As for backups: problems can occur at any time, so it is always advisable to save your work.
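To illustrate the chunking and streaming idea in general terms: instead of holding all rows in memory at once, the rule is applied to one fixed-size batch at a time. The following is a minimal plain-Python sketch, not KNIME's streaming implementation; the helper names chunks and process_in_chunks and the chunk size are assumptions chosen for the example.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def process_in_chunks(rows, size):
    """Apply the column2-over-column1 rule one batch at a time."""
    for batch in chunks(rows, size):
        # Only `size` rows are materialised at any moment.
        for column1, column2 in batch:
            yield column2 if column2 is not None else column1


# Illustrative (column1, column2) rows; None stands for a missing cell.
rows = [("a", None), ("b", "B"), ("c", None)]
out = list(process_in_chunks(rows, 2))
```

Because both functions are generators, the input can itself be a lazy source, such as a file read line by line, so peak memory depends on the chunk size rather than on the total row count.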
Thanks for the suggestions @mlauber71. The primary challenge seems to be the handling of critical states. I assume, based on various scenarios, that Knime, during the save routine, requires more resources than are available.
As the save operation cannot be aborted, would it be an option to check if:
Prior to triggering the save operation, enough resources are available
It’s possible to omit triggering a save if data is still being processed
If data is still being processed, only the configuration can be saved, skipping the data
Saving can be omitted for nodes which are still being executed, which might also help to prevent these nodes from becoming corrupt, which in turn compromises the entire workflow
Save a copy w/o data during save routine and only merge upon close to prevent the “original” workflow getting compromised?