My scenario is this: I have a table with about 100,000 rows and several columns, and I'm trying to apply a Column Aggregator to pairs of columns, which are specified in a second table. My first try was with a Table Row To Variable Loop Start and a Loop End (Column Append). That does work, but even in the most optimized version it uses two passes over the rows per aggregation (one for the aggregation, one for appending). My idea was to reduce that to one pass over the rows per aggregation by using a recursive loop with the loop end set to Collect data from last iteration only.
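For context, here is a rough sketch of what the workflow computes; the column names and the element-wise sum aggregation are placeholder assumptions, and the real workflow may use other Column Aggregator methods:

```python
# Minimal sketch of the pairwise aggregation the workflow performs.
# The pair list and the aggregation method (element-wise sum) are
# illustrative assumptions, not taken from the actual workflow.

def aggregate_pairs(table, pairs):
    """table: dict mapping column name -> list of values (equal length).
    pairs: list of (col_a, col_b) tuples from the second table.
    Returns one aggregated column per pair, computed in a single
    pass over the rows per pair."""
    result = {}
    for col_a, col_b in pairs:
        result[f"{col_a}+{col_b}"] = [
            a + b for a, b in zip(table[col_a], table[col_b])
        ]
    return result

data = {"x": [1, 2, 3], "y": [10, 20, 30], "z": [100, 200, 300]}
print(aggregate_pairs(data, [("x", "y"), ("y", "z")]))
# {'x+y': [11, 22, 33], 'y+z': [110, 220, 330]}
```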
But now after each iteration the Recursive Loop End takes a long break with the progress indicator sitting at 0%. I have no idea what it is doing, so I have no idea if I can optimize that, or if it is a bug. Does anyone have an idea what could cause this?
I don't think you need the recursive loop for this, because you don't need information from previous iterations. However, can you try activating Collect data from last iteration only in the loop end? This should speed up your results.
Can you post or send your initial workflow? Then I will take a look if we can optimize it.
Thanks for taking a look at it. I had already activated the option you mentioned, but I still have this waiting period. As my "real" workflow is a bit complex, I'm attaching a reduced version showing the same behaviour. It's true that there might be ways around the loops, but my main concern here is to understand why that waiting period appears at all, so I can take that aspect into account in my planning.
I will add a progress indicator to the recursive loop with the next release; you will then be able to see what the nodes are working on by mousing over them. It won't help you speed things up, though, because they need to copy the complete table in every iteration of the loop.
I see how copying the table would slow down the loop, but why is that necessary if Collect data from last iteration only is selected? From the outside it looks like a simple swap of pointers should suffice. Even when collecting all results, why not reuse the table that is already present? I understand that loops in general are among the slowest ways to accomplish results because of the large overhead they require, and I suppose there's no good way around that other than avoiding them where possible. But (again, from the outside) this particular instance looks like an especially inefficient design decision.
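To make the pointer-swap idea concrete, here is a toy sketch; none of this is KNIME API, and the class and method names are made up. In "last iteration only" mode the loop end could, in principle, just replace its reference to the previous result with a reference to the new one instead of copying rows (the catch, as explained in a later reply, is that KNIME resets the loop body between iterations, which discards the data the reference points at):

```python
# Toy illustration of copying the table in the loop end versus
# swapping a reference. Names are hypothetical and do not correspond
# to the KNIME implementation.

class LoopEnd:
    def __init__(self):
        self.result = None

    def finish_iteration_by_copy(self, table):
        # O(n): materialise a full copy of the iteration's output
        self.result = list(table)

    def finish_iteration_by_swap(self, table):
        # O(1): keep only a reference; this is valid only if the
        # loop body's output table is not discarded afterwards
        self.result = table

end = LoopEnd()
big_table = list(range(1_000_000))

end.finish_iteration_by_swap(big_table)
assert end.result is big_table                 # same object, no copy

end.finish_iteration_by_copy(big_table)
assert end.result == big_table
assert end.result is not big_table             # independent copy
```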
Mh... I just wanted to take a look at the code to try to understand the underlying problems. And by "look at" I mean "fiddle with", as I could coincidentally use a three-port version right now. Sounds easy enough, but I can only look at the code, not touch it, because of a "restriction on required library". That might be a real restriction (why?) or some problem with the order of the classpaths or something along those lines (any idea?)... Ahh, dependencies, the crowning achievement of software development.
Let me follow up on the two questions I can extract from your comments:
- Loop ends copying data: That is needed because KNIME resets the nodes in the loop body between iterations and hence also discards their data. So in order to retain that data across loop iterations, it is copied in the loop end ... which takes time and which is currently badly reported in the progress bar (a bug is open and will be fixed in the next feature release). Note that for some of the large items (e.g. text documents or images) we use "FileStores", which are similar to BLOBs in the DB world. They are stored only once, and we also have some more or less smart logic in place that avoids copying them in loop end nodes.
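The "store once, reference many times" idea behind FileStores can be sketched like this; it is a simplified illustration of the concept, not the actual KNIME FileStore API:

```python
# Simplified sketch of the FileStore/BLOB idea: large payloads are
# written once and rows carry only a small handle, so copying a table
# in the loop end copies handles, not the payloads themselves.

class BlobStore:
    def __init__(self):
        self._blobs = {}
        self._next_id = 0

    def put(self, payload):
        handle = self._next_id
        self._next_id += 1
        self._blobs[handle] = payload   # stored exactly once
        return handle

    def get(self, handle):
        return self._blobs[handle]

store = BlobStore()
h = store.put("a very large document " * 1000)
table = [{"doc": h, "score": 0.9}]

# "Copying" the table duplicates only the cheap handles:
copy = [dict(row) for row in table]
assert copy[0]["doc"] == h
assert store.get(copy[0]["doc"]) is store.get(h)   # payload is shared
```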
- Finding the source code: In your Eclipse SDK, create a plug-in, declare a dependency on org.knime.base, and then locate org.knime.base.node.meta.looper.recursive.RecursiveLoopEnd2NodeModel or whatever you find interesting.
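For reference, the dependency declaration in the plug-in's MANIFEST.MF would look roughly like this (the version range is a placeholder; use whatever matches your KNIME installation):

```
Require-Bundle: org.knime.base;bundle-version="[4.0.0,5.0.0)"
```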
Hope that helps,
Thank you for your detailed answer. I tried again with the information from your second answer, but I don't seem to be allowed to work with the code, only read it. Maybe it's for the best.
Now to the other part, your answer #1. I can see the reason why that might have to be this way inside the model. Which logically brings me to the next step of questioning the model. (After all, there is still no real underlying reason for copying data outside the model.)
Here's a basic idea of what might be done: A node seems to have one of six basic states now: reset, configured, scheduled, executing, executed, and error. All except for error are very linear. Which is nice until you have non-linear parts. Normally, you can think of execution as a set of "lines" of progress that move from left to right through the workflow, one for every possible state. In front of the lines, there's a dead zone. In loops, they all jump backwards, killing everything in front of them. But only some need to jump, and sometimes things in front of you are good to keep. So maybe adding a seventh, non-linear state could help not only here, but in a lot of other places. What I'm thinking of is a "stale" state, which is kind of a reset, but where data is kept available. Another way of thinking about it is as a "lazy reset". Now you have another line of "freshness", one that has no kill zone.
- Data is available for recursion
- Data is available to views as long as possible, so complex progress in a loop can be visualised without (much) flickering (the rest could be optimized further)
- Stale data can be saved, and thus partly executed loops can be restarted after loading, without resetting the whole loop
- The sources of errors in loops are easier to find, as the complete data around the point of error is available instead of only the data from the first part of the loop
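The proposal above can be sketched as a small state machine. The states and transitions here are my reading of KNIME's node life cycle, not its actual implementation, and STALE is the hypothetical addition:

```python
# Sketch of the proposed node life cycle with an extra STALE state.
# A "lazy reset" moves an executed node to STALE: it acts like RESET
# for the execution state machine, but its data stays available.

from enum import Enum, auto

class NodeState(Enum):
    RESET = auto()
    CONFIGURED = auto()
    SCHEDULED = auto()
    EXECUTING = auto()
    EXECUTED = auto()
    ERROR = auto()
    STALE = auto()      # hypothetical: reset-like, data kept

class Node:
    def __init__(self):
        self.state = NodeState.RESET
        self.data = None

    def execute(self, data):
        self.state = NodeState.EXECUTED
        self.data = data

    def reset(self):
        self.state = NodeState.RESET
        self.data = None            # hard reset discards data

    def lazy_reset(self):
        if self.state is NodeState.EXECUTED:
            self.state = NodeState.STALE
            # data deliberately kept: recursion, views, restarts

n = Node()
n.execute([1, 2, 3])
n.lazy_reset()
assert n.state is NodeState.STALE and n.data == [1, 2, 3]
n.reset()
assert n.state is NodeState.RESET and n.data is None
```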
Even outside of loops, stale data could help. Think of an accidental disconnect. Currently, even with undo, the node has to be re-executed. With stale data, it doesn't. Also, saving important data after a point of error is still possible. So maybe it's not a good idea to make this state obvious to the user, so you don't get a bunch of complaints from users who don't "get" it, but it might be very useful under the hood.
Does that sound reasonable?
Thinking about it, this wouldn't be just one more state. It would be a decoupling of the data state machine from the execution state machine, which are currently more intertwined than necessary. So uncoupling them might be a step towards further separating data flow and control flow. Which would be a very good thing in my eyes, because things like configurable inactive ports or streamable functions are way harder to implement with both flows coupled.
The most important question is of course if the NodeModel interface would have to be changed. I don't think so. It should be possible to handle the data in the background, with the model just used for the state transitions. Custom state in the models is something to keep in mind, though.
(On a related note, I do think the NodeModel has too many responsibilities as it is, and should be split, but that's something that each developer can handle easily on his/her own without interfering with the rest of the system.)
Although this thread is already a bit older, I'd agree with Marlin.
Specifically also on the Recursive Loop, an extremely powerful concept, and how it performs when you use more than very small datasets. In my case, I usually run tables of between 500,000 and 1-2 million rows (~50 columns). And in such cases, the Recursive Loop (even in 'keep last iteration only' mode) is unusable, as copying the data takes enormous amounts of time (e.g. up to 3 hours for 30 iterations of just renaming columns on one of the above datasets).
Is that 'avoid copying' (e.g. as Marvin suggests, via additional states) something you would consider as a future feature?