don't understand flow variables

I don't think I quite understand how to use flow variables, and I cannot find any documentation on the topic. Even the sources do not reveal much... Could you please enlighten me a bit, or show me where to find the information?

Here is what I would eventually like to do: I am dealing with very large tables (>10,000,000 rows) that I could potentially split by a categorical value. I know that these categorical values are available from the "Specs" tab in one of the output tables. I would like to use them in an iteration procedure: first filter for each value, then do some calculations on the individual categories, and then combine the results back together... I thought that flow variables, in conjunction with some looping mechanism, would be the key here. How would I do this?

Thanks, Bernd das Brot

The documentation on flow variables is rather poor, yes. I guess the best approach is to start playing with nodes such as “TableRow to Variable”, “Variable to TableRow”, “Variable to Column”, “Extract Variables”, “Inject Variables”, etc.

Attached is a very small example workflow that splits the input table by categories and then feeds this into a loop node. Maybe that’s a good starting problem for you to work on.
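For readers without access to the attachment, the general idea (split by category, loop, combine) can be sketched in plain Python. This is only a conceptual analogy and not KNIME code: the function `process_by_category`, the column names, and the sample rows are all made up for illustration. The loop variable `category` plays roughly the role that a flow variable plays inside a KNIME loop.

```python
def process_by_category(rows, category_key, summarize):
    """Split rows by a categorical value, summarize each group, combine the results."""
    # Step 1: collect the distinct category values to iterate over.
    categories = sorted({row[category_key] for row in rows})

    combined = []
    for category in categories:  # one loop iteration per category
        # Step 2: filter the table down to the current category.
        subset = [row for row in rows if row[category_key] == category]
        # Step 3: run the per-category calculation.
        result = summarize(subset)
        # Step 4: tag the result with its category and add it to the combined output.
        combined.append({category_key: category, **result})
    return combined


def mean_value(subset):
    """Example calculation (hypothetical): mean of the 'value' column."""
    return {"mean": sum(r["value"] for r in subset) / len(subset)}


# Made-up sample table.
rows = [
    {"group": "a", "value": 1.0},
    {"group": "b", "value": 4.0},
    {"group": "a", "value": 3.0},
]

print(process_by_category(rows, "group", mean_value))
# [{'group': 'a', 'mean': 2.0}, {'group': 'b', 'mean': 4.0}]
```

Note that this sketch rescans the whole table once per category, which is simple but can get slow when there are many categories.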

Great!!! Thanks, that looks exactly like what I want to do. I just hope that the GroupBy node will not crash because of memory issues. It happened before, but I guess I am using far fewer groups now, so let’s hope for the best…

Is it easy for us to reproduce that memory problem in the GroupBy? It was designed not to have such memory problems: the default settings sort the input table on disk and then traverse the sorted table once to aggregate by group. What aggregation methods did you use, and on how many columns?

sorry for the late answer…

I believe the problem arises when I have A LOT (>500,000) of items in one group. If I understood correctly, you handle one group at a time to calculate some stats. But this is not needed for all the stats…

I wrote a small variation that takes a presorted table as input and just counts…
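The variation itself is not shown in the thread, so the following is only a guess at the idea, written in plain Python rather than as a KNIME node. It assumes the input is already sorted by the group key and keeps just a running count, so memory use does not depend on how many items a single group holds. The name `count_presorted` and the sample keys are hypothetical.

```python
from itertools import groupby


def count_presorted(keys):
    """Yield (key, count) pairs from an iterable that is already sorted by key."""
    for key, run in groupby(keys):
        # Only the running count is kept; the rows of the group are never stored.
        yield key, sum(1 for _ in run)


# Made-up, presorted group keys.
sorted_keys = ["a", "a", "a", "b", "c", "c"]

print(list(count_presorted(sorted_keys)))
# [('a', 3), ('b', 1), ('c', 2)]
```

If the input is not actually sorted, `groupby` reports the same key more than once, so the presorted assumption matters.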

B

OK, that explains it. We have an open “bug” for the GroupBy node (it always sorts by the group column, although that is sometimes not necessary, e.g. when using min/max as aggregation). I will append your example to the bug report.
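To illustrate why min/max does not need sorted input, here is a minimal plain-Python sketch. It is not how the GroupBy node is implemented or how the bug would be fixed; `min_max_by_group` and the sample rows are made up. One running (min, max) pair per group is enough, so memory grows with the number of groups and not with the number of rows per group.

```python
def min_max_by_group(rows):
    """Return {group: (min, max)} in one pass over unsorted (group, value) pairs."""
    stats = {}
    for group, value in rows:
        if group in stats:
            lo, hi = stats[group]
            # Update the running minimum and maximum for this group.
            stats[group] = (min(lo, value), max(hi, value))
        else:
            # First value seen for this group is both its min and its max.
            stats[group] = (value, value)
    return stats


# Made-up, unsorted input.
unsorted_rows = [("b", 5), ("a", 2), ("b", 1), ("a", 9)]

print(min_max_by_group(unsorted_rows))
# {'b': (1, 5), 'a': (2, 9)}
```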

Come to think of it, one (hopefully) last question: What was the aggregation method that you were using for the 500,000++ items group(s)?

I just tried it again with “first” and “first value” depending on what the type of the column was…
I was trying to use this to get all unique sequences from an NGS experiment…

I'd be interested in seeing the example workflow using flow variables if possible.

Many thanks

Hi,

I was wondering where I should look to find these nodes, and had indeed posted a question...

Actually, just editing the knime.ini did it quite nicely (I read this in another post).

Add the following after the "-server" line of the knime.ini:

"-Dknime.expert.mode=true" (without the quotes), and that's it.

no point asking anymore, I found 'em!
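For anyone who prefers to script the knime.ini edit described above, here is a rough Python sketch. It works on the file content as a string; `add_expert_mode` is a hypothetical helper, and the example content is shortened and made up (a real knime.ini has more lines). Back up the real file before changing it.

```python
def add_expert_mode(ini_text, option="-Dknime.expert.mode=true"):
    """Return ini_text with `option` inserted right after the "-server" line."""
    lines = ini_text.splitlines()
    stripped = [line.strip() for line in lines]
    if option in stripped:
        return ini_text  # already present, nothing to do
    if "-server" not in stripped:
        raise ValueError('no "-server" line found')
    # Insert the option directly below the first "-server" line.
    lines.insert(stripped.index("-server") + 1, option)
    return "\n".join(lines) + "\n"


# Hypothetical, shortened file content for illustration only.
example = "-vmargs\n-server\n-Xmx1024m\n"

print(add_expert_mode(example))
```

Running the helper a second time on its own output leaves the text unchanged, so it is safe to apply more than once.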

B