Looping over increasing partition of input data

muthmann · June 30, 2012, 10:32am

Hi,

I would like to process increasingly large partitions of a data table. I assume this is possible using KNIME's loop functionality. What I want to do is take the first row of the table and run it through a workflow, then take the first and second rows and run them through the workflow, then the first, second and third, and so on.

Is it possible to achieve this task with the current nodes, or do I need to implement my own start node? If so, are there examples on how to implement loop start nodes?

Thanks and regards

richards99 · June 30, 2012, 12:25pm

Quite straightforward, this one. Simply use the Chunk Loop Start node and, at the end of the loop, the Loop End node.
In the Chunk Loop Start node you can define how many rows to process per iteration. By default this is one.

Simon.

muthmann · June 30, 2012, 7:46pm

But this only provides consecutive chunks. What I need are overlapping chunks.

Using the following dataset:

A
B
C
D

I want to get:

Iteration 1:
A

Iteration 2:
A
B

Iteration 3:
A
B
C

Iteration 4:
A
B
C
D
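For illustration only, here is a plain Python sketch (not KNIME code) of the difference between consecutive chunks of one row each and the growing partitions described above. The function names are made up for this example.

```python
def consecutive_chunks(rows, size=1):
    """Yield non-overlapping chunks of `size` rows, one chunk per iteration."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def growing_partitions(rows):
    """Yield the first row, then the first two rows, and so on."""
    for end in range(1, len(rows) + 1):
        yield rows[:end]


rows = ["A", "B", "C", "D"]
chunks = list(consecutive_chunks(rows))
partitions = list(growing_partitions(rows))

print(chunks)      # [['A'], ['B'], ['C'], ['D']]
print(partitions)  # [['A'], ['A', 'B'], ['A', 'B', 'C'], ['A', 'B', 'C', 'D']]
```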

richards99 · June 30, 2012, 8:38pm

Ahh okay.

You can still do this, not as easy though.

Use the Interval Loop Start node, configured to start at 1 and increase in increments of 1, with Integer as the type. After it, add a Row Filter node set to "include by row number"; then, on the Flow Variables tab, set RowRangeEnd to the "loop value" variable. Complete the loop with a Variable Condition Loop End: choose the loop value as the variable and, as the finishing condition, select "=" and the number of the last row.

You can enter that number manually for now. To automate it, use a GroupBy or Statistics node earlier in the workflow to calculate the number of rows, then a TableRow To Variable node to turn the row count into a flow variable that you can feed in here.

Hope this helps

Simon.

muthmann · June 30, 2012, 9:40pm

Hm,

Ok. That sounds nice but currently I am getting the following error using that setup:

"ERROR Variable Condition Loop End Execute failed: No such variable "currentIteration" of type INTEGER"

richards99 · July 1, 2012, 12:18am

Apologies, I was doing this off the top of my head.

I just tried it out: you only need a normal Loop End node, not a Variable Condition Loop End.

The end iteration number is specified in the Interval Loop Start node. So if you have 20 rows, choose 1 as the start, 20 as the finish and 1 as the increment. And remember to choose Integer as the type.
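The logic of this setup can be sketched in plain Python. This is only a rough model of what the nodes do, not KNIME code; `interval_loop` and `filter_by_row_number` are hypothetical helpers, and the 20 row labels are made up.

```python
def interval_loop(start, stop, step):
    """Yield loop values from start to stop, both inclusive."""
    value = start
    while value <= stop:
        yield value
        value += step


def filter_by_row_number(rows, first, last):
    """Keep rows first..last, counted from 1 and inclusive."""
    return rows[first - 1:last]


rows = [f"row{i}" for i in range(1, 21)]  # a table with 20 rows
row_count = len(rows)                     # could be computed instead of typed in

results = []
for loop_value in interval_loop(1, row_count, 1):
    # The loop value plays the role of the upper end of the row range.
    subset = filter_by_row_number(rows, 1, loop_value)
    results.append(subset)

print(len(results))  # 20 iterations
print(results[2])    # ['row1', 'row2', 'row3']
```

Because the interval already stops at the last row number, no extra end condition is needed in this model.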

Hope this helps

Simon.

muthmann · July 1, 2012, 5:18pm

Wow. Thanks for your help. That works like a charm. I also tried the proposal for counting the values within the fields, but neither the Statistics nor the GroupBy node gave me a row count. So I used a Java Snippet to count the rows, then a Column Filter to keep the column with the row counts and a Java Snippet Row Filter to get the last row.

Now everything works nicely. Thanks again for your help.

richards99 · July 1, 2012, 5:23pm

Glad it worked out.
You can do a row count with the GroupBy node by not selecting any group column, then aggregating any column with count as the aggregation type. I am somewhat averse to the snippet nodes, so that's how I do it!
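As a rough analogy in plain Python (not KNIME code): grouping by no columns puts every row into a single group, so a count aggregation returns the total number of rows. The table and the `group_by_count` helper are made up for this sketch, and it assumes there are no missing values.

```python
from collections import Counter


def group_by_count(rows, group_columns):
    """Count rows per group; with no group columns every row shares the empty key."""
    counts = Counter(tuple(row[col] for col in group_columns) for row in rows)
    return dict(counts)


table = [
    {"name": "A", "value": 1},
    {"name": "B", "value": 2},
    {"name": "A", "value": 3},
    {"name": "C", "value": 4},
]

overall = group_by_count(table, [])         # one group holding all rows
per_name = group_by_count(table, ["name"])  # one group per distinct name

print(overall)   # {(): 4}
print(per_name)  # {('A',): 2, ('B',): 1, ('C',): 1}
```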
Simon.

tangerooo · June 11, 2013, 5:48pm

I have an issue that is similar to this forum topic.

I'm trying to create a loop within a loop for partitioning purposes. The outer loop cycles through 100 different random seeds that are passed to the Partitioning node. The inner loop runs x times so that the partitioning of the data is different each time. However, with my current workflow, the partitioned data is always the same and does not vary.

If I remove the outer loop of cycling through random seeds, the inner loop works as intended. I attached a snapshot of my workflow.

knime_partition.png

Aaron_Hart · June 13, 2013, 10:20am

Also, as of 2.7 we have an Extract Table Dimension node!

swebb · June 13, 2013, 2:56pm

Hi tangerooo

Are you setting the seed in the partition node based on the variable from the outer loop?

If so it will always produce the same partition in the inner loop during that outer loop iteration.

For example, outer loop iterates over the following values: 1, 2, 3, 4

The counting loop is set to 10. It will run 10 times for the value 1, 10 times for the value 2 etc.

So you will make a partition 10 times with the seed 1, 10 times with the seed 2, etc. Therefore you will get 4 sets of 10 identical partitions.

What do you need the inner loop for? Or rather, what is it you are trying to achieve? If you just want to investigate the stability of the model under different partitionings, you can use just the counting loop, set no static seed (a new random seed is chosen on each run) and collect the results.
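To make the effect of the seed concrete, here is a small Python sketch (not KNIME code). The `partition` function is a hypothetical stand-in for a random partitioning step; the 20 rows and the 70% split are made-up values.

```python
import random


def partition(rows, fraction, seed=None):
    """Randomly split rows into two parts; seed=None gives a fresh seed per call."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(20))

# Outer loop over seeds, inner counting loop of 10 runs using that seed.
distinct_with_seed = {}
for seed in [1, 2, 3, 4]:
    runs = [tuple(partition(rows, 0.7, seed)[0]) for _ in range(10)]
    distinct_with_seed[seed] = len(set(runs))

# Counting loop only, with no static seed.
runs = [tuple(partition(rows, 0.7)[0]) for _ in range(10)]
distinct_without_seed = len(set(runs))

print(distinct_with_seed)     # each seed gives a single distinct partition
print(distinct_without_seed)  # more than one distinct partition
```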

Regards

Sam

tangerooo · June 13, 2013, 5:45pm

Swebb, I see what you're saying about the outer loop and yes, I was passing the seed into the partition.

I'm trying to measure the stability of a model by looping various partitions of the same seed. And once the best model with that particular seed is found, I'd like to test that model using that seed on new data.