Chunk Loop by Grouped Column Values

Hello,

I am working on a workflow that generates multiple output files for an ETL process.

For my example, the output cannot exceed 5 rows of data per file, but we also want to minimize the number of output files, as the real data I am working with contains thousands of rows. The output cannot split rows on the ID or Record column, or the system will return a duplicate record error and the process will fail.

My workflow previously used a Chunk Loop Start node with my row limit set as the “rows per chunk” value. How could I accomplish the example below in KNIME using a loop?

Example data:

ID Record Dollars
1 A 5
1 A 15
2 AB 15
2 AB 18
3 AC 30
3 AC 15

Example Outputs:
Output 1

ID Record Dollars
1 A 5
1 A 15
2 AB 15
2 AB 18

Output 2

ID Record Dollars
3 AC 30
3 AC 15

I assume that, as you say, no output can be longer than 5 rows, but also that a single output cannot split a set of like records. Can you confirm? Also, is this sample data representative of your larger data set, or can there be more than 2 consecutive rows with the same ID/Record?

3 Likes

Correct, the output has a maximum row limit, and outputs cannot split a set of like records. There can be more than 2 consecutive rows with the same ID/Record, and some ID/Record combinations have only 1 row.

1 Like

Is it possible to have more than 5 like records? If so, how would you handle that?

2 Likes

It is possible to have more than 5 like records in the actual data, but the actual data also has a 5000-row limit. The most I have seen is 15 rows for one record, so there should never be a case where one record has more rows than the limit.

The 5 row limit here was just for example purposes.

So the real limit for a single ID is 5000?

1 Like

What about using the Group Loop Start instead of the Chunk Loop and setting it to Group By = ID?

1 Like

Hi @miguel_silva3, certainly @McReady’s suggestion is the pragmatic solution for your sample data, but if your actual data has a 5000-row limit per iteration while in reality no group exceeds 15 rows, then you understandably may not want each iteration to contain only a minimal number of rows.

Can I clarify something? You said “The output cannot split rows on the ID or Record column”, so it’s not just that it cannot split on the combination, but also that it cannot split on the individual columns:

ID Record Dollars
1 A 5
1 A 15
2 A 15
2 B 18
3 B 30
3 AC 15

None of these rows can be split apart, because each row shares either ID or Record with the next row? Is that right?

1 Like

Hi @takbb, unfortunately Group Loop Start does not work for my workflow, because the real data has 16 columns in total and over 3000 unique ID and Record combinations. Group Loop Start would run, but it would generate several more upload files than our current manual process creates.

To clarify, ID AND Record together are the grouping columns. Using your example data, I would treat the following as groups, where the final output would be 2 files: file 1 (Groups 1-4) and file 2 (Group 5).

Group 1

ID Record Dollars
1 A 5
1 A 15

Group 2

ID Record Dollars
2 A 15

Group 3

ID Record Dollars
2 B 18

Group 4

ID Record Dollars
3 B 30

Group 5

ID Record Dollars
3 AC 15

1 Like

Hi @miguel_silva3, thanks for clarifying the example, and I breathed a sigh of relief at your response. :wink:

I don’t think it’s a trivial problem, in the sense that I cannot immediately think of a way to achieve it using only no-code nodes. Possibly a recursive loop, but that might impact performance across a large number of rows.

I believe I have a solution though involving a small piece of code which I’ve bundled into a component for ease of use.

If you’d like to give this a try, it will hopefully assist with your use case.

You configure it by specifying the grouping columns and the maximum number of rows per chunk. It then generates a “chunked grouping identifier” which you can use as the group identifier in a Group Loop…
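To make that concrete, here is roughly what the appended identifier would look like for your original sample data with a maximum of 5 rows per chunk (the column name here is illustrative, not necessarily the component’s exact output):

ID Record Dollars ChunkGroup
1 A 5 0
1 A 15 0
2 AB 15 0
2 AB 18 0
3 AC 30 1
3 AC 15 1

Grouping on ChunkGroup in the Group Loop Start then yields exactly your two expected output files.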

So the workflow takes this form: the component generates the identifier, and a Group Loop Start then iterates over it.

Feel free to open the component to see how it works. Most of the nodes in there are to make it work dynamically (i.e. to allow you to specify the grouping columns etc.). The heart of it (the piece which actually generates the grouping identifier) is a Java Snippet.

A brief summary of what the component does: it groups the rows to get a count of the rows for each grouping (e.g. ID+Record). Then (and this is the Java Snippet) it starts at the first row and sums the number of rows per grouping. If the cumulative sum to that point exceeds the maximum rows per chunk, it adds 1 to the grouped-chunk id for the current grouping and then resets the cumulative count for the current group.

In this way, it assigns a “chunk identifier” to each group, and this can then subsequently be used in a Group Loop.
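For anyone curious about the logic without opening the component, here is a minimal standalone sketch of the same idea in plain Java. This is not the component’s actual snippet code; the key format (grouping columns joined as e.g. "1|A") and the method name are just for illustration:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChunkedGrouping {

    // Assigns a chunk id to every row such that rows sharing a group key
    // (e.g. "ID|Record") always land in the same chunk and no chunk holds
    // more than maxRowsPerChunk rows. Rows are assumed to be sorted so
    // that equal keys are contiguous, as in the sample data.
    static List<Integer> assignChunks(List<String> groupKeys, int maxRowsPerChunk) {
        // First pass: count rows per group, preserving first-seen order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String key : groupKeys) {
            counts.merge(key, 1, Integer::sum);
        }

        // Second pass: walk the groups, accumulating the cumulative row
        // count; when the next group would push the sum past the limit,
        // increment the chunk id and reset the count for the new chunk.
        Map<String, Integer> chunkOfGroup = new LinkedHashMap<>();
        int chunkId = 0;
        int rowsInChunk = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int groupSize = e.getValue();
            if (groupSize > maxRowsPerChunk) {
                // A single group bigger than the limit can never fit.
                throw new IllegalArgumentException("Group " + e.getKey()
                        + " has " + groupSize + " rows but the maximum chunk size is "
                        + maxRowsPerChunk);
            }
            if (rowsInChunk + groupSize > maxRowsPerChunk) {
                chunkId++;
                rowsInChunk = 0;
            }
            rowsInChunk += groupSize;
            chunkOfGroup.put(e.getKey(), chunkId);
        }

        // Map the per-group chunk id back onto the individual rows.
        List<Integer> result = new ArrayList<>(groupKeys.size());
        for (String key : groupKeys) {
            result.add(chunkOfGroup.get(key));
        }
        return result;
    }
}

Running this over the six sample rows with a limit of 5 assigns chunk 0 to the 1-A and 2-AB rows (4 rows in total) and chunk 1 to the 3-AC rows, which matches the two expected output files.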

Obviously this needs testing, so please give it a go, and if you find any problems with it, let me know.

btw, for the additional sample data that I asked about, it returns a grouping which I believe matches your expectation of the final row being in its own group.

If you set the chunk size too low, the component will fail with an error. As a demo I set the chunk size to 1, and it reports that the group 1-A has 2 rows but the maximum chunk size is 1, which of course means it cannot continue.
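In terms of the sketch above, that failure corresponds to the guard clause tripping. Again, the message wording below belongs to the sketch, not to the component itself:

import java.util.List;

public class ChunkedGroupingDemo {
    public static void main(String[] args) {
        // Group "1|A" has two rows but the limit is 1, so the group can
        // never fit into a chunk and assignChunks throws rather than
        // silently splitting the group across outputs.
        List<String> keys = List.of("1|A", "1|A");
        ChunkedGrouping.assignChunks(keys, 1);
        // -> IllegalArgumentException: Group 1|A has 2 rows but the
        //    maximum chunk size is 1
    }
}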

4 Likes

Yes! Thank you so much. This component worked, letting me use the Group Loop Start node, and all I had to add after that was a Column Filter to get back to my desired output.


2 Likes

Great. You’re welcome @miguel_silva3, and thank you for indicating the solution. Also, thank you for giving clear answers to our questions. It is always so much easier to assist when there is clarity.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.