Chunk Loop - Chunks of varying size

Hi,

I’m trying to process a group of Excel files to produce a single table containing all the relevant data. My issue is that the input files do not always contain the same number of rows.

How do I set up a loop to deal with varying sizes of input? What I’d ideally want is some way of taking the output from the Excel Reader (the combined sheets), searching for a specific row, and splitting the table at that row.

Hi @CMackey, what defines the “varying sizes of input”? A chunk loop is designed to pass you n rows at a time regardless of what they contain.

If you want varying sizes of chunks according to some other criteria, that sounds more like you want to use the Group Loop, based on one or more grouping columns.

If that’s not it, could you please provide an example of what you are trying to do.

Alternatively, is a loop even what you need? If you want to split a table according to a specific criterion, there is the Table Splitter node (assuming KNIME 5.x), which can split a table in two based on a value in a given column.
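To illustrate what that kind of split does outside KNIME, here’s a minimal pandas sketch. The column name `col_a` and the marker value `"SPLIT"` are made up for the example; the idea is just to find the first row matching a criterion and cut the table there:

```python
import pandas as pd

# Dummy data: one column where a specific value marks the split point.
df = pd.DataFrame({"col_a": ["header", "x", "y", "SPLIT", "z"]})

# Position of the first row matching the split criterion.
split_at = df.index[df["col_a"] == "SPLIT"][0]

top = df.iloc[:split_at]      # rows before the matching row
bottom = df.iloc[split_at:]   # the matching row and everything after
```

The Table Splitter node does essentially this in one configuration dialog, producing the two partial tables on its two output ports.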

Hi @takbb - thanks for your response.

So I have a collection of Excel files that can vary from 12 to 18 rows, with the average being 16. These are formatted for visual appearance, as they’re usually intended to be Certificates of Authentication.

What I’m trying to do is extract the relevant data from each, in turn, and present it as a single table - this is just so at the end of a project we don’t have to search through every file to find the right data points.

My current script works using a Chunk Loop and assumes every form will be 16 rows long, but I know it’ll fail if a different size appears. That’s what I’m trying to account for. Perhaps the Group Loop will help, as you suggest, though I don’t understand how to configure it.

Unfortunately I can’t show you an example of the input forms for proprietary reasons.

Group Loop should do it. You just need to create a column that identifies the loop groups beforehand if one doesn’t exist, then point the Group Loop at it. You could build that column with the Expression or Rule Engine node, among others.
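As a sketch of the idea in pandas (the `"Certificate"` marker value and `field` column are hypothetical stand-ins for whatever identifies the start of each form in your data): a running count of the marker rows gives every record a group id, which is exactly the kind of column an Expression or Rule Engine node would compute before the Group Loop.

```python
import pandas as pd

# Dummy stacked table: each certificate starts with a "Certificate" marker row.
df = pd.DataFrame({"field": [
    "Certificate", "a", "b",
    "Certificate", "c",
    "Certificate", "d", "e", "f",
]})

# The group id increments each time a marker row appears,
# so rows between markers share the same id.
df["group_id"] = (df["field"] == "Certificate").cumsum()

# Group Loop equivalent: iterate over one certificate at a time.
groups = [g for _, g in df.groupby("group_id")]
```

Each element of `groups` corresponds to one pass through the loop body, regardless of how many rows that particular form happens to have.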

As a personal general rule I tend to warn against using the Chunk Loop node in data cleanup / pre-processing. It assumes a consistent, rigid data structure, and those structures often change as systems are updated. I prefer a rule-based / Group Loop approach even when a Chunk Loop would suffice for the initial workflow construction. It will be more flexible to data structure changes and quicker to adapt later if issues arise. I typically only use Chunk Loops to control the timing / processing load of tasks in the workflow itself.

Is the group identification in Excel “formatting only”, or can it be identified via logic in the data itself? If it is formatting only, then upload a sample Excel file with dummy data so we can take a look.


After fiddling with it a bit longer I realised I’d overlooked the “Append Path as Column” option in the Excel Reader node. With this added, I now have a consistent value per file that the Group Loop can group on. Thanks for prompting me to think of it!
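For anyone landing here later, the same pattern sketched in pandas: tag each file’s rows with their source path, then group on that path. The file names are placeholders, and the dummy frames stand in for what `pd.read_excel` would return per file:

```python
import pandas as pd

# Placeholder file names standing in for the real certificate files.
paths = ["cert_001.xlsx", "cert_002.xlsx"]

frames = []
for p in paths:
    # In a real run this would be pd.read_excel(p);
    # dummy data keeps the sketch self-contained.
    part = pd.DataFrame({"value": [1, 2, 3]})
    # Mirrors the Excel Reader's "Append Path as Column" option.
    part["source_path"] = p
    frames.append(part)

combined = pd.concat(frames, ignore_index=True)

# Group Loop equivalent: one iteration per source file.
per_file = {path: g for path, g in combined.groupby("source_path")}
```

Because the path is constant within a file and unique across files, the groups line up with the files exactly, however many rows each form has.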


Good call. I use that setting every time I read from Excel as a matter of habit. You never know when it will come in handy during data prep!


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.