Segmenting information into several CSV files while maintaining integrity

Hi everybody,

I have some input data that comes from a Table Creator node and should be grouped by EQP_NAME (equipment name). Each equipment has a different number of lines with different additional information; the number of lines per equipment can be seen in the value counter result depicted below.

image

What I need is a way to persist the equipment information in CSV files without the risk of creating files that split one equipment's information. In other words, each CSV file should contain the complete information for one or more equipments, and I should be able to parameterize the behaviour: for instance, I can specify how many equipments I want in each file, or how many lines each file should have, without the risk of information loss (in the latter case, it should add more lines to complete the equipment information where necessary).

This behaviour is quite similar to the Chunk Loop, except that the Chunk Loop allows the information to be split, since it only considers the number of files or lines and not the type of information.
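To make the line-based variant concrete, here is a minimal Python sketch of the rule described above, not a KNIME implementation. It assumes the rows are already sorted (or at least contiguous) by EQP_NAME; the function name `assign_files_by_lines` and the parameter `target_lines` are hypothetical names chosen for illustration. A file is only closed at an equipment boundary, so a file may end up with more than `target_lines` lines in order to complete the last equipment.

```python
from itertools import groupby


def assign_files_by_lines(rows, target_lines, key="EQP_NAME"):
    """Yield (file_index, row) pairs without ever splitting one equipment.

    Assumption: rows belonging to the same equipment are contiguous.
    A new file is started only between two equipments, and only once the
    current file already holds at least target_lines rows.
    """
    file_index = 0
    lines_in_file = 0
    for _, group in groupby(rows, key=lambda r: r[key]):
        # Decide at the equipment boundary whether the current file is full.
        if lines_in_file >= target_lines:
            file_index += 1
            lines_in_file = 0
        # Every row of this equipment goes to the same file.
        for row in group:
            lines_in_file += 1
            yield file_index, row
```

The resulting file index could then serve as a grouping column, so that each distinct index is written to its own CSV file.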

Here’s the workflow I started to draw:

Segment n groups.knwf (26.1 KB)

I would appreciate it if someone could give me a hand with this. Please let me know if I need to be clearer.

Thanks in advance!

Gilmar

You could use the Group Loop Start node to do things per group. This will provide you with a flow variable containing the current value of your group, which you could use to create a CSV file name.

As a result, your data will be split by the group you have defined.

image

If you need to split the data further into chunks of a specific size (500 lines per CSV, with the rest of each group going into the last file), you might take some inspiration from this workflow:


Hi mlauber71,

This is an option. But in real life I need to deal with a large number of equipments, more than 8k. So instead of dealing with one equipment at a time and persisting its information in a single CSV, I need to be able to choose, for instance, 500 or 800 equipments to be grouped and persisted in a single CSV.

So, if I have 8k equipments as input, I could set the workflow to create, for instance, 10 files with 800 equipments each or 5 files with 1600 equipments each, depending on the parameter used.
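As a rough illustration of what I mean, here is a small Python sketch (not a KNIME node configuration). The function name `assign_files_by_equipment` and the parameter `equipments_per_file` are hypothetical. Each equipment is numbered in order of first appearance, and the file index is that number divided by the number of equipments per file, so all lines of one equipment always land in the same file.

```python
def assign_files_by_equipment(rows, equipments_per_file, key="EQP_NAME"):
    """Yield (file_index, row) pairs, packing whole equipments into files."""
    file_of = {}  # equipment name -> file index, in order of first appearance
    for row in rows:
        name = row[key]
        if name not in file_of:
            # The n-th distinct equipment (0-based) goes to file n // size.
            file_of[name] = len(file_of) // equipments_per_file
        yield file_of[name], row


# Example with made-up data: 8000 equipments with 2 lines each.
rows = [{"EQP_NAME": f"EQP_{n}", "line": i} for n in range(8000) for i in range(2)]
files_800 = {f for f, _ in assign_files_by_equipment(rows, 800)}
files_1600 = {f for f, _ in assign_files_by_equipment(rows, 1600)}
print(len(files_800), len(files_1600))  # 10 files and 5 files respectively
```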

Thanks

This is why I included the second example. There you could define a batch size and would get x chunks of the same size plus one overspill chunk, and you could do this per group. So you could keep your equipment together while splitting it into several parts. You might include a switch so that this only happens once a chunk is too large.
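The batch-plus-overspill idea can be sketched in a few lines of Python; this is only an illustration of the logic, not the linked workflow itself. The function name `chunk_group` and the parameter `batch_size` are hypothetical.

```python
def chunk_group(rows, batch_size):
    """Split one group's rows into full batches plus one smaller overspill.

    Groups that already fit into a single batch are returned in one piece,
    which mirrors the "switch" mentioned above (kept explicit for clarity).
    """
    if len(rows) <= batch_size:
        return [rows]
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


# A group of 1200 lines with a batch size of 500 gives 500 + 500 + 200.
sizes = [len(chunk) for chunk in chunk_group(list(range(1200)), 500)]
print(sizes)  # [500, 500, 200]
```

Applied once per group, this keeps every chunk inside a single group, with the remainder of each group going into its last file.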

It should be very possible to do this in KNIME. It would involve some planning.

As an additional remark: CSV is widely used, but it is not compressed by default and does not preserve your column types. If you can, you might want to think about alternative storage formats like Parquet, which is also available in KNIME.

