Write multiple input CSV files to a single CSV file at the end of a flow


I have a KNIME flow with a loop that reads 100 JSON files (each has between 7’000’000 and 12’000’000 rows) from a folder. At the end of the flow a series of rule-based row splitter nodes split the data into a number of streams based on an ID value in one of the fields.

My problem is that the CSV Writer at the end of the stream is generating a new file for each of the 100 original files, so instead of a final result of a single file for each stream, I end up with 100 files for each stream.

The columns/fields for each stream are identical, so there is no need to worry about header mismatches, short rows or anything like that in the output from a single stream.

Uploading to an online service was considered, but I don’t think any online service would appreciate me dumping 350 GB into their systems on a weekly basis.

I could write the whole lot to a DB with a table for each stream and then fetch each table back to write to a CSV, but that is just a lot of rework, overhead and time.

Is there a way for me to append/merge/concatenate the end result of each stream at the final CSV Writer node into a single CSV file, so that I end up with a single file for each ID?


Maybe a stupid question, but could you just append the data you want to an existing CSV file? The node should offer that possibility.

Other than that, Parquet might be a format to explore, since it compresses data to a certain extent. You could also see whether a local big data environment could be used with some compression; it might be more flexible in accepting partitions.


This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.