Write multiple input CSV files to a single CSV file at the end of a flow

Hi,

I have a KNIME flow with a loop that reads 100 JSON files (each has between 7’000’000 and 12’000’000 rows) from a folder. At the end of the flow, a series of Rule-based Row Splitter nodes splits the data into a number of streams based on an ID value in one of the fields.

My problem is that the CSV Writer at the end of each stream generates a new file for each of the 100 original files, so instead of a final result of a single file per stream, I end up with 100 files for each stream.

The columns/fields for each stream are identical, so there is no need to worry about header mismatches, short rows, or anything like that within the output of a single stream.

Uploading to an online service was considered, but I don’t think any online service would appreciate me dumping 350 GB into their systems on a weekly basis.

I could write the whole lot to a DB with a table for each stream and then fetch each table back to write to a CSV, but that is just a lot of rework, overhead, and time.

Is there a way for me to append/merge/concatenate the end result of each stream at the final CSV Writer node into a single CSV file, so that I end up with one file for each ID?

tC/.

Maybe a stupid question, but could you just append the data you want to an existing CSV file? The CSV Writer node should offer that option.
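Outside KNIME, the same idea can be sketched in plain Python — a minimal example (the file name, columns, and `append_rows` helper are made up for illustration) of appending each loop iteration’s output to one CSV per stream, writing the header only the first time:

```python
import csv
import os

def append_rows(path, rows, header):
    """Append rows to a CSV file; write the header only if the file is new/empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(header)
        writer.writerows(rows)

# Hypothetical usage: one call per loop iteration for one stream's file.
append_rows("stream_42.csv", [[1, "a"], [2, "b"]], ["id", "value"])
append_rows("stream_42.csv", [[3, "c"]], ["id", "value"])
# stream_42.csv now holds one header line followed by three data rows.
```

The key point mirrors the node setting: open in append mode rather than overwrite, and suppress the header on every write after the first so each stream ends up as one well-formed file.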

Other than that, Parquet might be a format worth exploring, since it compresses data to a certain extent. You could also see whether a local big data environment with compression would work — it might be more flexible about accepting partitions.

