How to split data into multiple csv files?

Hi,

I’m new to KNIME and I want to split data into multiple csv files. I hope someone can help.

I have a dataset of 20,000 rows (could be 30,000 next week). I want to split this dataset into multiple csv files of up to 2500 rows. The dataset contains information about multiple customers. It is important that a customer’s data stays together in one csv file. A csv file will contain multiple customers.

Thanks in advance!

Hello @TomUsingKNIME and welcome to the KNIME forum.

There are many post around the forum with this related topic challenge; we recently had the Weekly Challenge (Challenge 2: Split File and Save into Subfolders) ‘Just KNIME It!’ as well; where you can find many different user approaches to this problem. Mine one is a highly recommended one :sunglasses:
https://hub.knime.com/search?tag=%23justknimeit-2

Some other related posts that could be useful

If you can’t get enough insights from it, you can provide some more details of your use case and we can help you to design a custom approach to it.

BR

1 Like

Hi @TomUsingKNIME ,
this should be a solution KNIME_project – KNIME Hub
This workflow saves one file per customer and a log file with customer name, customer file path, row count per each customer. You must

  • replace the test “customer data” table with your own data file
  • replace the path in “output folder path” with a path of your filesystem

Hi,

Thanks for your quick response!

The examples mentioned don’t quite match what I need.
Hopefully this will make it clearer;

I want csv files with about 2,500 rows. Customer data must always remain together in a csv file.

The Chunck Loop Start node would be perfect, if i could select “don’t break bij Customer”.

Regards,
Tom

Hi @TomUsingKNIME , and welcome to the Knime Community.

It seems like you have 2 conditions, and in some cases, they might not be able to live together, meaning that if that happens, one condition would have to trump the other.

The 2 conditions that you have are:

  1. Batch of 2500
  2. Customer’s data stays together in one csv file.

Question:
What happens if one Customer’s data overlaps the 2500 limit? Does the data get split, or does the limit of 2500 expands to include all the data?

And there are different scenarios to this:
i) If customer’s data has more than 2500 rows (already mentioned above)
ii) Customer A’s and B’s data reached 2000 rows. Customer C’s data is supposed to go in next, and Customer C’s data has 1000 rows. Though the 1000 alone is less than 2500, there are only 500 rows if the total for the file is limited to 2500. Shoud the 1000 be split into 500 in this file, and the next 500 in the other?

As you can see, there are scenarios where both rules cannot be applied at the same time.

Hi @bruno29a,

Thanks!

The maximum number of rows per customer in the dataset is 7.

I would prefer to have a maximum of 2,500 rows per csv file. Suppose you now have 2,495 rows in csv1 and the next customer has 7 rows, then they have to go to csv2.

Did I answer your question?

Hi @TomUsingKNIME , yes it does.

I put something together that allows you to determine which Customer data goes in which file. You can then implement the rest to write to the CSV files.

The workflow looks like this (description of the nodes included):

I used a variable to define the max of rows per file, in your case it will be 2500. Since I don’t have a huge sample data, I set mine at 5:
image

I created a sample data of this:
image

And the results at the end is this:
image

As mentioned above, I set my workflow for a max of 5 rows per csv file, so in my sample data, Customer A’s and B’s data will go in the same batch (3A + 2B = 5).

Customer C’s data will go in a batch, and nothing else can go in, since C has 4 rows, and D has 3 rows. Therefore, Customer C’s data will go in a new batch, and Customer D’s data will go in another new batch.

From there, you can use a group loop on Batch, and write to csv. You can use the Batch number to create your csv file name dynamically.

Here’s the initial workflow:
Split data into multiple CSV based on some conditions.knwf (32.0 KB)

Let me know if you have trouble implement the write to csv files. Alternatively, you can also try to write to the csv files inside the current Loop. Let me know if you need help.

EDIT: I got bored and wanted to implement the rest for writing to the csv files :slight_smile:

Updated workflow looks like this:

I also defined a file name template for the split files, that way you can change it easily there:
image

Just keep the ##NUM## which is what will take the numbering of the files.

After running it, it created the 3 csv files:
image

image
image
image

Here’s the updated workflow: Split data into multiple CSV based on some conditions.knwf (36.8 KB)

3 Likes

@TomUsingKNIME ,
I’ve inserted a chunk loop into my workflow KNIME_project – KNIME Hub. I set the chunk value (i.e. max rows per file) = 45 since I don’t have a large customer file (3 customers, with varying number of items). The result is saved in this log file (count=rows per file):

"customer","filename","count"
"A","C:\Users\dustb\Dati\output\test\a_0.csv",45
"A","C:\Users\dustb\Dati\output\test\a_1.csv",15
"B","C:\Users\dustb\Dati\output\test\b_0.csv",45
"B","C:\Users\dustb\Dati\output\test\b_1.csv",45
"B","C:\Users\dustb\Dati\output\test\b_2.csv",33
"C","C:\Users\dustb\Dati\output\test\c_0.csv",45
"C","C:\Users\dustb\Dati\output\test\c_1.csv",45
"C","C:\Users\dustb\Dati\output\test\c_2.csv",9

Is this a correct solution to your problem?

Hi,

I’ve looked into the different workflows, thanks for your good help!

The workflow you created @bruna29a is exactly what I need, thank you so much!

I’m glad you got bored :wink:

3 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.