equally divided file

Hi

My goal is to divide a file's data equally. For instance, in the attached workflow I want to divide it into 3 files under the following conditions:

  1. each file will have an equal share of the rows, in this case grouped by month
  2. I want the files to be as equal as possible
  3. I want to use all the data so that nothing is left over - everything has to be assigned to a worker

image

I have 4 months that I want to divide among the workers so that everyone gets an equal part of September, October, November and December.

equally divided file.knwf (178.3 KB)

Now I am stuck on the dividing part:

image

I have only 3 workers, and in this case it gives me 4 parts.

So the main problem is to divide it equally and into a specified number of files, in this case 3.

Hi @89trunks

I created a solution equally divided file 2.knwf (251.3 KB) that uses a Partitioning node twice, within a GroupLoop (for every month). So every month there are 3 equal (more or less) groups.
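For readers who want to see the logic outside KNIME, the per-month split can be sketched in plain Python. This is only an illustration of the idea, not the attached workflow: the names `split_evenly` and `divide_by_group`, the sample rows, and the rotation of the remainder between parts are all assumptions of this sketch.

```python
from collections import defaultdict

def split_evenly(items, n_parts):
    """Split items into n_parts consecutive chunks whose sizes differ by at most 1."""
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

def divide_by_group(rows, group_key, n_parts):
    """Give every part a near-equal share of each group (e.g. each month)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    parts = [[] for _ in range(n_parts)]
    for g, group_rows in enumerate(groups.values()):
        for i, chunk in enumerate(split_evenly(group_rows, n_parts)):
            # Rotate the target part per group so the larger chunks
            # do not always land on the same worker.
            parts[(i + g) % n_parts].extend(chunk)
    return parts

# Hypothetical data: 4 months with 10 rows each, divided among 3 workers.
rows = [{"month": m, "id": i} for m in ("Sep", "Oct", "Nov", "Dec") for i in range(10)]
parts = divide_by_group(rows, "month", 3)
print([len(p) for p in parts])  # [14, 13, 13]
```

Every row ends up in exactly one part, and each part receives 3 or 4 rows of every month.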

Hope this helps.
Gr. Hans


Hi @89trunks and welcome back to the KNIME community forum,

In addition to the solution by @HansS, here is a workflow in which the number of files can be specified. You can divide your file into any number of files, and the number of rows is almost equal across all of them.

equally_divided.knwf (112.0 KB)

Hint:
If you want to divide the dates so that each file contains them sequentially (e.g. the first 10 days in the first file, the second 10 days in the second file and the third 10 days in the third file), use a Sorter node after the Group Loop Start node. But if you want the dates to be shuffled before division, use a Shuffle node after the Group Loop Start node and a Sorter node before the CSV Writer node.
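The difference between the two orderings can be shown with a small Python sketch. It is not taken from the workflow: `split_evenly`, the day numbers standing in for dates, and the random seed are assumptions made for illustration.

```python
import random

def split_evenly(items, n_parts):
    """Split items into n_parts consecutive chunks whose sizes differ by at most 1."""
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

days = list(range(1, 31))  # hypothetical day-of-month values standing in for dates

# Sort first: each file gets a contiguous block of days.
sequential = split_evenly(sorted(days), 3)

# Shuffle first, then sort inside each file before writing it out.
shuffled = days[:]
random.Random(42).shuffle(shuffled)
mixed = [sorted(chunk) for chunk in split_evenly(shuffled, 3)]

print(sequential[0])  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(mixed[0])       # 10 days drawn from the whole month, in ascending order
```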

:blush:


Nice, I didn’t know that partitioning can be used like that. But what if the data is much bigger and I have to divide it into 400 equal parts? Can I do something like that?

Thanks for the hints :slight_smile:


@89trunks

In that case, try the Auto-Binner node. This example, equally divided file 3.knwf (179.2 KB), creates 20 equal groups per month. Only the last bin (in this case Bin20) is not equal to the other bins.
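The reason binning scales to 400 parts is that each row's bin follows directly from its position, with no loop over the parts. The Python sketch below shows that idea only; it does not reproduce the Auto-Binner node's actual algorithm, and `bin_labels` is a hypothetical helper. Unlike the workflow, where only the last bin differs, this formula spreads the remainder so that bin sizes differ by at most one row.

```python
from collections import Counter

def bin_labels(n_rows, n_bins):
    """Label rows 0..n_rows-1 with Bin1..BinN; bin sizes differ by at most one row."""
    return [f"Bin{i * n_bins // n_rows + 1}" for i in range(n_rows)]

# Hypothetical month with 97 rows, divided into 20 bins.
labels = bin_labels(97, 20)
sizes = Counter(labels)
print(len(sizes), min(sizes.values()), max(sizes.values()))  # 20 4 5
```

The same call with `n_bins=400` works unchanged, provided there are at least 400 rows.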


gr. Hans

I will try this on my database, but I think this is it :slight_smile:

Thanks


Hi, I was hoping you might have an idea or a hint on how to solve another problem.

image

I need to divide the table into equally divided files as before, but the additional condition is that if a part ends on the first, middle or any other row of a run of identical values, it has to take the whole run into that part. In the picture above we have ID numbers and dates, some of which occur more than once. If, for instance, I divided this data into two parts, I would like both such rows to go to the same part. So let’s pretend that the equally divided files have rows 0-14 and 15-30, but row 15 has the same ID and date as row 14, so it should go to the same part.

I do not know if this is a good visualisation of the problem.
My data is big and I would like to split it as equally as possible into X parts, but when the value in the last row of a part occurs again in the next row, I want that row to be placed into the same “bin”.

Data:
image

Dividing into 3 equal parts (21 rows / 3 = 7 rows per part):

part 1
image

part 2
image

part 3
image

but my goal is to have something like:

part 1
image

part 2
image

part 3
image

I do not know if it can be done … I did not find a good solution.
The best solution would be

  1. that it always takes x rows (according to the division), but if the last one occurs again it keeps taking rows up to the last occurrence (it stops when a new value shows up).

  2. the same thing as 1), but it counts the desired number of rows and if, for instance, the 7th is repeated n times, it omits it and takes the next single one. The next part will start with what was omitted and go on from there.

Do you think that something like that can be achieved?

I tried Group Loop, Chunk Loop Start, Partitioning and other nodes with no results.

Hi @89trunks

How about adding a Rank node and ranking your ID? Then do a binning on the ranked values. The only condition to be met is that your table is sorted by ID.
binner
See this example equally_dived_file.knwf (17.9 KB)
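Why ranking keeps duplicates together can be seen in a short Python sketch: rows with the same ID get the same rank, and the bin is computed from the rank alone. This is an illustration rather than the contents of the attached workflow; `bin_by_rank` and the 21 sample IDs are assumptions.

```python
from collections import Counter

def bin_by_rank(keys, n_parts):
    """Dense-rank the keys, then bin the ranks so equal keys share a part."""
    rank = {key: r for r, key in enumerate(sorted(set(keys)))}
    n_distinct = len(rank)
    return [rank[key] * n_parts // n_distinct for key in keys]

# Hypothetical sorted ID column with repeated values (21 rows, 12 distinct IDs).
ids = [1, 1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 9, 10, 11, 11, 12, 12]
parts = bin_by_rank(ids, 3)
print(sorted(Counter(parts).items()))  # [(0, 7), (1, 6), (2, 8)]
```

Note the trade-off: the bins hold an equal number of distinct IDs, so the row counts (7, 6 and 8 here) are only approximately equal. Tuples such as `(id, date)` can be passed as keys to rank on both columns.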

gr. Hans


Hi,

The solution by @HansS works great. I think you should rank on Date and IDs then bin the table.

Or

If I have understood you correctly and it is fine to remove duplicates, then I think there is a straightforward solution for this:

Using a Duplicate Row Filter or a GroupBy on the ID and Date columns makes it possible to remove the duplicates. Then you can use the same approach as before to divide the table.
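As a rough Python equivalent of that idea (the sample rows are made up, and this is not how the KNIME nodes are implemented), duplicates on the (ID, Date) pair are dropped first and the remaining rows are then divided by position:

```python
# Hypothetical (ID, Date) rows; some pairs occur more than once.
rows = [
    (101, "2020-09-01"), (101, "2020-09-01"),
    (102, "2020-09-01"), (102, "2020-09-02"),
    (103, "2020-09-03"), (103, "2020-09-03"),
    (104, "2020-09-04"),
]

# dict.fromkeys keeps the first occurrence of each pair and preserves order.
unique_rows = list(dict.fromkeys(rows))

# Divide the de-duplicated rows into n_parts near-equal parts by position.
n_parts = 2
part_of = [i * n_parts // len(unique_rows) for i in range(len(unique_rows))]
for row, part in zip(unique_rows, part_of):
    print(part, row)
```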

:blush:


Thanks this is it :slight_smile:
armingrudd thank you too :slight_smile:

I was wondering whether it is possible to divide one file into x files where each one has a different, specified number of rows, with a rule similar to width binning.

I was experimenting with Chunk Loop Start with the rows per chunk as a variable, but somehow I can’t get it to work.

image

divide file into files each one with specific number of rows.knwf (45.1 KB)

I know how many files I want
image

Just for the exercise it is 4, each with a specific number of rows.
So the whole data should be split into 4 files where the first has 4 rows, the second 10, the third 5 and the fourth 1. If something is left over, it should either be left alone and not included in any file, or go into one extra file that holds the excess rows.

So my first goal is to divide it according to a given number of rows for each file. (This data is small and I have to work on very big tables, so this is just to illustrate the problem.)

My second goal is to keep the rule that this topic originally started with. I tried Rank and Auto-Binner, but with no result.

If needed I will create a separate topic for this, but it is somehow related to this one.

Thank you and sorry for mixing up so much.

Hi there @89trunks ,

if I got you right, you can avoid loops in this case. By using the Moving Aggregation node you can calculate the cumulative sum, i.e. the row index at which each file ends. Then transform those numbers into variables and feed them into a Column Expressions or Rule Engine node, where you apply your logic with simple ifs. After that, use the Cell Replacer node to add the file name.
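The cumulative-sum idea can be sketched in Python as follows. This is not the workflow from the Hub: `assign_files`, the `file_N` names and the `rest` label are hypothetical, and a binary search over the end indices stands in for the chain of ifs, so the same code would handle a long list of file sizes.

```python
from bisect import bisect_right
from collections import Counter
from itertools import accumulate

def assign_files(n_rows, sizes, leftover="rest"):
    """Map each row index to a file name using the cumulative sum of the sizes."""
    ends = list(accumulate(sizes))  # [4, 14, 19, 20] for sizes 4, 10, 5, 1
    names = [f"file_{k + 1}" for k in range(len(sizes))] + [leftover]
    # bisect_right counts how many files end at or before row i.
    return [names[bisect_right(ends, i)] for i in range(n_rows)]

# 23 hypothetical rows, files of 4, 10, 5 and 1 rows; 3 rows are left over.
labels = assign_files(23, [4, 10, 5, 1])
print(dict(Counter(labels)))
# {'file_1': 4, 'file_2': 10, 'file_3': 5, 'file_4': 1, 'rest': 3}
```

Rows labelled `rest` can either be dropped or written to one extra file, matching the two options described in the question.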

If you really want to use loops then recursive loop after Table Row to Variable Loop Start would probably do the job.

Here is a workflow on KNIME Hub to check it out.

Br,
Ivan


ipazin, you are a genius - this should do the trick, but in my case I have bigger data and have to divide it into almost 400 files with different numbers of rows. I will have to work on the if statements, but I will manage. :slight_smile:

How can I achieve the same result with the loops? Just out of curiosity I was trying it, but it failed. I mean, I did not get the same result as you.

Thanks anyway :slight_smile:

Hi there,

glad it helps. Regarding loops: I tried it, but I'm not really sure how to do it without making it too complex. The loop to use for processing a different number of rows in every iteration is Group Loop Start, but it is not applicable in this case because what you are doing is defining the “groups”. So, forget about loops in this case :smiley:

Br,
Ivan

Once again thanks :slight_smile:

