Splitting out a data set for groups of rows but values have to net to zero

Hi all,

I am hoping for some help, please.

I have a dataset that includes many groups of data (grouped by journal number). Within each journal, there are a differing number of line items but each journal nets to zero in the Amount column.

My issue is that each journal number can only have a maximum of 6 lines. I have used a Java Snippet to flag whether a journal has 7 or more line items (1 = No, 2 = Yes).

However, I am struggling with the next bit… If a journal has more than 6 lines (e.g. journal 3 in the screenshot below), it needs to be split into two separate journals (3 and 3a - illustrated in yellow and orange on the screenshot).

My issue is that the 10 line items need to be split between the two separate journals (not necessarily equally), BUT journals 3 and 3a both still need to balance to zero individually (as per the illustration).

Note- in this example the line items do match up nicely but there will be scenarios where each of the line items has a different value but overall it nets to zero.

Does anyone know if there is a solution to my problem that is achievable within KNIME, please? I am beginning to think this isn’t possible, but I wanted to ask the question before I give up on it.

Thanks
tp

Hi,
maybe you can use a Group Loop as the journal separator, then sort within each group (probably by absolute value to help the first 6 entries net to zero), then use a Row Splitter to split after 6 rows.
Just as a first idea to get you started
br and take care


Hi Daniel, Thank you for your reply. I’ll give this a try!

Taking @Daniel_Weikert’s idea and then instead of a row splitter for the 6 rows, you can put a Chunk Loop with the chunk size set at 6 rows, to pick up journals in batches of 6 (or fewer).

Inside the chunk loop, you can rename the journals using String Manipulation, via my favourite (kind of undocumented) feature of String Manipulation:

```
string(
$${IcurrentIteration}$$==0
?$Journal No$
:join($Journal No$, substr("abcdefghijklmnopqrstuvwxyz", $${IcurrentIteration}$$ - 1, 1))
)
```

This says if the current iteration is zero, simply use the journal number, but otherwise append a letter taken from the string at the position based on the current iteration number…
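The same suffix rule can be sketched outside KNIME in plain Python (a hypothetical `journal_label` helper, purely to illustrate the logic of the expression above):

```python
# Hypothetical sketch of the suffix rule: iteration 0 keeps the journal
# number unchanged; later iterations append "a", "b", "c", ...
# (supports up to 26 extra pages before the alphabet runs out).
def journal_label(journal_no: str, iteration: int) -> str:
    if iteration == 0:
        return journal_no
    return journal_no + "abcdefghijklmnopqrstuvwxyz"[iteration - 1]

print(journal_label("3", 0))  # -> 3
print(journal_label("3", 1))  # -> 3a
print(journal_label("3", 2))  # -> 3b
```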

Journal Division.knwf (29.5 KB)


Hi @takbb,
Thanks very much for your reply and taking the time to create an example workflow.

This works really well when we have an equal and offsetting number of lines; however, I have scenarios where the journal may equal zero in total but have 11 lines - i.e. one line of -1000 may be offset by 2 separate lines of 500 & 500.

The above is why I have been questioning whether what I need to achieve is possible.

I’ve updated the journal 3 data from the previous example to demonstrate.

```
Posting 3

-350.61
47466.42
-57466.38
57766.01
1350.61
-57466.42
57466.38
-57766.01
-350.61
10350.61
-1000
```

When I update the data to include an extra line (but also change around a couple of other figures), Journal 3 provides the following result using your workflow.

And to elaborate a bit further, there is a possibility that any of our journals may have up to 900 lines (I was working to 6 to make it easier to explain, and I would extrapolate any solution up).

In essence, we unfortunately can’t just split at 6 or 900 lines. If a journal has 950 lines, then depending on the value of each individual row we may need to split the two journals as 846 lines/104 lines, or 750 lines/200 lines - all dependent on both journals netting to zero individually.

Any help you could provide on the above would be appreciated but I understand it’s a time-consuming and complex issue - I was just trying to understand if it’s even feasible

Hi @taylorpeter55 , maybe you’d like to revise your two rules so that it’s clearer for @takbb when he’s modifying the workflow later. At this point, I think you might not realise that these two rules may contradict one another on certain occasions.

In other words, the first 6 rows for the ‘Amount’ column might not net up to zero.

And/Or

The remaining rows for the same group might not net up to zero.

In such cases, you’ll have to forego either one of those two rules & prioritize one over the other.


With regards to the maximum number of line items, this is actually 900 (rather than 6) - but for illustration purposes I reduced it to 6 so we could have a workable example. If we get a working solution I will update it to 900 rows.

In relation to the two separate journals, let me try to explain. If our row limit for a journal is 900. One particular journal entry may have 1,000 rows - meaning we need to split it into two separate journals.

However, when we sort by Amount (as advised by Daniel & takbb above), the rows will not always be in matching pairs, as line item values do not always offset on a 1-for-1 basis.

We have situations where one row of $10,000 is actually offset by 5 different rows of -$2,000 each.

Now, let’s assume the line item worth $10,000 is row 888 of the journal, and rows 899 through 903 are the offsetting rows worth -$2,000 each.

If we were to split at line 900 as a fixed number, the first journal would have 900 rows - including the $10,000 (row 888) and TWO -$2,000 rows (from rows 899 & 900). This means there would be a balance of +$6,000 left on journal one.

The remaining 3 rows (901, 902 & 903), worth -$2,000 each, would be transferred to the start of the SECOND journal. The second journal would therefore have 100 rows (1,000 - 900) but would be left with a balance of -$6,000 (as the offsetting lines appear in the first journal).

What we would need KNIME to do in this instance is either split the journal at row 887 (meaning all 6 rows involved in this balance would move to the second journal) or split from row 904 (meaning the offsetting rows all appear in the first journal) - i.e. journal one = 887 rows and journal two = 113 rows, or journal one = 903 rows and journal two = 97 rows.

In summary, I need to be able to check whether the line items in journal 1 net to zero. Then, if they don’t, we need to work either forwards or backwards through the rows, adding the amounts to the first journal total until it equals zero with any remaining rows moved to the second journal.
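The check-and-adjust idea described here could be sketched roughly as follows - a simplified, hypothetical illustration that only walks the split point backwards from the page limit, and assumes a zero-netting prefix exists:

```python
# Walk the split point backwards from the page limit until the first
# portion nets to zero. Returns 0 if no zero-netting prefix is found.
def find_split(amounts, page_size):
    total = sum(amounts[:page_size])
    split = page_size
    while split > 0 and round(total, 2) != 0:
        split -= 1
        total -= amounts[split]
    return split

# The $10,000 / -$2,000 example would instead need a forward walk, but the
# principle is the same: adjust the boundary until the prefix sums to zero.
rows = [-350.61, 350.61, -1000.0, 500.0, 500.0, 250.0, -250.0]
print(find_split(rows, 6))  # -> 5 (the first 5 rows net to zero)
```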

I hope I’ve explained that ok as it’s quite difficult to summarise. Thank you all for reading it.

Thanks.
tp


Hi @taylorpeter55 , thanks for the update and additional rules. This is certainly going to be more of a challenge: somewhere within a specific set of entries for a journal, we are going to need to find all the groupings of values that, when summed (or by themselves), make the value of each journal entry, and then determine which groupings are “optimal”.
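As a rough sketch of what “finding the groupings” involves: a brute-force subset search like the hypothetical Python below works, but is exponential in the journal size - exactly the scaling concern discussed later in the thread:

```python
from itertools import combinations

# Brute-force sketch: yield index subsets whose amounts net to zero.
# Amounts are converted to integer cents to avoid float rounding issues.
def zero_subsets(amounts, max_size=4):
    cents = [round(a * 100) for a in amounts]
    for size in range(2, max_size + 1):
        for combo in combinations(range(len(cents)), size):
            if sum(cents[i] for i in combo) == 0:
                yield combo

rows = [-1000.0, 600.0, 400.0, -550.27, 550.27]
print(next(zero_subsets(rows)))  # -> (3, 4): -550.27 balances +550.27
```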

This part has echoes of a post from almost a year ago, which I think I’ll be reviewing to see if some of the ideas can be pulled in here.

By the way, do you have Python installed in your KNIME environment? That might determine the route we take.

This was last year’s post which you may find of interest too…

Quick question, when you are balancing:
… you have basically a set of negative numbers to be balanced by a set of positive numbers. But is it true to say that there will always be a SINGLE NEGATIVE value balanced by ONE OR MORE POSITIVE values?

So the following “balances” could occur:

-1000 = +1000
-1000 = +600, +400
-1000 = + 500, +400, +100

But the following “balances” will never occur:
-1000, -2000 = +3000
-1000 = + 600 +500 -100

Is that correct?


Hi @takbb,

I do have Python installed, but I’ve never actually used it (beyond my capability) - these are the nodes that I have available:

Thanks for sharing a related thread, I’ll have a read of that later.

All of the following balances could definitely occur:

-1000 = +1000
-1000 = +600, +400
-1000 = + 500, +400, +100

The following might occur as well, but the logic we work with is that all negatives and positives can be grouped together:
-1000, -2000 = +3000 (this could happen)

-1000 = + 600 +500 -100 would actually be -1000,-100 = +600, +500

We may even see -550, -250 = +200 +275 +325.

And worse still, we have decimal places to contend with as well - these have to reconcile to 0.00.

So this is possible:

-550.27, -425.25 = +975.52
or
-550.27, -425.25 = +200, +475.52, +300

Essentially, the only truth is that overall a journal will balance to zero across all rows. When we have to split a journal into two, it could be any combination of + rows and - rows that make the two journals but they both must balance to zero.

As you said, it is certainly a challenge and one that I’ve yet to make a significant breakthrough on - so I do really appreciate you taking the time to read my question and providing responses. A problem shared is a problem halved as they say haha

If it isn’t possible to accomplish, then that is a valid answer too - I just need to factor that in going forward.

Thanks.
tp


Good news: I have come up with an algorithm that I think will be able to work out the groupings…

Bad news: there is no way it would scale as-is to journal sizes of 900, as it slows exponentially as the set size increases. It does appear to work fine on small samples.

Better news though… I’m hoping that some knowledge of the real world data would allow for optimisations.

So some questions.

1. What is your typical number of rows per “raw” journal (i.e. before you break it down into “pages” of 900 or whatever)?

2. What is the likely maximum number of journal lines you would have to balance together to make a match?
What I mean by this is that you may have a raw journal of 2,000 entries, but you can possibly group it into blocks of maybe 10 lines that balance each other to zero, and perhaps in the majority of cases your balancing block sizes are just 2 rows (i.e. one row balances one other row). The smaller that “maximum block size”, the better for optimisation.

3. In the majority of cases, do you have one row balancing one other row? Again, if the answer is yes, there is a very good optimisation that can be done to deal with those rows first and reduce the processing burden.
Thanks


Hi @takbb, Apologies for the delayed response. That is certainly good news - thanks so much!

1. Having journals with line items over the 900 limit is a rarity - we should see this relatively infrequently, but when we do, there could unfortunately be tens of thousands of lines. The last one we had was circa 34,000 - which had to be broken up manually into groups of 900 or fewer (hence what we are looking at haha).

However, the majority of journals that we have will likely be under the 900 threshold which in itself is good news as I already have that logic now

2. & 3) Really, we have two types of scenarios: ones where there are offsetting pairs of rows (i.e. 1 row offsetting with another), and others where we might have 5 lines of differing values offsetting with 2 lines of differing values (i.e. it will offset, but across an unpredictable number of rows).

In the scenario where we have offsetting pairs, we can definitely group into blocks that offset with each other - i.e. 10 x -500 and 10 x 500 could be offset.

Unfortunately, we don’t really have a maximum number of rows, as we will see lots of different amounts in our journals - however, if we can strip out the stuff that we know balances, then I imagine that would definitely help performance.
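The “strip what we know balances” optimisation could look something like this hypothetical Python sketch: exactly-offsetting amounts are paired off first, leaving only the unpaired remainder for the expensive grouping search:

```python
from collections import Counter

# Pair each amount with an equal-and-opposite amount where one exists.
# Works in integer cents so 0.01-level values compare exactly.
def strip_offsetting_pairs(amounts):
    counts = Counter(round(a * 100) for a in amounts)
    paired, remainder = [], []
    for value in list(counts):
        if value > 0 and counts.get(-value, 0):
            n = min(counts[value], counts[-value])
            paired.extend([value / 100, -value / 100] * n)
            counts[value] -= n
            counts[-value] -= n
    for value, n in counts.items():
        remainder.extend([value / 100] * n)
    return paired, remainder

paired, rest = strip_offsetting_pairs([-500.0, 500.0, 500.0, -1000.0, 250.0, 250.0])
print(sorted(rest))  # -> [-1000.0, 250.0, 250.0, 500.0]
```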

Thanks.
Pete

Hi @taylorpeter55 , I think I’m getting somewhere with the “theory” and have an idea how to make it return results before the universe goes cold…

Would you be able to upload an anonymized large sample of valid journal data (of the order of a few thousand rows for multiple journals) that would be typical, so that I can give it a whirl and iron out a few issues.


Hi @takbb, Apologies for the delayed response, I was out of the office on vacation on Friday.

Thanks so much for your continued help on this

Please see attached example journals as requested. If you need anything else, please let me know.
Example journal balances 1.xlsx (138.1 KB)

Thanks,
tp


Hi @taylorpeter55 , I hope you had a good day on Friday!

I think my workflow has managed to achieve a result. Before I upload it, it needs a “little” tidying up as it contains a lot of superfluous debug stuff, but in the meantime, here is the output it produced for the sample `Example journal balances 1.xlsx` sheet.

This assumes a journal “page size” of 900 rows. If you could take a look and see if this output looks ok, I’ll work on tidying up the flow so it is a little more “understandable”. In the meantime, obviously let me know if you think it got it wrong!

Journal Details.xlsx (134.9 KB)

For each original journal no, it reproduces the original journal number, text and amount, plus a “journal sub number” which is “1”,“1a”,“1b” and so on.

By the way, I notice when I read back in Excel that Excel itself is prone to crazy rounding errors even when performing basic addition. So added together, the numbers which sum to 0 appear in Excel as -1.88244E-09 (i.e. zero to you and me!). I had the same problem with Java, which is also very quick to introduce rounding errors on non-integer numbers, even when just performing addition. To circumvent this, during processing I multiply all numbers by 100 at the beginning of the process and deal with them as integers, which then get divided by 100 again at the other end.
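The multiply-by-100 trick can be demonstrated in Python using the journal 3 sample figures from earlier in the thread:

```python
# Journal 3 sample data from earlier in the thread: nets to zero exactly.
amounts = [-350.61, 47466.42, -57466.38, 57766.01, 1350.61,
           -57466.42, 57466.38, -57766.01, -350.61, 10350.61, -1000.0]

# Naive float addition may leave a tiny residue instead of 0.0:
print(sum(amounts))

# Multiply by 100, work in integer cents, divide back at the end:
cents = [round(a * 100) for a in amounts]
print(sum(cents) / 100)  # -> 0.0, exactly
```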


Hi @takbb, Wow - this is brilliant

I’ve checked your output for the following, and it all looks really good:

• Each journal being split into portions of less than 900 lines
• Total line items for each journal matches across all portions
• Sum of lines in each portion equals zero

I’ve attached my review file for your reference →
Journal Details Logic result check.xlsx (127.2 KB)

I do have one question though regarding Journal 3. The lines are split into the following portions:

3 - 900
3a - 899
3b - 886
3c - 419
3d - 594

Groups 3c and 3d definitely need to be split, as their combined total is over 900 lines - however, please can you help me understand why they are in such small groups? i.e. why is the split not more uneven, with 3c being a number closer to 900 and 3d being a smaller group?

I’ve looked at the line items in 3d and there are some offsetting pairs which I guess could be included in 3c. I’m sure there is a reason but just trying to understand haha

Journal 2 is also similar, with a split of 897, 773, 260.

For comparison, Journal 1 is split (900, 899, 224).

Thanks again for all of your help - what you have achieved is amazing and I’m extremely grateful for your help!

Thanks
tp


Hi @taylorpeter55 , thank you for reviewing the output, and giving valuable feedback. Always nice to see somebody actively involved rather than simply firing off questions and then seemingly walking away.

With regard to the journal splitting, this was down to the way the code works, processing “chunks” of journal lines in one go. I was trying to find a trade-off between performance (the number of records it could handle in one go) and the optimum filling of the pages. I was limiting the maximum size of any one subset, which made it less optimal in terms of filling pages.

The good news is that after some tweaking, and after discovering an area that I hadn’t optimised the way I thought I had, I can now set the “chunk size” the same as the “page size”. As a result, in a test run I have now done with the same input file, I get this result, which I think you will agree is much closer to the ideal of 900 per page:

I have uploaded this to the hub here, as I think it may be something useful to others, and also will allow me to revise it at a later date (and improve the documentation).

It’s not for the faint-hearted … using recursion in a Java Snippet along with a recursive loop in KNIME

Please take a look. I’ve added annotations to the flow. Give it a try and please get back to me if you have any questions, or find any problems.

It was an interesting challenge. I hope it works for you… Enjoy

When I find time, I may turn a portion of this into a component as I can see it potentially being useful to a wider audience.


Hi @takbb, Thank you so much, I’ll review the workflow over the next few days and get back to you once done.
I know I’ve already said it but I do really appreciate you investing time to look at this for me - I agree it is something that will definitely be of use to other users.

Thanks again.
tp


This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.