State if it's a duplicate based on a row

carlos_amaya · May 11, 2023, 11:52pm

Hello all!
I’m fairly new to Knime, hope you can help me identify below situation, grouping by DCN’s, sometimes there are rows that are duplicate (same information on all columns but different reference) and that should be flagged as a duplicate (first line was the payment and second line was another payment made by mistake), this was already solved using Duplicate Row Filter.

Some other times, there’s an additional line with an H on the credit column, which means that the additional payment amount was returned, I’m having issues trying to find a way to validate that line with the H is being a counter-payment for one of the S rows.

Thanks in advance!

HeatherPikairos · May 12, 2023, 8:08am

Hi @carlos_amaya maybe a solution could be to change the columns you are using for duplicate detection in the -Duplicate Row Filter- node?

I have used Dummy data and only two columns but for example, if you use only the DCN column for detection you would get:

If you use DCN and the Debit/Credit column for detection, the result would be:

Does the reference column need to be the same to be a duplicate? Or this column can be different?

Heather

takbb · May 12, 2023, 9:09am

Hi @carlos_amaya,

I see that I may have overcomplicated the requirement, and the solution may be more simple but since I had worked on this already, I will post it anyway!

If my understanding is correct, you are saying that:

a “duplicate S” row is no longer considered a duplicate if there is a valid H row that reverses it, and
an H row is considered valid if it exactly reverses ONE S row
a duplicate S row that is not reversed is still considered invalid
an H row that has no corresponding S row is considered invalid

At the end you want to flag all S rows that are duplicates but not reversed, as well as all H rows that are incorrect reversals.

The solution presumably needs to take into account that there could be multiple H and S rows for any given DCN and that it must take into account the idea that a single H row can only be considered a reversal for one of the S rows. This adds some additional complication with ensuring that we count reversal only once.

Extending your example data set to cover a number of issues, I looked at the following:

DCN	Reference	DETAIL-FIELD-A	DETAIL-FIELD-B	Debit/Credit	comment
10349347	513477894	312	GAL	S	valid
10349347	11816177	312	GAL	S	valid because revered
10349347	413343494	-312	GAL	H	valid reversal
123	27953295688	200	ABC	S	valid
123	45093121452	200	ABC	S	valid because reversed
123	33481082957	200	ABC	H	invalid reversal
456	67057236481	30	XYZ	S	valid
457	57581440870	-30	XYZ	H	invalid reversal (wrong DCN!)
123	432646278	200	ABC	S	invalid because duplicated and not reversed
123	54543643	-200	ABC	H	valid reversal

I have added a comment to the right showing what I believe should be the outcome. Where there are reversals, it is arbitrary which is actually considered the valid remaining S row, as there is no stated rule for that.

I think that the attached workflow will achieve this, and perhaps can be adapted to your data. The numeric data in this workflow are integers. If your real data is non-integer you will need to adjust it accordingly; in particular the math formula node that negates the result is specifically configured to return an integer and would need to be modified or else the joins will fail.

Validating rows for duplicates and reversals.knwf (52.6 KB)

The end result is this:

My Flag column is currently better described as a “Comment” column, but you can adapt this to simply return OK/FAIL or whatever you require. I left it like this as it helps with testing.

You will note that the choice of lines that are considered reversed differs to my “test” table listed earlier, because it arbitrarily matches the S and H lines on a first-match basis, but the net result should be the same.

If it is possible for a single H to reverse more than one S, or a combination of H lines reverse a combination of S lines, then the solution would unfortunately become a great deal more complex. (I have done something like that before but I had to use some java to simplify the process, and it took me over a week to write! I hope therefore that this “simplified” solution helps.

system · May 19, 2023, 3:25pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.