REMOVE DUPLICATE AND GET CORRECT DATA IN KNIME

trafalgarlaw · November 19, 2024, 9:22am

Hello -

May I ask how can I get the correct line for the 3 cases below with duplicate IDENTIFIER?

In case 1, I need to retain the first line because in transaction date, the other two was posted or reversed last October and unique one is september.

In case 2, different to case 1 wherein there is no same month but the unique one should always be the latest month posted hence need to retain last line.

In case 3, the transaction date has the same month which is July however the one should be retained is the last line because it has the unique code (column code) which is 12345, both lines have 54321 code.

Hope you can help me figure out how to get this and ensure all of these 3 cases will be considered in every run of my workflow. Thank you!

Case 1
IDENTIFIER	Type	Code	Company Name	Transaction Date
ABC144896	Posted	12345	ABC	09/30/2024	RETAIN
ABC144896	Reversal	12345	ABC	10/13/2024	REMOVE
ABC144896	Posted	12345	ABC	10/24/2024	REMOVE



Case 2
IDENTIFIER	Type	Code	Company Name	Transaction Date
EFG144896	Posted	67890	EFG	08/31/2024	REMOVE
EFG144896	Reversal	67890	EFG	09/30/2024	REMOVE
EFG144896	Posted	67890	EFG	10/31/2024	RETAIN



Case 3
IDENTIFIER	Type	Code	Company Name	Transaction Date
HIJ320597	Posted	54321	HIJ	07/30/2024	REMOVE
HIJ320597	Reversal	54321	HIJ	09/25/2024	REMOVE
HIJ320597	Posted	12345	HIJ	07/30/2024	RETAIN

@takbb

takbb · November 19, 2024, 9:44am

Hi @trafalgarlaw, l will see if I can find some time to think about this later, but I have some questions which others may be wondering too…

In the rows in case 1, you are saying that the reversal cancels the third row because it is in the same month even though the reversal occurs before the transaction you say should be reversed?
Which row would have been kept in the case 1 example of the date of the first row had been October 1st?
Will you ever have more than 3 rows for a given identifier?
Will there every be more than one reversal for an identifier?

trafalgarlaw · November 19, 2024, 12:25pm

@takbb - thanks for responding…

Case 1
IDENTIFIER	Type	Code	Company Name	Transaction Date
ABC144896	D	12345	ABC	09/30/2024	RETAIN
ABC144896	R	12345	ABC	10/13/2024	REMOVE
ABC144896	D	12345	ABC	10/24/2024	REMOVE



Case 2
IDENTIFIER	Type	Code	Company Name	Transaction Date
EFG144896	D	67890	EFG	08/31/2024	REMOVE
EFG144896	R	67890	EFG	09/30/2024	REMOVE
EFG144896	D	67890	EFG	10/31/2024	RETAIN



Case 3
IDENTIFIER	Type	Code	Company Name	Transaction Date
HIJ320597	D	54321	HIJ	07/30/2024	REMOVE
HIJ320597	R	54321	HIJ	09/25/2024	REMOVE
HIJ320597	D	12345	HIJ	07/30/2024	RETAIN


Case 4
IDENTIFIER	Type	Code	Company Name	Transaction Date
ABC144896	D	12345	ABC	10/31/2024	RETAIN
ABC144896	D	12345	ABC	10/01/2024	REMOVE
ABC144896	R	12345	ABC	10/24/2024	REMOVE



Case 5
IDENTIFIER	Type	Code	Company Name	Transaction Date
EFG144896	D	67890	EFG	08/31/2024	REMOVE
EFG144896	R	67890	EFG	09/30/2024	REMOVE
EFG144896	D	67890	EFG	10/15/2024	REMOVE
EFG144896	R	67890	EFG	10/20/2024	REMOVE
EFG144896	D	67890	EFG	10/31/2024	RETAIN

In the rows in case 1, you are saying that the reversal cancels the third row because it is in the same month even though the reversal occurs before the transaction you say should be reversed?

Answer - I have changed the Type to avoid confusion. See above updated data. So for Case 1, duplicate lines with the same month, should be removed and the one with different month should be retained.

Which row would have been kept in the case 1 example of the date of the first row had been October 1st?

Answer - I have added Case 4 in which if same month for all, the latest date will be retained.

Will you ever have more than 3 rows for a given identifier?

Answer - Yes, thanks for your question, I have added Case 5. But still it should get the latest date.

Will there every be more than one reversal for an identifier?

Answer - Yes there will be though I am planning to do a filtering of R after getting the results for all cases so if the retained lines contain a type R, I will remove or filter it and get only with type D.

Hope I answered all your questions. Let me know if I have missed anything. Thank you!

mlauber71 · November 19, 2024, 1:01pm

@trafalgarlaw you can use H2 local database and define rules which item per group will remain by sorting within this group. If you have string variables you could assign numbers that would determine the rank of the item.

Another option would be to create group ranks and then only keep the first one per this group:

trafalgarlaw · November 19, 2024, 2:01pm

hello @mlauber71, in case 3, will it also be possible to use rank?

mlauber71 · November 19, 2024, 2:09pm

@trafalgarlaw What you can do is make a Group By your IDENTIFIER and then count the number of Codes per this group and then join that back to your data and then use the number of these codes per IDENTIFIER as an additional sorting item

trafalgarlaw · November 19, 2024, 2:28pm

Thank you @mlauber71!

Will wait for @takbb then check again your proposal. Thank you!

mlauber71 · November 19, 2024, 2:32pm

@trafalgarlaw BTW your examples do have some duplicates in the IDs - is that deliberate? Maybe you can clear them up and provide us with a consistent set of data?

trafalgarlaw · November 19, 2024, 2:41pm

@mlauber71 - Actually, these are the cases I can’t figure out. Hence, putting all cases here because in usual data, possible only one here will be in the report. Let me know what exact or consistent data are you referring to so I can provide as well. Thank you!

Case 1
IDENTIFIER	Type	Code	Company Name	Transaction Date
ABC144896	D	12345	ABC	09/30/2024	RETAIN
ABC144896	R	12345	ABC	10/13/2024	REMOVE
ABC144896	D	12345	ABC	10/24/2024	REMOVE
Case 2
IDENTIFIER	Type	Code	Company Name	Transaction Date
EFG144896	D	67890	EFG	08/31/2024	REMOVE
EFG144896	R	67890	EFG	09/30/2024	REMOVE
EFG144896	D	67890	EFG	10/31/2024	RETAIN
Case 3
IDENTIFIER	Type	Code	Company Name	Transaction Date
HIJ320597	D	54321	HIJ	07/30/2024	REMOVE
HIJ320597	R	54321	HIJ	09/25/2024	REMOVE
HIJ320597	D	12345	HIJ	07/30/2024	RETAIN
Case 4
IDENTIFIER	Type	Code	Company Name	Transaction Date
ABC144896	D	12345	ABC	10/31/2024	RETAIN
ABC144896	D	12345	ABC	10/01/2024	REMOVE
ABC144896	R	12345	ABC	10/24/2024	REMOVE
Case 5
IDENTIFIER	Type	Code	Company Name	Transaction Date
EFG144896	D	67890	EFG	08/31/2024	REMOVE
EFG144896	R	67890	EFG	09/30/2024	REMOVE
EFG144896	D	67890	EFG	10/15/2024	REMOVE
EFG144896	R	67890	EFG	10/20/2024	REMOVE
EFG144896	D	67890	EFG	10/31/2024	RETAIN

mlauber71 · November 19, 2024, 2:52pm

@trafalgarlaw maybe ask in another way:

what constitutes a group where you would like to retain only one line
what is the order of rules to consider when ranking them

ABC144896 and EFG144896 are there two times. What is it that constitutes a group where you want to retain one line. In the examples you are retaining two lines. What would be the rule for that.

Rules can be complicated: ID with the least amount of unique codes in one month or so. But the rules will have to be precise.

trafalgarlaw · November 19, 2024, 3:26pm

That’s a great catch, @mlauber71. Sorry I just put the Case 4 and Case 5 with same IDs in Case 1 and Case 2. But it should never happened to have two different cases with same IDs. It will always be unique.

Here’s the updated case data to avoid confusion:

Case 1
IDENTIFIER	Type	Code	Company Name	Transaction Date
ABC144896	D	12345	ABC	09/30/2024	RETAIN
ABC144896	R	12345	ABC	10/13/2024	REMOVE
ABC144896	D	12345	ABC	10/24/2024	REMOVE



Case 2
IDENTIFIER	Type	Code	Company Name	Transaction Date
EFG144896	D	67890	EFG	08/31/2024	REMOVE
EFG144896	R	67890	EFG	09/30/2024	REMOVE
EFG144896	D	67890	EFG	10/31/2024	RETAIN



Case 3
IDENTIFIER	Type	Code	Company Name	Transaction Date
HIJ320597	D	54321	HIJ	07/30/2024	REMOVE
HIJ320597	R	54321	HIJ	09/25/2024	REMOVE
HIJ320597	D	12345	HIJ	07/30/2024	RETAIN


Case 4
IDENTIFIER	Type	Code	Company Name	Transaction Date
KLM144896	D	12345	ABC	10/31/2024	RETAIN
KLM144896	D	12345	ABC	10/01/2024	REMOVE
KLM144896	R	12345	ABC	10/24/2024	REMOVE



Case 5
IDENTIFIER	Type	Code	Company Name	Transaction Date
NOP144896	D	67890	EFG	08/31/2024	REMOVE
NOP144896	R	67890	EFG	09/30/2024	REMOVE
NOP144896	D	67890	EFG	10/15/2024	REMOVE
NOP144896	R	67890	EFG	10/20/2024	REMOVE
NOP144896	D	67890	EFG	10/31/2024	RETAIN

mlauber71 · November 19, 2024, 4:44pm

@trafalgarlaw I still do not fully get the logic but maybe you can play with this sample workflow:

trafalgarlaw · November 25, 2024, 1:52pm

Hello @takbb, hope all is good… hope you get a chance to look how to solve this Really appreciate your help. Many thanks in advance!

takbb · November 25, 2024, 4:23pm

Hi @trafalgarlaw , I’ve been watching the thread but not had much time to devote to it. I’m trying to determine if the different cases that you have fit any simple patterns, but at the same time consider whether patterns would still hold if you had additional rows in any of the scenarios.

At the momentI’m tending to think that a Group Loop (grouped on IDENTIFIER) with an inner Recursive Loop to attempt to reduce down the group to a single row, based on the application of a set of rules might be the way forward.

I’d favour the type of approach in the direction I think @mlauber71 is headed, involving ranking and sorting to determine the “winner” in each case, but I haven’t got there with this one yet, as the different rows in each group need to be played off against each other.

The basic “rules” appear to be:

If there is a D and an R of the same Identifier and Code in the same month, then the earliest D and R for that month for that Identifier and Code are both removed
If there is a D and an R with the same Identifier and Code in different months then the earliest D and R for that Identifier and Code are both removed.

I believe that these two rules if recursively executed in that order until no further rows can be removed will resolve all of the cases you have listed. I’ve yet to see if that’s right though… so still a work in progress

takbb · November 25, 2024, 5:14pm

@trafalgarlaw, based on the above, please see this workflow:

I think it resolves all of the cases in your sample data.

(disclosure: I borrowed (stole) the Excel data from @mlauber71 's workflow… with thanks! )

trafalgarlaw · November 27, 2024, 8:16am

Thank you @takbb. I am still in 4.7.8 KNIME version. Would you mind sharing the configuration of your Row Splitter and other two Row Filter? Thank you so much!!

trafalgarlaw · November 27, 2024, 10:27am

Here also @takbb, just want to know the config of row splitter and row filter. I have row splitter and row filter in my node repository it is just that the one you use is the updated one.

I think just a screenshot of the configuration will do.

But once I come to know your config, it will run perfectly. Thank you in advance!

takbb · November 27, 2024, 10:46am

Hi, yes it’s using the updated one because it was written in using KNIME 5.3.3. It would be nice to be able to choose to use an older version but sadly that option isn’t available.

The comments on the workflow for the filters and splitters should be reasonably self-explanatory, but here is the detail:

Row Splitter :
Top output port condition is: Type = D

Both Row Filters:
Output is Row Number = 1 (first row)

trafalgarlaw · November 27, 2024, 11:53am

Thank you @takbb! I was able to do the editing or updating in Row Splitter however the sorter is kinda asking of the variable. Would you know what should I put in the flow variables inside joiner?

Also, I noticed that Joiner cannot be opened.

takbb · November 27, 2024, 12:07pm

I think the joiner config can’t be opened, simply because the flow isn’t valid prior to the joiner. With the Row Splitter fixed, and executed, the joiner config should open. The joiners in this flow aren’t using flow variables.

The joiner annotated “Join R to D on same code and month” is configured to join on these columns (images from Modern UI node config):

The second joiner annotated “Join R to D on same code IGNORE MONTH” is configured to join on just the code column:

I don’t know why your Sorter is complaining. I suspect it just needs resetting. Open the configuration and ensure it is set to Ascending Transaction Date.
That is all it does.