Duplicates removal

Hello,

I have a topic that was unfortunately closed, so I cannot add any more comments to it.

The link: Duplicates removal

The workflow offered by @goodvirus worked perfectly; however, when I tried to apply it to more sophisticated files, it unfortunately didn't.

The new problem I am facing is:
If there are multiple identical rows, they are all identified as duplicates of a single row.
Hence, all negative transactions are removed, but all the positive ones are still there.

If someone could share their ideas on how to remove those, that would be great. The new example file is attached.

You can follow the link above to the initial topic for reference.

@goodvirus could you please also have a look?

Thank you very much!!! :pray:

Knime example1 .xlsx (9.2 KB)

I cannot look at the example right now. But you could try to use a window function (creating a rank per group after a sort) to identify the right row you want to keep.
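To illustrate the idea, here is a minimal pandas sketch of that rank-per-group approach (hypothetical, not based on the attached file; the column names Name, Date, and Value are assumptions):

```python
import pandas as pd

# Toy data; the real column names may differ.
df = pd.DataFrame({
    "Name":  ["Jack", "Jack", "Jill"],
    "Date":  ["2023-01-02", "2023-01-01", "2023-01-01"],
    "Value": [10, -10, 25],
})

# Rank per group after sorting, like a SQL
# ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Date).
df["Rank"] = df.sort_values("Date").groupby("Name").cumcount()

# Keep only the top-ranked row per group.
print(df[df["Rank"] == 0])
```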

Hi @sherknime,

one question: can a positive value only be removed against a negative value of the same amount?
If not, and the real question is how much Jack owes, you could just do a basic GroupBy with sums:
dub_remove.knwf (13.3 KB)

In real-life scenarios you would only care about the absolute amount someone owes, and it doesn't matter in how many transactions they pay it back.
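For illustration, a rough pandas equivalent of that GroupBy-with-sums idea (a sketch, not the attached dub_remove.knwf; column names are assumed from the example files):

```python
import pandas as pd

df = pd.DataFrame({
    "Name":  ["Jack", "Jack", "Jack", "Jill"],
    "Value": [10, -10, 20, 25],
})

# Net amount owed per person: paid-back transactions cancel out.
net_owed = df.groupby("Name", as_index=False)["Value"].sum()
print(net_owed)  # Jack nets to 20, Jill to 25
```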

If this is not the case, we would have to modify the previous workflow, but I'm a little short on time, so I would have to revisit this topic at the end of the week.

Best regards,

Paul

1 Like

Hello @goodvirus
Each negative transaction corresponds to a positive one, and positive values can only be removed against a negative value of the same amount.

Unfortunately, finding the total debt doesn't work for this case; only transactions that were paid back can be removed.

My problem is that when I have multiple identical positive transactions, the second, third, fourth, etc. transactions are identified as duplicates of the first one, and as a result they are not removed.

I'll attach one more example file for reference.
Knime example11 .xlsx (9.8 KB)

@sherknime I created an example that would first match all pairs of debt and payment that share the same Name, Country, City and absolute amount of debt and then remove the remaining duplicates. Maybe you can have a look:

3 Likes

Hi @mlauber71

I have tested the workflow, and by manual execution I have received different results than the workflow gives: 11 rows at the end versus 8 in the workflow…

But thank you for the attempt

@sherknime would you care to elaborate on where you see discrepancies? Maybe you could provide a more complete example with additional explanations.

@mlauber71 I just used the file that you were working with. I manually removed all the transactions that needed to be removed and compared the result with KNIME's result. You can simply try to remove them manually, just to check the result.
Unfortunately, I cannot share the real file I'm working with, as it contains company data :grimacing:
But I will create a new file that corresponds to it as closely as possible.

Thank you for helping me out :pray:

1 Like

@sherknime happy to help. I think the important part will be to define what constitutes a pair of transactions to be excluded, also taking into account things like matching volumes.

@mlauber71 the conditions to define what constitutes a pair of transactions are pretty simple.

  1. A pair is when “name”, “country”, and “city” are the same and the “value”s are, e.g., 10 and -10 (see the sketch after this list).
  2. The transaction descriptions should make sense together, for example “Debt for rent” and “Debt for rent paid”. However, this cannot be used in KNIME, as descriptions are not standardized; the wording of a paid-back transaction can differ but still make sense to a human who is reading.
  3. The technique of summing all positive transactions of a particular person and deducting all negative ones cannot be used. Only pairs need to be removed.
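A minimal pandas sketch of rule 1 (rule 2, the free-text descriptions, is deliberately left out, since they are not standardized). The column names are assumptions based on the example files; each positive value is matched with at most one negative value of the same amount, and only complete pairs are dropped:

```python
import pandas as pd

def drop_paired_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows that form a +v/-v pair on the same Name, Country, City."""
    key = ["Name", "Country", "City"]
    abs_value = df["Value"].abs()
    is_positive = df["Value"].gt(0)

    # Number repeated transactions per (key, |value|, sign), so the n-th
    # "+10" gets the same occurrence index as the n-th "-10".
    occurrence = df.groupby(key + [abs_value, is_positive]).cumcount()

    # A pair id appears twice exactly when both halves of a pair exist.
    pair_id = pd.Series(
        list(zip(df["Name"], df["Country"], df["City"], abs_value, occurrence)),
        index=df.index,
    )
    return df[~pair_id.duplicated(keep=False)]

# Example: two of the three +10 rows pair up with the two -10 rows;
# the third +10 has no matching payment and is kept.
df = pd.DataFrame({
    "Name":    ["Jack"] * 5,
    "Country": ["US"] * 5,
    "City":    ["NYC"] * 5,
    "Value":   [10, 10, -10, 10, -10],
})
print(drop_paired_transactions(df))
```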

The workflow offered in the initial topic (linked in the first message of this discussion) is close to working fine. The only issue I noticed there is:
when there are multiple identical positive transactions, the first transaction is classified as “chosen”, but the second, third, etc. positive transactions are classified as “duplicate”.
Thus, all negative transactions that match positive ones are removed, but not all of the matched positive transactions are removed.

The “Duplicate Row Filter” node is not applicable in this case.

The new file is attached. I hope it is clearer now.
Knime example12 .xlsx (11.3 KB)

@sherknime OK, here is another try. This time there is a loop that takes every debt (positive value) and checks for a corresponding negative value at the same or a later date, based on Name, City, Country, and Value. This entry is then removed from the ‘pool’ (this loop with the table being saved is not pretty and might be done with a recursive loop, but I still like it better this way).

So this would be the result, with an ID attached to every debt that has been paid, so one can check whether this makes sense. If you want further matching criteria, you might have to go for some (fuzzy) text matching.
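In plain Python, the loop described above might look roughly like this (a sketch of the idea, not the uploaded workflow; the column names, including Date, are assumptions):

```python
import pandas as pd

def match_debts(df: pd.DataFrame) -> pd.DataFrame:
    """Greedily pair every debt with a same-day or later payment."""
    df = df.sort_values("Date").reset_index(drop=True)
    df["PairID"] = pd.NA
    pool = df.index[df["Value"] < 0].tolist()  # unmatched payments

    pair_counter = 0
    for i in df.index[df["Value"] > 0]:        # every debt, oldest first
        debt = df.loc[i]
        for j in pool:
            payment = df.loc[j]
            if (payment["Name"] == debt["Name"]
                    and payment["City"] == debt["City"]
                    and payment["Country"] == debt["Country"]
                    and payment["Value"] == -debt["Value"]
                    and payment["Date"] >= debt["Date"]):
                pair_counter += 1
                df.loc[[i, j], "PairID"] = pair_counter  # tag both halves
                pool.remove(j)                           # shrink the pool
                break

    # Rows whose PairID is still missing are unmatched transactions.
    return df
```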

2 Likes

@sherknime I've tried to solve the problem here: duplicates removal – KNIME Hub, working on your last example. Hope it helps.

2 Likes

Another simple way to solve problems like this is to use the Column Expressions node to test for your criteria, then use a conditional splitter to separate the target rows. After isolating the target rows, everything becomes easier. At that point you can easily do something simple like a sort, a Lag Column, and another conditional splitter to remove your conditional duplicates, then concatenate to combine the target rows back with the main group.
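To illustrate, a pandas sketch of that sort + lag step, where shift() plays the role of KNIME's Lag Column (the column names and the occurrence-numbering trick are assumptions carried over from earlier in the thread):

```python
import pandas as pd

df = pd.DataFrame({
    "Name":    ["Jack"] * 5,
    "Country": ["US"] * 5,
    "City":    ["NYC"] * 5,
    "Value":   [10, 10, -10, 10, -10],
})

key = ["Name", "Country", "City"]
df["AbsValue"] = df["Value"].abs()
# Interleave the n-th +v with the n-th -v so pair members end up adjacent.
df["Occ"] = df.groupby(key + ["AbsValue", df["Value"].gt(0)]).cumcount()
df = df.sort_values(key + ["AbsValue", "Occ", "Value"]).reset_index(drop=True)

# shift() is the lag: flag a row whose predecessor is the opposite-signed
# half of the same pair, then drop the flagged row and its predecessor.
prev = df.shift(1)
second_half = (
    (df[key] == prev[key]).all(axis=1)
    & (df["AbsValue"] == prev["AbsValue"])
    & (df["Occ"] == prev["Occ"])
    & (df["Value"] == -prev["Value"])
)
drop = second_half | second_half.shift(-1, fill_value=False)
print(df[~drop].drop(columns=["AbsValue", "Occ"]))
```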

I also tend to use the Reference Row Filter with this approach in larger workflows with multiple processing routes. I use it a lot on forensic accounting projects to test, group, and route journal entries.

It is basically the rule-based row splitter approach above, but performing the formula tests via the Column Expressions node opens up a ton more criteria options.

1 Like

Thank you @duristef, I will try this out

1 Like

Thank you @mlauber71. Will check it now

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.