Row Filtering

Hi

I am trying to remove some rows from an invoice data set. An example of the data is shown in the table below where I have identified duplicates based on the supplier mane, PO Number and Abs Invoice Amount:

And this is the output I desire:

I am struggling with the rule needed for a rule-based row filter. If I use (Duplicate Classifer LIKE “chosen” AND Invoice Amount < 0) then I can remove the row 1. If I then use (Duplicate Classifer LIKE “duplicate” AND Invoice Amount > 0) to remove row 2, it will also remove row 4 which is not what I want.

Also, just to add, I can’t group by PO Number and filter on PO Numbers with a zero sum as some PO Numbers have additional amounts associated with them.

Does anyone have any ideas on how I can solve this?

Thanks
Chris

Hi @fostc80857

You have to be careful with the LIKE function in a rule-based row filter. As per the description, it depends on a wildcard pattern. Something which you don’t have at the moment :wink:

image

2 Likes

Hello @fostc80857
You can do a ‘Group by’ $Supplier Name$ and $PO Number$ with Sum($Invoice Amount$) as ‘Manual aggregation’; and exclude zeroes from the result. An Inner Join from original data vs group filtered should give you the expected output.

BR

1 Like

Hi @fostc80857

I’m not sure why you omitted Invoice Amount while identifing dupicates. If you change this setting, Duplicate Classifier column will change too and you will be closer to the solution. GroupBy node (as mentioned by @gonhaddock) may be usefule then.

Happy KNIMEing,
Kaz

2 Likes

Hi @Kazimierz

Having read yours and @gonhaddock reply, I am not sure that starting the process by identifying duplicates will give me the results I need. I am going to adopt @gonhaddock approach of using groupby and then excluding zeros to see if that gives me the result I need.

@fostc80857 identifying duplicates could limit the number of rows to work with, but should be followed by other actions and groupinng is one of them.
Good luck!

1 Like

HI @fostc80857 , what is the actual rule that you are wanting to use for removing rows, because I don’t think you’ve stated it. I get that you want to remove duplicates, but you seem to have a further rule (nothing to do with duplicates). Whilst you’ve told us the rows you want to retain, you’ve left it to us to make assumptions about the rule you are applying.

The way I read it, your rule is:

If Supplier Name, PO Number and Abs Invoice Amount are duplicated then…
… if a duplicated row contains a negative amount, remove that
… otherwise, arbitrarily delete the duplicate row

Is that it?

But… which would you delete if one of the “1234” rows had an amount of 5000.00 and the other was 10000.00, whilst the Abs Invoice Amount was still 10000.00 ?

And which would you delete if one of the “1234” rows had an amount of 5000.00 and the other was 4000.00, whilst the Abs Invoice Amount was still 10000.00 ?

I think you would need to define the specific basis on which you wish to remove rows rather than treat this as a purely “duplicate removal” problem, since there is clearly more to it than that.

It may be that you want to use the additional “Row Selection” config in the “advanced” settings of the Duplicate Row Filter such as telling it to select the duplicate row that has “maximum of Invoice Amount”.

Or it might be you need to perform it in several steps, e.g. mark the duplicates, delete rows on a specific basis, then mark duplicates again to see if there are any remaining, and delete remaining duplicates on a different basis.

2 Likes

Hi @takbb

To prepare my dataset before applying any analysis, what I need to do is remove rows that effectively cancel to zero. My initial thoughts were that I would use the “abs” function in the math formula node to create a column of only positive invoice amounts.

This would then allow me to identify duplicate rows based on Supplier Name, PO Number and Abs Invoice Amount. My next step was to removed any duplicate rows with a negative Invoice Amount using the following rule:

(Duplicate Classifer LIKE “chosen” AND Invoice Amount < 0)

Whilst this would remove the row with the Negative Invoice amount, I still had the problem of removing the associated row with the positive Invoice Amount, hence the request for help on the forum.

With regards your two examples above (with the differing amounts for one of the “1243” rows I don’t think these cases would ever arise as the ‘Abs Invoice Amount’ column is a calculated column which simply takes the absolute value of the Invoice Amount column.

I think your last suggestion about performing several steps of identifying duplicates and then deleting rows on specific basis might work.

In fact, as I write this reply, I am thinking of the following:

  • marking duplicates then removing the row with the negative Invoice Amount.
  • marking duplicates again then removing rows where the first duplicate mark was either chosen or duplicate and the second duplicate mark was unique.
2 Likes