Duplicate Row Filter - more parameters

Hello,
I use the Duplicate Row Filter node with Name + Number as the key columns, to keep only one row per person. I would also like to keep the row that contains the most information.
Example:
DRF.xlsx (9.2 KB)

After DRF:
VETY.xlsx (4.1 KB)
For example, for SEURIN I want to keep the row whose split2 is “Hopital de Saint Laurent du Maroni”.
Is there a way to choose the best row?
Thanks

Hello @Brain,

Duplicate Row Filter doesn’t offer this, and it would be hard to implement generically, because what counts as “most/best information” is subjective. Here is what you can try: use a Group Loop over the same columns you use in the Duplicate Row Filter node, and implement the selection yourself inside that loop. For example, if “most information” means the longest value in a particular string column, compute that with the length() function in the String Manipulation node, then use Duplicate Row Filter to leave only the desired row.

Hope this helps.

Br,
Ivan

Hi @Brain, taking @ipazin’s suggestion a step further: the “advanced settings” of the Duplicate Row Filter can select not just the “first” or “last” row, but also the row with the “maximum” or “minimum” value of a chosen column.

So, as already mentioned, you need to derive an algorithm that defines “most/best”, for example using the length() function.

You could, for example, use Column Aggregator before the Duplicate Row Filter to create a concatenated string of all columns, and then take the length of that string using String Manipulation. Alternatively, you could use Column Aggregator to count the populated columns, or use Rule Engine to define a “score” based on which columns are not missing.
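To make the scoring ideas concrete outside of KNIME, here is a minimal plain-Python sketch of two of them: the length of all values concatenated, and the count of populated columns. It does not reproduce the nodes themselves. The rows are made up for illustration (only the Name, Number and split2 column names come from the question; split1 and the values are assumptions), and `None` stands for a missing cell.

```python
# Hypothetical sample rows; None stands for a missing cell.
rows = [
    {"Name": "SEURIN", "Number": "1", "split1": "Dr", "split2": None},
    {"Name": "SEURIN", "Number": "1", "split1": "Dr",
     "split2": "Hopital de Saint Laurent du Maroni"},
]


def concat_length_score(row):
    """Length of all non-missing values joined into one string."""
    return len("".join(str(v) for v in row.values() if v is not None))


def populated_count_score(row):
    """Number of columns whose value is not missing."""
    return sum(1 for v in row.values() if v is not None)


for row in rows:
    print(row["Name"], concat_length_score(row), populated_count_score(row))
```

With either score, the second row ranks higher than the first, which matches the row the original poster wants to keep.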

Once you have a column containing a score for each row, the Duplicate Row Filter can do the rest by retaining the row with the highest “score”.
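The “keep the row with the maximum score per duplicate group” step can be sketched in plain Python as follows. This illustrates the logic only, not how the node is implemented; `keep_best_rows` and `populated_count` are hypothetical helper names, the sample data is made up, and keeping the first row on a tie is an assumption.

```python
def populated_count(row):
    """Score: number of columns whose value is not missing (None)."""
    return sum(1 for v in row.values() if v is not None)


def keep_best_rows(rows, key_columns, score):
    """Keep one row per key: the highest-scoring one (first row wins ties)."""
    best = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        # Replace the stored row only when the new one scores strictly higher.
        if key not in best or score(row) > score(best[key]):
            best[key] = row
    return list(best.values())


# Hypothetical sample data; None stands for a missing cell.
rows = [
    {"Name": "SEURIN", "Number": "1", "split2": None},
    {"Name": "SEURIN", "Number": "1",
     "split2": "Hopital de Saint Laurent du Maroni"},
    {"Name": "OTHER", "Number": "2", "split2": None},
]

deduplicated = keep_best_rows(rows, ["Name", "Number"], populated_count)
for row in deduplicated:
    print(row)
```

Passing the score as a function makes it easy to swap in a different definition of “best” without touching the deduplication logic.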

I’ve put together a demo here:
[Edit: I’ve updated this as I realised my original was over-complicated, and could be simplified so it is less confusing.]

Duplicate Row Filter options.knwf (27.9 KB)

Yep, that was the idea @takbb, only now I see that the group loop is not necessary for the most/best function calculation :sweat_smile:
Ivan

I haven’t written a new component for a few days, so now even simpler :wink:

Duplicate Row Filter options 2.knwf (40.2 KB)

If anybody has suggestions for other generic scoring algorithms (I’ve now also added “Average data length of populated columns”), please let me know and I’ll consider adding them.
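For readers wondering what an “average data length of populated columns” score might compute, here is one possible reading as a plain-Python sketch. It is an assumption based on the name alone, not the component’s actual implementation; in particular, treating empty strings as missing and returning 0.0 for a fully empty row are choices made here.

```python
def average_populated_length(row):
    """Mean string length over the populated (non-missing, non-empty) values."""
    values = [str(v) for v in row.values() if v is not None and str(v) != ""]
    if not values:
        # Assumption: a row with nothing populated scores 0.
        return 0.0
    return sum(len(v) for v in values) / len(values)


# Hypothetical row: two populated values of length 4 and 2, one missing.
example = {"a": "abcd", "b": None, "c": "ab"}
print(average_populated_length(example))
```

Unlike a simple count of populated columns, this score favours rows whose filled-in values are long, so a row with a few detailed values can outrank one with many short values.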

Absolutely perfect and brilliant.
Easy to use.
Many thanks!
