Finding duplicate values in columns per row

cason · January 20, 2019, 3:17pm

Hi,

I have a data set in KNIME that looks like this (columns and rows):

ID - Prod1 - Prod2 - Prod3 - Prod4 - Prod5

1 - 23 - 24 - 25 - 26 - 27
2 - 23 - 24 - 25 - 26 - 26

As you can see there is a duplicate on the second row (26 and 26) but the first row only contain unique values. I would now like to classify each row based on if all the column values for each row are unique or not, with this ideal output:

ID - Prod1 - Prod2 - Prod3 - Prod4 - Prod5 - UniqueFlag

1 - 23 - 24 - 25 - 26 - 27 - Unique
2 - 23 - 24 - 25 - 26 - 26 - Not unique

What would be the best way to set this up?

Thanks for your help.

izaychik63 · January 20, 2019, 4:52pm

Could you please explain better, if every product number in a separate columns or your data presented of a strings where fields separated by dashes? Ideally you can provide a file with couple of lines of a real data.

armingrudd · January 20, 2019, 8:00pm

Hi,

This is what I can suggest:
unique

Use a “Column Aggregator” node and choose all the columns that you want to check and in the option tab choose “Set” as the aggregation method.
Then a “Column Expressions” node and use this expression:

if (column("Set").length < 5){
"Not Unique"
}
else "Unique"

I used 5 because in your example you have 5 prod columns but you can change the number.

Output:

Best,
Armin

solution

cason · January 26, 2019, 10:57am

Hi @armingrudd,

Thanks for the advice on this and your solution worked perfectly. To continue in the workflow I would like to find a way to replace duplicate Pod1-5 values where the row indicates ‘Not Unique’ in the UniqueFlag, effectively trying to make all row have the flag ‘Unique’ as a final output. Conceptually, I have thought of something like this from a flow perspective:

Filter out all rows that are ‘Not Unique’
Create a reference table of some sort to list say 50 product ID that could we used as a replacement if the row is not unique
Lastly, and this is the tricky bit I think, somehow insert product id’s from the reference list where there is a duplicate value in the original table row (as indicted by the ‘UniqueFlag’).
Ideally I would like to create some sort of loop that replaces values until there are no more flagged ‘None Unique’ rows left

Any ideas on how I can implement this practically? Thanks for your continued help.

armingrudd · January 26, 2019, 11:00am

Hi,

Please create a new topic for your new question.
I think I can help you with that.

Best,
Armin

cason · January 26, 2019, 11:23am

Hi @armingrudd,

Thanks for this. I have created a new topic here:

system · February 2, 2019, 11:23am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.