Finding and replacing duplicate values in multiple columns

cason · December 24, 2018, 10:03am

Hi,

I have an interesting yet a bit difficult task ahead of finding and replacing duplicate values in several columns. My data set looks like this:

ID Col1 Col2 Col3 Col4 Col5
123 23811 28732 23811 Blank Blank
124 45362 45362 Blank Blank Blank

I would like to write a function that firstly finds duplicates on each row and replaces this with a value from a look up table (plus checks that the new insert doesn’t cause any new duplicates). For blank values it should replace these with a value from the same look-up list, again without creating any duplicates on the same row.

What’s the best way of setting this up in KNIME? Thanks in advance for your help!

armingrudd · December 25, 2018, 6:24am

Hi,

Would you please provide a sample dataset including your lookup table? And please explain it further that how these blank and duplicate values should be replaced.

Best,
Armin

cason · December 25, 2018, 11:04am

Hi @armingrudd,

Thanks for helping me with this. I have attached the KNIME workflow in question here:

Product_Recommendations_Test.knwf (92.6 KB)

And here is the input data attached:
Online_Retail_Small_Sample.xlsx (457.9 KB)

To explain a little bit further:

If there are a duplicate value on the row it should be replaced by a value from the look-up table as long as the inserted value from the look-up does not already exist on the row
If the row contains blank values it should be filled with random values from the look-up table as long as none of the inserted values are already present on the row
The operation and 1 and 2 should only be applied to the first four value columns (Split Value 1, Split Value 2, Split Value 3, Split Value 4, Split Value 5).

The desired output should be five unique values for each row consisting primarily of values from the original table but if duplicates or blanks are featured on the original list when values from the look-up should be used

Thanks again for your help with this.

armingrudd · December 25, 2018, 5:01pm

Sorry but the workflow you have shared contains one Excel reader node which is not executed.
And the xlsx file is a sales dataset. I couldn’t understand what you want to do with it.
Regarding your question, you may provide a dataset which has duplicate values in some columns of a same row (I guess) and a “look-up” table" which provides the values which can be used instead of those duplicates.