KNIME duplicate row filter cannot remove all duplicate data

emshihab · September 10, 2021, 12:15pm

Hi Everyone,
I have an issue. When I try to remove duplicate string data using the Duplicate Row Filter node, it does not remove all of the duplicates. After exporting the data to an Excel file (using the Excel Sheet Appender), I can remove the duplicates with Excel's built-in “Remove Duplicates” function. How can I resolve this issue?

Regards,
Ekram

bruno29a · September 10, 2021, 12:45pm

Hi @emshihab , I have used this node many times and have never encountered such an issue.

Can you please share your data or sample data and how you configured the node? Or perhaps share your workflow?

Could you also show the expected results for the data?

emshihab · September 10, 2021, 1:03pm

Due to data security restrictions, I cannot share the data right now, but I will create some sample data and share it with you.

emshihab · September 10, 2021, 2:18pm

I think I found the issue. Excel 365 ignores letter case and treats values that differ only in case as duplicates, but KNIME compares strings case-sensitively and considers both rows unique. Can you suggest how to resolve this?

elsamuel · September 10, 2021, 2:25pm

I would convert them all to the same case using a String Manipulation node and then do the duplicate row removal.

bruno29a · September 10, 2021, 2:27pm

Hi @emshihab , convert them to lower case, and then remove the duplicates.

If you want to keep the original data as is, you can write the lowercase values into a new column, apply the Duplicate Row Filter on that new column, and then remove the new column after the operation.
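
For readers who want to see the logic outside KNIME, here is a minimal Python sketch of the same idea: compare on a lowercased key while keeping the original values untouched. The sample part numbers and the function name are hypothetical (the real data was not shared), and keeping the first occurrence of each duplicate is an assumption about the desired behaviour.

```python
# Hypothetical sample data; the data from the thread was never shared.
rows = [
    {"Part number": "ABC-100"},
    {"Part number": "abc-100"},
    {"Part number": "XYZ-200"},
    {"Part number": "Abc-100"},
]


def drop_case_insensitive_duplicates(rows, column):
    """Keep the first row for each value of `column`, ignoring letter case."""
    seen = set()
    kept = []
    for row in rows:
        # Temporary comparison key, playing the role of the helper column.
        key = row[column].lower()
        if key not in seen:
            seen.add(key)
            kept.append(row)  # the original casing is preserved in the output
    return kept


deduplicated = drop_case_insensitive_duplicates(rows, "Part number")
print(deduplicated)
```

Because the lowercased key only lives inside the loop, there is no extra column to remove afterwards, which mirrors the "add column, filter, drop column" sequence described above.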

emshihab · September 10, 2021, 2:47pm

Thanks for the solution.

Daniel_Weikert · September 10, 2021, 3:57pm

Also keep in mind that the “famous” whitespace characters can be a pain as well if you forget to remove them.
br

bruno29a · September 10, 2021, 4:56pm

That’s a very good point @Daniel_Weikert .

@emshihab , you can use strip() to get rid of leading and trailing whitespace:
lowerCase(strip($Part number$))

In parallel, you can also check for whitespace by adding some marker text before and after your values:
join("XXX", $Part number$, "XXX")

As you can see, the last 2 records have a whitespace at the end.
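
The same check and cleanup can be sketched in plain Python, in case it helps to see the logic outside the String Manipulation node. The sample values and helper names below are hypothetical, and Python's `str.strip()` is assumed to be a close enough stand-in for KNIME's strip() for this purpose.

```python
# Hypothetical values; two of them carry stray whitespace.
values = ["ABC-100", "abc-100 ", " XYZ-200", "XYZ-200"]


def mark(value, marker="XXX"):
    """Wrap a value in marker text so leading/trailing whitespace becomes visible."""
    return f"{marker}{value}{marker}"


def normalize(value):
    """Strip surrounding whitespace, then lowercase, before comparing."""
    return value.strip().lower()


marked = [mark(v) for v in values]
unique_keys = sorted({normalize(v) for v in values})

for original, shown in zip(values, marked):
    print(repr(original), "->", shown)
print(unique_keys)
```

A gap between the value and the marker (as in `XXXabc-100 XXX`) points to whitespace that would otherwise make two visually identical values count as different.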

After removing duplicates:

I put something together for you. Here’s what the workflow looks like:

Here’s the workflow: Remove duplicate different case.knwf (9.7 KB)

emshihab · September 10, 2021, 7:03pm

Thank you very much for the solution.
