Find and remove duplicates across multiple columns

zhuma · February 28, 2023, 2:11am

I want to remove any duplicate synonyms from my dataset when compared to the name field.

I think I could split the synonyms field and then use rule engine to check if the name appears in any of the synonym split columns, but then I am not sure how to actually remove. I also only want to check across rows, i.e. any duplicates which may appear in e.g. ID 1 or ID 4 should not be removed. Any ideas? I was also trying unpivoting, but then not sure what to do next.

Thank you!

AlexanderFillbrunn · February 28, 2023, 10:35am

Hi @zhuma,
Maybe you can simply use a String Manipulation node? You could first replace any occurrence of the name in the synonym with an empty string and then clean up the separators. For only replacing, that would be: replace($SYNONYMS$, $NAME$, ""). Keep in mind, though: this will mess up potential synonyms that contain the name, like name=abc, synonym=abcdef, output=def. This can be alleviated by replacing only when the name is preceded or followed by a space:

replace(replace($SYNONYMS$, join($NAME$, " "), ""), join(" ", $NAME$), "")

This does not deal with the case where the synonym is just the name. For that you can add a Rule Engine node and check. Now to clean up, you can use the regexReplace function:

regexReplace(replace(replace($SYNONYMS$, join($NAME$, " "), ""), join(" ", $NAME$), ""), "(^ *\\$\\$ *)|( *\\$\\$ *$)|( *\\$\\$ +\\$\\$)", "")

This replaces any $$ with surrounding spaces in the beginning or end of the result, and any pattern where $$ appears twice, separated by spaces.
Attached you find a screenshot with my results using that expression.
Kind regards,
Alexander

denisfi · March 1, 2023, 4:23pm

Can you use rule node to find a match with regular expression too?

Example attached.

string filter.knwf (13.7 KB)

If yes, you can create a column with a simple expression with the name value as the example below and apply a rule node to check it for you.

The simple rule can be:

$SYNOMIMOUS$ MATCHES $Filter$ => TRUE
TRUE => FALSE

I’m saying here that IF I found a Match / exists, I’ll set TRUE, otherwise the default value will be FALSE.

With the result at a field, you can filter or split the flow for other operations as example below:

I hope helped you!

Tks,

Denis

system · May 30, 2023, 4:24pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.