De-duplicating words/substrings in a string

Hi everyone,

Is there a function, metanode, or workflow similar to "removeDuplicates()" (in String Manipulation) that, instead of just replacing runs of two or more spaces with a single space, finds and removes duplicate words and/or substrings within a string?

To find duplicate substrings, we might need to specify a parameter such as substring length.

Has anyone made a similar workflow? Thank you.

Hi Marcellus,

For removing duplicate words, you can use a RegEx in a String Manipulation node. Just use the following code; you only need to adapt it to your column name:

regexReplace($column_name$, "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+", "$1")

As for removing whole substrings, you can try to build on this solution: http://stackoverflow.com/questions/38683612/java-regex-to-remove-duplicate-substrings-from-string

 

Cheers,

Roland
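To see what the pattern above does outside of KNIME, here is a minimal sketch in plain Python. It assumes Python's `re` module treats this particular pattern the same way the Java regex engine behind regexReplace does; the function name `collapse_repeated_words` is made up for illustration. The doubled backslashes from the KNIME expression become single backslashes in a Python raw string.

```python
import re

# Same pattern as in the regexReplace call, written as a Python raw string.
PATTERN = re.compile(r"(?i)\b([a-z]+)\b(?:\s+\1\b)+")

def collapse_repeated_words(text):
    # Replace a run of the same word (separated by whitespace) with one copy.
    return PATTERN.sub(r"\1", text)

print(collapse_repeated_words("this is is a a test"))  # this is a test
print(collapse_repeated_words("a b a"))                # a b a (not adjacent, so unchanged)
```

Only repeats that directly follow each other, separated by whitespace, are collapsed.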

This regexReplace code does remove duplicates but only when they are positioned consecutively in the string. I was hoping for a solution that would also work for non-consecutive duplicates.

Nevertheless, it certainly removes some of my problems. Thank you very much Roland. :)
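For non-consecutive duplicates, one possible approach is to split the string into words and keep only the first occurrence of each. The following is a rough sketch in plain Python, not a KNIME node or function; the name `dedupe_words` and the choice to compare words case-insensitively are assumptions made for this example.

```python
def dedupe_words(text):
    seen = set()   # lower-cased words already kept
    kept = []      # first occurrences, in original order
    for word in text.split():
        key = word.lower()
        if key not in seen:
            seen.add(key)
            kept.append(word)
    return " ".join(kept)

print(dedupe_words("the cat saw the other Cat"))  # the cat saw other
```

Splitting on whitespace means punctuation stays attached to words, so "cat," and "cat" would count as different words in this sketch.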

Another approach would be the following sequence:

  1. String to document
  2. Bag of Words (BoW) creator
  3. Terms to string
  4. Concatenate

It will likely lose the original word order, though; I didn't check.

Cheers
E
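As a loose analogy for the bag-of-words idea above, here is a short plain-Python sketch that reduces a string to its set of distinct terms. It is not a reproduction of the KNIME nodes, and how those nodes actually order their output is not verified here; the name `unique_terms` is hypothetical.

```python
def unique_terms(text):
    # A set keeps each distinct word once but has no defined order,
    # so the result is sorted here just to make the output reproducible.
    return sorted(set(text.split()))

print(" ".join(unique_terms("red blue red green blue")))  # blue green red
```

The duplicates are gone, but the words no longer appear in their original order, which is the concern raised above.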


Hi,

I’m having the same issue. I’ve tried this recommended code:
regexReplace($column_name$, "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+", "$1")
but it did not work.

I have a string like: Cat,Dog &Cat,Dog &Cat,Dog &Cat,Dog,Cat, Cat, Mouse,Dog,Dog,Bird,Bird
I would expect: Cat,Dog &Cat,Dog,Cat,Mouse,Dog,Bird
But after I run the code it still shows: Cat,Dog &Cat,Dog &Cat,Dog &Cat,Dog,Cat, Cat, Mouse,Dog,Dog,Bird,Bird

Can anyone help me with this?

Thanks,
Winanda

EDIT: To follow this question, see here: de-deplucating words in a string
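A possible explanation for the comma-separated example above: the regex only collapses repeats that are separated by whitespace (the `\s+` part), while here the repeated items are separated by commas. The expected output looks like what you get by splitting on commas, trimming spaces, and collapsing items that repeat directly after each other. Here is a minimal sketch of that reading in plain Python; the name `collapse_adjacent` and the `sep` parameter are made up for illustration, and this is not a KNIME function.

```python
from itertools import groupby

def collapse_adjacent(text, sep=","):
    # Split into items and trim surrounding spaces from each one.
    items = [item.strip() for item in text.split(sep)]
    # groupby groups runs of equal neighbours; keep one item per run.
    return sep.join(key for key, _ in groupby(items))

s = "Cat,Dog &Cat,Dog &Cat,Dog &Cat,Dog,Cat, Cat, Mouse,Dog,Dog,Bird,Bird"
print(collapse_adjacent(s))  # Cat,Dog &Cat,Dog,Cat,Mouse,Dog,Bird
```

Items that repeat later but not directly after each other (such as the second "Dog") are kept, which matches the expected result in the post.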
