Remove all backslash-starting substrings

Hi, I’ve tried all six regex and non-regex methods I know of from the forum to remove every substring that starts with a backslash. Simple ones like \n are of course manageable, but the longer ones are difficult, especially when several are written next to one another without spaces. Please have a look:

CleanUp.knwf (135.3 KB)

Many thanks!

Besides the methods above, I also thought of listing the substrings first and then removing them with rules or a Cell Replacer, or something to that effect. Since most of them are 6 characters long including the backslash, I tried:

substr($Text$, indexOfChars($Text$, "\\"), 6)

which gave me:

which, as you can see, only works for the first occurrence. It also has an undesired side effect: for cells that don’t contain any backslash, it returns the first few characters!

Can anyone share tips to improve this idea further, or suggest a totally different workaround?
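One way to take the listing idea further is to collect every backslash-led substring with a regex matcher instead of a single index lookup, so that repeated occurrences are found and rows without any simply yield an empty list. Below is a minimal Java sketch of that idea, not a KNIME node configuration. The class name, method name, and sample strings are hypothetical, and the pattern assumes the substrings are either \n or \u followed by four hex digits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BackslashTokenLister {
    // Assumption: a token is a backslash followed by 'n',
    // or by 'u' plus exactly four hex digits.
    private static final Pattern TOKEN =
            Pattern.compile("\\\\(?:n|u[0-9A-Fa-f]{4})");

    // Returns every token in order of appearance; empty list if there are none.
    static List<String> listTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The doubled backslashes below are Java source escaping;
        // the actual text holds single literal backslashes.
        System.out.println(listTokens("It\\u2019s late\\n\\nBye"));
        System.out.println(listTokens("no escapes here")); // prints []
    }
}
```

Because the matcher walks the whole string, it does not matter how many tokens a cell holds or whether they sit next to each other without spaces.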

Hi @badger101

Would this be close to your desired output?

Yes, that looks right, @ArjenEX. Is it a fully automated workaround?

I tried the Strings to Document route today: I created a Bag of Words, intending to follow up with a Cell Replacer. But getting even halfway down this route takes this much memory (shown in orange). :rofl: And that is in a fresh session, so it doesn’t include the memory used by upstream and downstream tasks.

(My dataset contains 110k rows of text.)

Yes @badger101, it’s actually a one-node solution.

removeDuplicates(regexReplace(column("Column0"),"(\\\\\\w\\\\\\w)|(\\\\[A-Za-z0-9]{5}\\b)"," "))

Since \n\n appears to be used consistently, it should work, but a sample of 10 rows only tells so much. Some other cases may not be captured yet, but they should be manageable with additional OR alternatives.

That sounds great! I’m trying it out now and will update here. Meanwhile, can you explain the multiple backslashes in the script? I notice there is a \b in there too. I have never used it, but based on other people’s posts on the forum, it’s for a word boundary, right? What’s the difference between using just one \b (as you did here) and using two of them (as in the other forum posts I’ve read)?

Update: I’m getting this error when applying the script:

ERROR at line 65
The method column(java.lang.String) is undefined for the type Expression26
Line : 64 public java.lang.String internalEvaluate() throws Abort {
Line : 65 return removeDuplicates(regexReplace(column(“Column0”),“(\\\w\\\w)|(\\[A-Za-z0-9]{5}\b)”," "));

In KNIME, you have to escape certain special characters, and the backslash is one of them. Hence all the extra backslashes: they escape the \w and the literal \.

See the Regex in action here:

To get the KNIME equivalent of this Regex, click on Code Generation, select Java and copy the final String regex from the code block and paste it in whichever node you’re using.
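To illustrate the two layers of escaping, here is a small Java sketch (Java because that is the language picked in Code Generation). It assumes the expression's string literal behaves like a Java string literal: each pair of backslashes in the source collapses to one backslash before the regex engine sees the pattern. The class name, method name, and sample text are made up for illustration.

```java
import java.util.regex.Pattern;

final class EscapeLayersDemo {
    // The regex literal from the one-node expression, written as a Java string.
    static final String LITERAL = "(\\\\\\w\\\\\\w)|(\\\\[A-Za-z0-9]{5}\\b)";

    // Replaces every match with a single space.
    static String clean(String text) {
        return Pattern.compile(LITERAL).matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Shows the pattern after the string layer of escaping is removed.
        // prints: (\\\w\\\w)|(\\[A-Za-z0-9]{5}\b)
        System.out.println(LITERAL);

        // Sample text with literal backslashes (doubled here only for Java source).
        // Each match becomes one space, so no backslash remains in the output.
        System.out.println(clean("Hello\\n\\nworld \\u2019 again"));
    }
}
```

In the printed pattern, `\\` stands for one literal backslash in the data and `\w` for any word character, so the first group reads "backslash, word character, backslash, word character".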

On second look, the boundary can actually be omitted. It was in there as a safeguard, but the quantifier already controls the length.
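A short sketch of what the trailing \b changes, under the assumption that the data contains codes glued to a following letter. \b only matches between a word character and a non-word character, so with it in place a five-character code directly followed by another letter or digit is skipped. A second \b placed in front of the backslash would additionally require a word character immediately before the code. The class name and sample string are hypothetical.

```java
import java.util.regex.Pattern;

final class BoundaryDemo {
    // Backslash plus five alphanumerics, followed by a word boundary.
    static final Pattern WITH_BOUNDARY =
            Pattern.compile("\\\\[A-Za-z0-9]{5}\\b");

    // Same pattern without the trailing boundary.
    static final Pattern WITHOUT_BOUNDARY =
            Pattern.compile("\\\\[A-Za-z0-9]{5}");

    static String strip(Pattern p, String text) {
        return p.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // A code glued to the letter 's' (backslashes doubled for Java source).
        String glued = "it\\u2019s fine";
        System.out.println(strip(WITH_BOUNDARY, glued));    // text is left unchanged
        System.out.println(strip(WITHOUT_BOUNDARY, glued)); // prints: it s fine
    }
}
```

Without the boundary, the pattern takes any five alphanumerics after a backslash, so a \n glued to a longer word would lose four of that word's letters as well. Requiring a literal u after the backslash, as the later expression in this thread does, is one way to limit that.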

As for that error, I’m not sure what it could be. I would rebuild the expression from scratch and select the proper column.

It’s definitely running.
CleanUp - Regex.knwf (142.5 KB)

@ArjenEX Thanks for this attempt. It seems that the script works in Column Expressions but not in String Manipulation, but that’s fine with me. I just find it odd that this happens.

Anyway, I would like to adopt this solution since I’ve tested it on my real dataset and it seems to be working. I like that it’s a one-node solution and it’s a straightforward automatic task.

But I have one more thing to optimize. Can you modify it a little so that it also detects Unicode substrings that are attached to a non-Unicode word? I guess this case wasn’t included in the dummy workflow, hence your script did not detect it. Here’s an example:

And here’s the new dummy dataset:
CleanUp.knwf (9.5 KB)

Besides alphanumeric characters, the non-Unicode word may also start with one of these 3 characters: ( @ #
(Examples for @ and # are not included in this new dataset, as I’m having difficulty locating them in such a large volume of text.)

Hi @badger101

This should get you pretty far:

removeDuplicates(regexReplace(column("column1"),"(\\\\\\w\\\\\\w)|(\\\\[u][A-Za-z0-9A@#\\\\(]{4})|([\\\\][n])"," " ))

See it in action here:

I’ve applied it to both datasets so far, as well as to some loose codes with (, @ and #.
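For anyone who wants to test the expression outside KNIME, here is a rough Java sketch. It copies the regex literal from the post above and assumes that removeDuplicates collapses runs of spaces into one. The class name, the sample strings, and the TIGHTER pattern are hypothetical and not part of the posted solution.

```java
import java.util.regex.Pattern;

final class FinalPatternSketch {
    // Regex literal copied from the posted expression.
    static final Pattern POSTED = Pattern.compile(
            "(\\\\\\w\\\\\\w)|(\\\\[u][A-Za-z0-9A@#\\\\(]{4})|([\\\\][n])");

    // Hypothetical tighter variant (not from the thread):
    // backslash-u plus four hex digits, or backslash-n.
    static final Pattern TIGHTER =
            Pattern.compile("\\\\(?:u[0-9A-Fa-f]{4}|n)");

    // Replace each match with a space, then collapse runs of spaces,
    // which is what removeDuplicates is assumed to do here.
    static String clean(Pattern p, String text) {
        return p.matcher(text).replaceAll(" ").replaceAll(" {2,}", " ");
    }

    public static void main(String[] args) {
        // Backslashes are doubled only because this is Java source.
        String sample = "Don\\u2019t stop\\n\\nNow\\nend";
        System.out.println(clean(POSTED, sample));    // prints: Don t stop Now end

        // A newline code directly followed by a Unicode code.
        String adjacent = "line\\n\\u2019quote";
        System.out.println(clean(POSTED, adjacent));  // prints: line 2019quote
        System.out.println(clean(TIGHTER, adjacent)); // prints: line quote
    }
}
```

The second sample shows one case worth checking in real data: when \n is directly followed by a \u code, the first alternative of the posted pattern consumes the backslash and the u, leaving the four digits behind. If that combination occurs, trying the \u alternative first, or using something like the tighter variant, may be worth a test.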

Thanks a bunch! @ArjenEX Works like a charm, really appreciate it! :ok_hand:
