Regex works in regexReplace (String Manipulation) but not in Replacer

Vexatious_Outlier · February 7, 2022, 11:27pm

I’ve got this regex which works fine in the String Manipulation node, but I want to use it as part of a document preprocessing flow and therefore need to use the Replacer node.

This is the regex:

(?<!\w{2})\s(?!\w{2})

I use it with an empty replacement string to turn: “A B C Taxi Co L T D” into “ABC Taxi Co LTD”
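For anyone who wants to check the pattern outside KNIME, here is a minimal sketch in plain Java. It only shows how the Java regex engine treats the expression, not how either node calls it internally; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class InitialsJoiner {
    // A whitespace character that is neither preceded nor followed by two word characters.
    // Backslashes are doubled because this is a Java string literal.
    private static final Pattern LONE_SPACE = Pattern.compile("(?<!\\w{2})\\s(?!\\w{2})");

    // Hypothetical helper: removes every whitespace character the pattern matches.
    static String joinInitials(String text) {
        return LONE_SPACE.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(joinInitials("A B C Taxi Co L T D")); // ABC Taxi Co LTD
    }
}
```

The spaces around "Taxi" and "Co" survive because each has two word characters on at least one side, so one of the two lookarounds rejects the match.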

The source of the text is an XPath query, and I have had to replace the space characters because some of them were U+2002 (en space), which caused me some headaches. I have explicitly replaced them in the flow and the String Manipulation node works fine, so I’m guessing it’s not related.

I’m using KNIME 4.5.0.

Any ideas how to get the Replacer node working or should I just use the String Manipulation node and convert the output back into a document?

Thyme · February 8, 2022, 8:24am

Hmm, both the String Manipulation and the String Replacer nodes work for me on a string column. Can you share more details about your problem? What does your input look like, and what’s the desired output (especially the data type)?
RegEx manipulation String Replacer.knwf (15.1 KB)

Vexatious_Outlier · February 8, 2022, 10:15am

Hi Thyme,

This is my workflow. I’ll update KNIME and have a look at your workflow.

Fuzzy Match.knwf (41.8 KB)

As you can see, I want to continue with NLP, so I need the output to stay as Documents.

Thanks,

Rob

Thyme · February 8, 2022, 12:31pm

I don’t have the Textprocessing or the Python Extension installed, but I reconstructed your workflow up until the Duplicate Row Filter, using http://www.sports-clubs.net/Sport/Clubs.aspx?Name=A in the Webpage Retriever.
Maybe I’m not seeing the full picture that way, but shouldn’t you be able to use the String Manipulation directly after the Duplicate Row Filter? Case conversion can be done with the upperCase/lowerCase functions in the String Manipulation node as well.

Vexatious_Outlier · February 8, 2022, 2:44pm

That’s effectively what I’ve done, but I dropped in a Python node to handle all the cleaning rather than String Manipulation to avoid a torturous expression or multiple nodes. I was getting my regexes only partially working with the Replacer node.

From reading around, I think the problem is due to Java having a narrower view of what a whitespace character is than other languages do. However, that doesn’t explain the difference between the String Manipulation node and the Replacer node, unless the String Manipulation node calls on a “fixed” Java library for its regexing.
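The whitespace point can be illustrated in plain Java. By default `\s` covers only ASCII whitespace, so U+2002 is skipped unless `Pattern.UNICODE_CHARACTER_CLASS` (or the inline flag `(?U)`) is set. This is a sketch of the engine's behaviour only; whether either KNIME node sets that flag is not something I know, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class WhitespaceCheck {
    static final String REGEX = "(?<!\\w{2})\\s(?!\\w{2})";

    // Default mode: \s matches only space, tab, newline, vertical tab, form feed, carriage return.
    static final Pattern ASCII_WS = Pattern.compile(REGEX);

    // Unicode mode: \s follows the Unicode White_Space property, which includes U+2002.
    static final Pattern UNICODE_WS = Pattern.compile(REGEX, Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: removes every match of the given pattern.
    static String strip(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "A\u2002B\u2002C Taxi"; // initials separated by en spaces
        System.out.println(strip(ASCII_WS, input).equals(input)); // true: nothing was removed
        System.out.println(strip(UNICODE_WS, input));             // ABC Taxi
    }
}
```

If stray en spaces were still present in the Documents, this would produce exactly the "partially working" symptom, although that remains a guess.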

Thyme · February 8, 2022, 3:00pm

Yeah, I like to use Java Snippets whenever I’m unhappy with the number of nodes I’d have to use.

The only difference I can see is that the String Manipulation node needs a double backslash to escape characters. This is probably because the pattern is first read as a Java string literal and only then handed to the regex engine, with each step “using up” one backslash.
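A small sketch of that escaping step in plain Java, assuming the String Manipulation expression is compiled like Java source (my assumption, not something confirmed here), whereas a plain dialog text field would pass the characters through unchanged. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // In Java source, "\\s" is a two-character string: a backslash followed by 's'.
    static final String FROM_SOURCE = "\\s";

    // Hypothetical helper: true if the regex matches a single ordinary space.
    static boolean matchesSpace(String regex) {
        return Pattern.compile(regex).matcher(" ").matches();
    }

    public static void main(String[] args) {
        System.out.println(FROM_SOURCE.length());      // 2
        System.out.println(matchesSpace(FROM_SOURCE)); // true
    }
}
```

So the regex engine sees the same two characters either way; only the way they are typed differs.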

Does that mean your problem is fixed? Not sure whether or not you need more help.

Vexatious_Outlier · February 8, 2022, 3:59pm

One of the things I like about KNIME is it’s so easy to jump into a language if you hit a wall or want to do something unusual.

I’m fine and don’t need any more help. I was raising this more to flag the issue than because I was blocked. From an aesthetic view, I like to keep things as KNIME as possible though, so it would have been nice to be able to use the Text Processing nodes. I’d suggest passing it to whoever looks after those nodes as a possible bug or item for improvement, since it’s easily replicable with the workflow and it does work fine with the String Manipulation node. I’ve tried it with a single backslash, and with an actual space rather than the escaped character, and it still doesn’t work for me with Replacer.

Thanks for looking at the issue for me.
