String Replace: Are RegEx Flags supported?

Hi,

when being tasked with extracting a sub-string from a cell that contains line breaks, it seems that the multiline flag is not working.

Considering that case sensitivity is managed by a separate option instead of (?i), I believe the RegEx flags are ignored altogether. Or not?

https://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#MULTILINE

Here is the test workflow:

Best
Mike

Hi @mwiegand ,
Thanks for sharing the workflow. I will look at the workflow and ask internally about this.

Thanks,
Sanket

Hi @mwiegand,

thanks for explaining the problem and sending an example workflow.
The issue here seems to be that a flag like (?m) only applies to the part of the pattern that comes after it. So, while <pattern>(?m) does not have any effect, (?m)<pattern> will enable multiline mode for your pattern. I agree that the Java Docs could be a bit more specific here :smile: .
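To illustrate the flag position, here is a minimal sketch in plain Java. It assumes the node hands the pattern to java.util.regex unchanged, which I have not verified; the class and method names are made up for this example.

```java
import java.util.regex.Pattern;

class FlagPositionDemo {
    // Returns true if the pattern matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "a\nb\nc";
        // Flag at the end: ^ and $ come before it, so they still anchor to the whole string.
        System.out.println(found("^b$(?m)", input)); // false
        // Flag at the start: ^ and $ now match at line boundaries.
        System.out.println(found("(?m)^b$", input)); // true
    }
}
```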

Furthermore, I think that you want to enable dotall mode instead of multiline mode. Dotall makes the dot character . also match newlines, while multiline makes the caret ^ and dollar sign $ match at the start/end of each line instead of only at the start/end of the whole string.
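The difference between the two modes can be sketched in plain Java as follows. The input string and the helper name firstMatch are invented for illustration and are not taken from the test workflow.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallVsMultilineDemo {
    // Returns the first match of the pattern in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "first\nsecond";
        // Without dotall, the dot does not match the newline.
        System.out.println(firstMatch("first.second", input) == null);     // true
        // Multiline only changes ^ and $, so the dot still stops at the newline.
        System.out.println(firstMatch("(?m)first.second", input) == null); // true
        // Dotall lets the dot match the newline.
        System.out.println(firstMatch("(?s)first.second", input) != null); // true
        // Multiline lets ^ and $ anchor to a single line inside the string.
        System.out.println(firstMatch("(?m)^second$", input));             // second
    }
}
```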

Let me know if that explains and solves your issue or if you have further questions :slight_smile:

Best wishes,
Jasper


Hi @jeeesper,

prefixing the RegEx with the flag as suggested, i.e. (?m)<pattern>, doesn’t fix it.

About dotall vs. multiline: I see a benefit in being able to match only until the end of a line, so bluntly enabling dotall mode might not be beneficial. That said, I’d rather see the flags integrated as visual options, like the case sensitivity setting, to prevent “RegEx flag collisions” and to make these features more visible to those who are not RegEx enthusiasts :wink:

Best
Mike

Hi @mwiegand,

can you maybe give an example of what you want to extract from your multi-line string cell? The regular expression you use (with only multiline enabled) will only match a two-line string whose lines are separated by a single newline character. The input string has three lines, and Windows uses \r\n for line breaks, so the single \s won’t match the whole line break and, because dotall mode is not enabled, neither will the dot (.). And since the pattern does not match the whole string (see the “Replacement strategy” setting), no replacement will take place.
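Here is a small sketch of the line-break issue in plain Java. It assumes that the “whole string” replacement strategy behaves like Matcher.matches(), which is my reading of the setting and not a confirmed implementation detail. The sample strings are made up, and \R (any line break sequence) needs Java 8 or later.

```java
import java.util.regex.Pattern;

class LineBreakDemo {
    // Returns true only if the pattern matches the entire input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String unix = "one\ntwo";
        String windows = "one\r\ntwo";
        String threeLines = "one\r\ntwo\r\nthree";

        // A single \s covers \n, but only one of the two characters in \r\n.
        System.out.println(matchesWhole("one\\stwo", unix));    // true
        System.out.println(matchesWhole("one\\stwo", windows)); // false
        // \R matches either kind of line break.
        System.out.println(matchesWhole("one\\Rtwo", windows)); // true
        // The third line is not covered by the pattern, so there is no whole-string match.
        System.out.println(matchesWhole("one\\Rtwo", threeLines));       // false
        // With dotall, a trailing .* can consume the remaining lines.
        System.out.println(matchesWhole("(?s)one\\Rtwo.*", threeLines)); // true
    }
}
```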

Best,
Jasper

Hi @jeeesper,

when processing text-based files such as PDF and Word, but also Excel, CSV, you name it, data in cells frequently contains multiple lines. Splitting cells at line breaks to be able to extract the data is not desired as it would scramble the data.

I have developed techniques to cope with these situations, but it’s quite a chore. Here is one example where I recently faced the issue when trying to help someone in the forum.

Best
Mike

Hi @mwiegand,

please have a look at this configuration and the associated output (I changed the input as well) – With the dotall mode enabled, but not multiline:

Like this, it should be possible to process multi-line strings just like single-line strings. Neither the String Replacer nor other nodes such as the String Splitter (Regex) distinguish between single- and multi-line strings.

Regarding your concern of “blindly enabling” a mode: You can enable a mode for only part of the pattern by either

  • Creating a scope with modified flags: ...(?s:<pattern>)...
  • Disabling the flag after the part of the pattern where you want to match across lines: ...(?s)<pattern>(?-s)...
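Both options can be tried out in plain Java with a sketch like the one below. The input string and the helper name firstMatch are invented for illustration; the point is that the .* inside the dotall part crosses line breaks, while the .* after it stops at the end of the line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScopedFlagDemo {
    // Returns the first match of the pattern in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "start\nfoo\nend tail\nmore";
        String expected = "start\nfoo\nend tail";

        // Scoped flag: dotall only applies inside the (?s:...) group.
        System.out.println(expected.equals(firstMatch("(?s:start.*end).*", input)));     // true
        // Switching the flag off again with (?-s) gives the same result.
        System.out.println(expected.equals(firstMatch("(?s)start.*end(?-s).*", input))); // true
        // Dotall for the whole pattern: the trailing .* also swallows the last line.
        System.out.println(input.equals(firstMatch("(?s)start.*end.*", input)));         // true
    }
}
```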

I encourage you to play around with an example on regex101.com.

Regarding a more user-facing interface: I will relay that idea internally. However, we do not want to overload the node dialog with too many configuration options.

I hope I could clear things up a bit. If not, please do not hesitate to reach out again :slight_smile:

Best,
Jasper
