Multiline in the Regex Split node

acommons · May 5, 2019, 6:57am

Can someone explain what the Multiline option is meant to do in the Regex Split node?

I have a regex I’m using of the form (?s)^…regex stuff…$ to parse multiline input in a column. It works fine, does exactly what I want.

If I remove the (?s) from the regex and select the Multiline option in the node configuration instead, the regex no longer works.

If this is expected behaviour then what is the Multiline option actually supposed to do? From the documentation it looks like it is supposed to mimic (?s) but it doesn’t seem to do that!

armingrudd · May 5, 2019, 12:33pm

Hi,

In single-line mode, which is turned on with (?s), the dot “.” matches every character, including the newline character \n. Normally the dot matches any character except a newline.

In multiline mode, which is turned on with (?m), the caret “^” and dollar “$” match the beginning and the end of each line instead of only the beginning and the end of the whole string.
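To make the two modes concrete, here is a minimal sketch in Python rather than in KNIME itself. It assumes the node's regex engine treats (?s) and (?m) the way Python's `re` module does for this input; the two-line value `text` is a made-up stand-in for a cell's contents.

```python
import re

text = "ab\ncd"  # hypothetical two-line cell value

# No flags: "." stops at the newline and "^"/"$" anchor the whole string,
# so "^.*$" cannot span both lines and finds nothing.
no_flags = re.findall(r"^.*$", text)

# (?s) single-line mode: "." also matches "\n", so one match covers everything.
single_line_mode = re.findall(r"(?s)^.*$", text)

# (?m) multiline mode: "^" and "$" match at every line boundary,
# so each line is matched separately.
multiline_mode = re.findall(r"(?m)^.*$", text)

print(no_flags)          # []
print(single_line_mode)  # ['ab\ncd']
print(multiline_mode)    # ['ab', 'cd']
```

A pattern written as (?s)^…$ relies on the dot crossing line breaks, which is why swapping (?s) for (?m) changes the result.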

Best,
Armin

acommons · May 5, 2019, 12:54pm

Thanks Armin!

So the Multiline checkbox is inserting the (?m) option for the regex you supply?

cheers,
Andrew

Edit:

Here is the help for Multiline -
"Enables multiline mode, i.e. when selected the expression ^ and $ match the start and end of the input string. This option only matters if the input string have line breaks. "

Which still looks like (?s) to me.

armingrudd · May 5, 2019, 1:03pm

Exactly!

acommons · May 5, 2019, 1:05pm

I’ve just edited my previous reply but will repeat here.

Multiline help says "Enables multiline mode, i.e. when selected the expression ^ and $ match the start and end of the input string. This option only matters if the input string have line breaks. " which looks more like (?s) than (?m) to me…I think the reference to ‘string’ is what is confusing me. I’m reading it to mean the whole cell contents…

armingrudd · May 5, 2019, 1:14pm

This is telling us that if there are no line breaks in the string (it is a single line of text), the multiline option makes no difference: even when enabled, the regex behaves as it normally does, so ^ and $ match the beginning and the end of the string. But if the string has line breaks (it contains several lines of text), the option becomes useful, and ^ and $ match the beginning and the end of each line.
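The point about line breaks can be checked with a small sketch, again in Python and again assuming the flags behave as they do in the node. The pattern and the sample strings are invented for illustration, and `find_all` is a hypothetical helper.

```python
import re

pattern = r"^\w+$"           # hypothetical pattern: a whole line of word characters
single_line = "alpha"        # no line break
multi_line = "alpha\nbeta"   # one line break

def find_all(flag_prefix, value):
    """Return every match of the pattern, with an optional inline flag in front."""
    return re.findall(flag_prefix + pattern, value)

# Without a line break, (?m) changes nothing.
same_without_breaks = find_all("", single_line) == find_all("(?m)", single_line)

# With a line break, only multiline mode lets "^" and "$" match per line.
plain_result = find_all("", multi_line)
multiline_result = find_all("(?m)", multi_line)

print(same_without_breaks)  # True
print(plain_result)         # []
print(multiline_result)     # ['alpha', 'beta']
```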

Armin

acommons · May 5, 2019, 1:20pm

Ok, I will have to do more R&D on use cases where the (?m) option is required.

Thanks so much for your input again. You are always such a great help!

cheers,
Andrew

armingrudd · May 5, 2019, 1:34pm

Take a look at this workflow I just built:
regex_line_mode.knwf (25.5 KB)

I hope it helps you understand how the modes behave.

Best,
Armin

acommons · May 5, 2019, 2:01pm

If I understand your lovely little example correctly, the (?m) causes the supplied regex to be applied to each line in the input in turn.

That could be quite interesting with some of the text I’m working on!

I will play in the morning (nearly midnight in my timezone).

armingrudd · May 5, 2019, 2:07pm

The caret “^” and the dollar “$” signs match the beginning and the end of each line instead of only matching the beginning and the end of the string.

acommons · May 6, 2019, 1:26am

The caret and the dollar don’t consume the end of line when matching either.

So in your example the regex ^. matched both lines.

So does ^.$

But whilst (?:^|\n). without (?m) matches on the same lines as ^. with (?m), the same is not true when (?:^|\n).(?:\n|$) is used without (?m) in place of ^.$ with (?m). In this last case the end-of-line alternative consumes the newline character, so the pattern fails to match on the second line.

A subtle difference.
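The difference can be reproduced outside KNIME with a short Python sketch. The two-line input is a made-up stand-in for the example workflow's data, and the sketch assumes the node's engine handles anchors and alternation as Python's `re` does here.

```python
import re

text = "a\nb"  # hypothetical two-line input, one character per line

# Anchors are zero-width: with (?m) both lines match and no newline is consumed.
caret_dot = re.findall(r"(?m)^.", text)
caret_dot_dollar = re.findall(r"(?m)^.$", text)

# Emulating "^" with an alternation still reaches both lines,
# but the second match now includes the newline it consumed.
emulated_start = re.findall(r"(?:^|\n).", text)

# Emulating "$" as well: the first match swallows the "\n",
# so nothing is left for the second line to start from.
emulated_both = re.findall(r"(?:^|\n).(?:\n|$)", text)

print(caret_dot)         # ['a', 'b']
print(caret_dot_dollar)  # ['a', 'b']
print(emulated_start)    # ['a', '\nb']
print(emulated_both)     # ['a\n']
```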

Thanks again for the clarifications.

cheers,
Andrew

armingrudd · May 6, 2019, 7:05am

This is because \n is the newline character, and it occurs exactly once between two consecutive lines. When you use (?:\n|$) without (?m), $ can only match at the end of the string, so the group consumes the \n after every line except the last one, where $ matches because there is no newline left.

Now for (?:^|\n).(?:\n|$) without (?m): as you mentioned, on the first line the pattern matches as ^.\n, because the string does not begin with \n and the first line does not end at the end of the string. That match consumes the \n in front of the second line, and ^ cannot match there without (?m), so the second line is skipped. The next match is on the third line (if it exists), where (?:^|\n) matches the \n that precedes it and (?:\n|$) matches either the following \n or the end of the string, whichever applies, and so on.
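A three-line input makes the skipping visible. The sketch below is Python, not the KNIME workflow from this thread; the input string and the helper name `spans` are invented for illustration, and it assumes the same matching rules apply in the node.

```python
import re

text = "a\nb\nc"  # hypothetical three-line input, one character per line

def spans(pattern, value):
    """List (start, end, matched text) for every non-overlapping match."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, value)]

# Without (?m): the first match eats the "\n" after "a", so "b" has neither
# "^" nor "\n" in front of it and is skipped; "\nc" matches with "$" at the end.
emulated = spans(r"(?:^|\n).(?:\n|$)", text)

# With (?m): the anchors are zero-width, so every line matches on its own.
anchored = spans(r"(?m)^.$", text)

print(emulated)  # [(0, 2, 'a\n'), (3, 5, '\nc')]
print(anchored)  # [(0, 1, 'a'), (2, 3, 'b'), (4, 5, 'c')]
```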

I hope everything is clear now.

Armin

acommons · May 6, 2019, 7:20am

I understand exactly why it behaves that way. I was just pointing out the difference in behaviour that makes (?m) necessary. I extended your little example to stress-test the various possibilities.
