Can someone explain what the Multiline option is meant to do in the Regex Split node?
I have a regex I’m using of the form (?s)^…regex stuff…$ to parse multiline input in a column. It works fine, does exactly what I want.
If I remove the (?s) from the regex and select the Multiline option in the Configuration instead, the regex no longer works.
If this is expected behaviour then what is the Multiline option actually supposed to do? From the documentation it looks like it is supposed to mimic (?s) but it doesn’t seem to do that!
In single-line mode, which is turned on with (?s), the dot “.” matches every character, including the newline character \n. Without it, the dot matches any character except line breaks.
In multiline mode, which is turned on with (?m), the caret “^” and dollar “$” match at the beginning and end of each line instead of only at the beginning and end of the whole string.
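A quick way to see the two flags side by side is the sketch below. It uses Python's re module purely for illustration (the node itself uses Java regex, but (?s) and (?m) behave the same way for an input like this), and the sample text is made up.

```python
import re

# Made-up two-line input standing in for a multiline cell.
text = "first line\nsecond line"

# No flags: "." stops at the newline, so ^.*$ cannot cover the whole text.
no_flag = re.fullmatch(r"^.*$", text)

# (?s): "." also matches "\n", so the whole text is a single match.
dotall = re.fullmatch(r"(?s)^.*$", text)

# (?m): ^ and $ anchor at every line, so each line is matched separately.
per_line = re.findall(r"(?m)^.*$", text)

print(no_flag)          # None
print(dotall.group(0))  # the full two-line text
print(per_line)         # ['first line', 'second line']
```

This is also why swapping (?s) for the Multiline option breaks a pattern of the form ^…$ that relies on the dot crossing line breaks.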
So the Multiline checkbox is inserting the (?m) option for the regex you supply?
cheers,
Andrew
Edit:
Here is the help for Multiline -
"Enables multiline mode, i.e. when selected the expression ^ and $ match the start and end of the input string. This option only matters if the input string have line breaks. "
I’ve just edited my previous reply but will repeat here.
Multiline help says "Enables multiline mode, i.e. when selected the expression ^ and $ match the start and end of the input string. This option only matters if the input string have line breaks. " which looks more like (?s) than (?m) to me…I think the reference to ‘string’ is what is confusing me. I’m reading it to mean the whole cell contents…
This is telling us that if the string contains no line breaks (it is a single line of text), the multiline option makes no difference: if enabled, the regex behaves as it normally does, and ^ and $ match the beginning and the end of the string (as the string has only a single line). But if the string has line breaks (it contains multiple lines of text), the option becomes useful, and ^ and $ match the beginning and the end of each line.
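To make that concrete, here is a small sketch, again in Python's re module as a stand-in for the node's Java regex, with made-up sample strings. It compares the same pattern with and without (?m) on a single-line and a two-line string.

```python
import re

single = "abc"        # no line breaks
multi = "abc\ndef"    # one line break

# Without line breaks, (?m) makes no difference.
single_plain = re.findall(r"^\w+$", single)
single_multiline = re.findall(r"(?m)^\w+$", single)

# With line breaks, ^ and $ only anchor per line when (?m) is on.
multi_plain = re.findall(r"^\w+$", multi)
multi_multiline = re.findall(r"(?m)^\w+$", multi)

print(single_plain, single_multiline)  # ['abc'] ['abc']
print(multi_plain)                     # []
print(multi_multiline)                 # ['abc', 'def']
```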
The caret and the dollar don’t consume the end of line when matching either.
So in your example the regex ^. matched both lines.
So does ^.$
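As a rough check of this, the sketch below runs both patterns with (?m) in Python's re module (used only for illustration). The input is an assumed stand-in for the example, two one-character lines, since the original data is not shown here.

```python
import re

# Assumed stand-in for the example: two lines of one character each.
text = "a\nb"

# ^. with (?m): one match at the start of each line.
starts = [(m.start(), m.group()) for m in re.finditer(r"(?m)^.", text)]

# ^.$ with (?m): each one-character line matches as a whole.
whole_lines = re.findall(r"(?m)^.$", text)

print(starts)       # [(0, 'a'), (2, 'b')]
print(whole_lines)  # ['a', 'b']
```

None of the matches contains the \n itself, which is what "don't consume the end of line" means in practice.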
But whilst (?:^|\n). without (?m) behaves the same as ^. with (?m), the same is not true when (?:^|\n).(?:\n|$) is used without (?m). In this last case the end-of-line match consumes the \n character, so the pattern fails to match on the second line.
This is because \n is the newline character and occurs exactly once between consecutive lines, so once a match has consumed it, it cannot also start the next match. When you use (?:\n|$) without (?m), $ can only match at the end of the string, so on every line but the last the group matches by consuming the \n; only on the last line, where there is no \n, does $ match.
Now regarding the combination (?:^|\n).(?:\n|$) without (?m): as you mentioned, on the first line the pattern matches as ^.\n, because there is no \n before the start of the string and $ does not match at the end of that line. The second line then has neither a ^ nor an unconsumed \n in front of it, so there is no match until the third line (if it exists), where (?:^|\n) matches the next \n (the one that begins the third line) and (?:\n|$) matches either the \n before the following line or the end of the string, whichever applies, and so on.
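The skipping pattern can be reproduced with a short sketch. It uses Python's re module for illustration only, and the inputs are made-up strings of one-character lines.

```python
import re

two_lines = "a\nb"
three_lines = "a\nb\nc"
pattern = r"(?:^|\n).(?:\n|$)"  # no (?m): the \n is consumed by the match

# The first match swallows the \n, so line 2 has nothing left to start from.
two_result = re.findall(pattern, two_lines)

# Line 3 is found again, via the \n that precedes it.
three_result = re.findall(pattern, three_lines)

# With (?m) the anchors consume nothing, so every line matches.
anchored = re.findall(r"(?m)^.$", three_lines)

print(two_result)    # ['a\n']
print(three_result)  # ['a\n', '\nc']
print(anchored)      # ['a', 'b', 'c']
```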
I understand exactly why it behaves that way. I was just pointing out the difference in behaviour which makes (?m) necessary. I extended your little example to stress test the various possibilities