Rule Engine Interprets escaped RegEx Line Breaks

mwiegand · October 19, 2022, 7:26pm

Hi there,

I might have stumbled across an issue in how escaped (single and double) RegEx line breaks “\n” are interpreted in all Rule Engine Filter / Splitter nodes. It may even affect all Rule Engine nodes, but it’s too late tonight to test every scenario.

Issue
When you have to work with data extracted from, e.g., PDFs, it can often contain line breaks. While I know of a workaround, it is an unnecessary extra step and, when working with large data sets, has significant performance implications.

Factoring in line breaks in a rule node like so $Content$ MATCHES "^0000 (.|\n)+ 007F$" => TRUE results in the issues shown in the screenshot below.

Here is the corresponding workflow:

Cheers & Good Night
Mike

thor · October 19, 2022, 7:55pm

This is a general “problem” of regular expressions (in Java). By default a regular expression only matches a single line, i.e. you cannot match line breaks because they will not be part of the input string. Matching in multiline strings must be explicitly enabled, and I’m pretty sure that none of the nodes do this.
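
To make the default behaviour concrete, here is a minimal sketch in plain Java of how java.util.regex treats line terminators. The class name LineBreakRegexDemo, the helper matchesWhole and the sample strings are made up for illustration. Whether the Rule Engine nodes compile their patterns with any flags, or honour the inline (?s) flag, is an assumption that this thread does not confirm.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it only exercises java.util.regex, not KNIME itself.
class LineBreakRegexDemo {

    // Pattern from the original post: '.' or a line feed, repeated.
    public static final String ALTERNATION = "^0000 (.|\\n)+ 007F$";

    // Same idea with '.' only, relying on a flag to cross line terminators.
    public static final String DOT_ONLY = "^0000 .+ 007F$";

    // Whole-input match, the way Matcher.matches() works.
    public static boolean matchesWhole(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Invented sample data with Unix (\n) and Windows (\r\n) line breaks.
        String unixText = "0000 first line\nsecond line 007F";
        String windowsText = "0000 first line\r\nsecond line 007F";

        // Without flags, '.' does not match any line terminator.
        System.out.println(matchesWhole(DOT_ONLY, 0, unixText));               // false

        // DOTALL lets '.' match line terminators as well.
        System.out.println(matchesWhole(DOT_ONLY, Pattern.DOTALL, unixText));  // true

        // The alternation covers \n explicitly, so Unix line breaks match.
        System.out.println(matchesWhole(ALTERNATION, 0, unixText));            // true

        // \r is matched neither by '.' nor by \n, so \r\n breaks the match.
        System.out.println(matchesWhole(ALTERNATION, 0, windowsText));         // false

        // The inline flag (?s) is equivalent to passing Pattern.DOTALL.
        System.out.println(matchesWhole("(?s)" + DOT_ONLY, 0, windowsText));   // true
    }
}
```

If the data extracted from the PDF contains Windows line endings, the fourth case could explain a failed match even though \n is escaped correctly. In that situation a pattern with the inline flag, such as "(?s)^0000 .+ 007F$", might be worth trying in the node.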

Adrix · October 19, 2022, 10:35pm

Hi,

I admit that I am probably not getting the main goal.

Do you want to capture the section title in the field content?
If so:

^\w*\s((\w+\W))\V.

Using this in the Regex Split node will solve it (at least for the sample provided).
