Text Mining - regex failing


I am trying to extract some information from several thousand scanned PDF documents, which I made searchable through OCR.

The two values (Evd and T) I am trying to capture appear somewhere within the documents but are always near each other. Some documents contain several pairs of the two values.

Typical lines from my files could look something like this:

  • CASE 1: Evd, then T
    "Evd: 10,9 MN/rn2 (T.: 0,9 m u. SOK}
    "Evd: 13,3 MN/m2 (t: 1,2 m u. SOK)
    "Ev: 17,9 MN/m2 (T.:0, 9 m. u. SOK)"
    "Ev d:  18,6MNlm'(t:1.0 mu.SOK)"
  • CASE 2: T, then Evd (line breaks in between?)
    "(T.: 0,6 m u. SOK): Ev : 36,8 MN/m²"
    "(t : ca. 0,9 m u. SOK): Evd: 17,10 MN/m2"
    "(t : 1.35 m u. SOK):   Evd  : 55.8 MN/m2"
    "(t : 0,65 m u. SOK):   Evd  : 8,4 MN/m2"

So far I have managed to create a successful workflow for CASE 1:

[PDF Parser] --> [Document Data Extractor] --> [Sentence Extractor] --> [Regex Split]

Regex Split uses the following input (with multiline and ignore case activated):
.*\s*Ev[d: ]*\s*(\d+[,. ]+\d+).*\s*[Tt!I1]+[:. ]*(\d+[,. ]+\d+).*

which gives good results. The character classes in square brackets have been introduced to compensate for bad text recognition, where a "T" gets recognized as a "t", "!", "I" or "1".
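To check the CASE 1 expression outside the workflow, here is a minimal sketch in Python. It rests on two assumptions that are not stated in the post: that Python's `re` module behaves like the Java regex engine for this particular pattern, and that Regex Split requires the whole input string to match (modelled here with `fullmatch`). The function name `extract_case1` is made up for illustration.

```python
import re

# CASE 1 pattern from the post, with "ignore case" and "multiline" switched on.
CASE1 = re.compile(
    r".*\s*Ev[d: ]*\s*(\d+[,. ]+\d+).*\s*[Tt!I1]+[:. ]*(\d+[,. ]+\d+).*",
    re.IGNORECASE | re.MULTILINE,
)


def extract_case1(line):
    """Return the raw (Evd, T) strings, or None if the whole line does not match."""
    m = CASE1.fullmatch(line)
    return m.groups() if m else None


samples = [
    "Evd: 13,3 MN/m2 (t: 1,2 m u. SOK)",      # CASE 1, clean OCR
    "Ev d:  18,6MNlm'(t:1.0 mu.SOK)",         # CASE 1, noisy OCR
    "(T.: 0,6 m u. SOK): Ev : 36,8 MN/m²",    # CASE 2: T comes first
]

for s in samples:
    print(s, "->", extract_case1(s))
```

The first two samples yield a pair of values, while the CASE 2 line yields `None`, because the expression expects the T value to follow the Evd value.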

Now here's the challenge:

I have not managed to create a regular expression for CASE 2: In the original documents there are separate lines for the T-value and the Evd-value. The [Sentence Extractor] makes two different sentences out of the values that I want to search for, so the Regex Split doesn't work. I've tried to introduce a [String Replacer] and replace all "\r\n|\n|\r" with "", which doesn't help.

If I try to use the [Regex Split] directly on the Document Body text of the [Document Data Extractor] using the above regex, I get the warning:

WARN  Regex Split          2:37       1731 input string(s) did not match the pattern or contained more groups than expected

Is there a possibility to use the regex on the whole text of the document? Then a regex like this could work:
"\(T.*?[ .:]*(\d+[,.]+\d+).*\).*Ev.*?(\d+[,.]+\d+).*"
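One thing worth checking for the whole-text approach: in Java regular expressions (as in Python), `.` does not match a line break unless the "dotall" flag is set, and the "multiline" flag only changes the meaning of `^` and `$`. The sketch below uses Python as a stand-in for the node and shows the pattern from the post failing on text that contains line breaks, followed by a search-based alternative that collects every T/Evd pair. The sample `body`, the tightened pattern `PAIR` with its 40-character gap limit, and the name `find_pairs` are illustrative assumptions, not part of the original workflow.

```python
import re

# Made-up document body assembled from lines quoted in the post;
# the T line and the Evd line are separated by line breaks.
body = (
    "(T.: 0,6 m u. SOK):\n"
    "Ev : 36,8 MN/m²\n"
    "Some unrelated text\n"
    "(t : ca. 0,9 m u. SOK):\r\n"
    "Evd: 17,10 MN/m2\n"
)

# Pattern from the post, applied to the whole string without dotall:
# "." stops at "\n", so the match cannot cross a line break.
WHOLE = re.compile(r"\(T.*?[ .:]*(\d+[,.]+\d+).*\).*Ev.*?(\d+[,.]+\d+).*")
print(WHOLE.fullmatch(body))  # prints None

# Search-based alternative: find each pair instead of matching the whole text.
PAIR = re.compile(
    r"\(\s*[Tt!I1]\s*[.:]*\s*(?:ca\.\s*)?(\d+\s*[,.]\s*\d+)"  # depth T
    r".{0,40}?"                                               # short gap, may cross a line break
    r"Ev\s*d?\s*:?\s*(\d+\s*[,.]\s*\d+)",                     # Evd value
    re.DOTALL,
)


def find_pairs(text):
    """Return a list of raw (T, Evd) string pairs found anywhere in the text."""
    return [m.groups() for m in PAIR.finditer(text)]


print(find_pairs(body))
```

Searching for pairs rather than matching the entire document also handles documents that contain several pairs, which a single whole-string match with two groups cannot return.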

Any help is highly appreciated! 

Regards, Erik

Hi Erik,

that is indeed a bit tricky. First a bit of background: sentence tokenization is done by the Strings to Document node. After that node, sentence boundaries will not change. This means that removing whitespace or \n characters afterwards will not change the tokenization. To affect the tokenization, you need to remove the \n characters from the string value before using the Strings to Document node.
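As a rough illustration of that preprocessing step, the following Python sketch flattens all common line-break styles in a plain string before any tokenization happens. It is only a stand-in for whatever string node does this in the workflow; the function name `flatten_line_breaks` is hypothetical. Note that it covers "\r\n" and a lone "\r" as well as "\n", since OCR output may contain any of them.

```python
import re


def flatten_line_breaks(text):
    """Replace every kind of line break with a space, then collapse repeated blanks."""
    text = re.sub(r"\r\n|\r|\n", " ", text)
    return re.sub(r"[ \t]+", " ", text).strip()


raw = "(t : 1.35 m u. SOK):\r\n  Evd  : 55.8 MN/m2\n"
print(flatten_line_breaks(raw))
# prints: (t : 1.35 m u. SOK): Evd : 55.8 MN/m2
```

After this step the T value and the Evd value sit on one line, so a sentence tokenizer or a single-line regex sees them together.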

Using the regex directly on the document body text string should work. The Document Data Extractor extracts the text as a string column. Does the Regex Split node fail completely, or does it only throw a warning? According to the warning, I would assume that the number of groups created by the regex is very high or does not match the number of groups specified in the dialog.

Cheers, Kilian


Hi Kilian,


thank you very much for your answer, which I only discovered recently.

Unfortunately, I still don't get any results for my regex when trying to use the [RegexSplit] directly on the DocumentBodyText. The [RegexSplit] only works if I use it after the [SentenceExtractor] node. But the Sentence Extractor causes problems when it splits the two values I need to find into two separate sentences.


I tried what you suggested and created a document without line breaks:

[PDF Parser] --> [Document Data Extractor] --> [String Manipulation] (deleting all line breaks: regexReplace($Document body text$,"\n"," ")) --> [Strings to Document]


Still, I cannot use the [RegexSplit] directly on the DocumentBodyText. Now a strange thing happens: I use the [Row Splitter] after the [Sentence Extractor] to keep only the rows that include the values I am looking for, using pattern matching (regular expression: ".*Ev.*").

The [Row Splitter] generates proper results by finding ~1400 rows. But when I use the [Regex Split] directly afterwards, using the exact same regex that I used before

"\(T.*?[ .:]*(\d+[,.]+\d+).*\).*Ev.*?(\d+[,.]+\d+).*"

it finds only very few results (21 of ~1400)?
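A possible way to narrow this down is to compare how strict the two expressions are on a handful of rows. The sketch below is in Python and assumes that both nodes require the whole row to match; the sample rows come from the lines quoted earlier, and `count_matches` is a hypothetical helper. The row filter ".*Ev.*" accepts any row containing "Ev", whereas the CASE 2 expression additionally needs "(T" at the very start and a T value before the Evd value.

```python
import re

ROW_FILTER = r".*Ev.*"
CASE2 = r"\(T.*?[ .:]*(\d+[,.]+\d+).*\).*Ev.*?(\d+[,.]+\d+).*"

rows = [
    "(T.: 0,6 m u. SOK): Ev : 36,8 MN/m²",        # CASE 2, upper-case T
    "(t : ca. 0,9 m u. SOK): Evd: 17,10 MN/m2",   # CASE 2, lower-case t
    "Evd: 13,3 MN/m2 (t: 1,2 m u. SOK)",          # CASE 1 order
    "Evd: 17,10 MN/m2",                           # Evd sentence split from its T
]


def count_matches(pattern, rows, flags=0):
    """Count the rows that the pattern matches in full."""
    rx = re.compile(pattern, flags)
    return sum(1 for row in rows if rx.fullmatch(row))


print(count_matches(ROW_FILTER, rows))            # rows containing "Ev"
print(count_matches(CASE2, rows))                 # strict, case-sensitive
print(count_matches(CASE2, rows, re.IGNORECASE))  # strict, ignore case
```

On these four rows the filter keeps all of them, the case-sensitive CASE 2 expression matches one, and the case-insensitive version matches two, so a large gap between the two node outputs is not surprising in itself.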


Should I post my workflow for clarification?