I am trying to extract some information from several thousand scanned PDF documents, which I made searchable through OCR.
The two values (Evd and T) I am trying to capture can appear anywhere within a document but are always near each other. Some documents contain several pairs of the two values.
Typical lines from my files could look something like this:
CASE 1: Evd, then T
"Evd: 10,9 MN/rn2 (T.: 0,9 m u. SOK}"
"Evd: 13,3 MN/m2 (t: 1,2 m u. SOK)"
"Ev: 17,9 MN/m2 (T.:0, 9 m. u. SOK)"
"Ev d: 18,6MNlm'(t:1.0 mu.SOK)"
CASE 2: T, then Evd (possibly with line breaks in between)
"(T.: 0,6 m u. SOK): Ev : 36,8 MN/m²"
"(t : ca. 0,9 m u. SOK): Evd: 17,10 MN/m2"
"(t : 1.35 m u. SOK): Evd : 55.8 MN/m2"
"(t : 0,65 m u. SOK): Evd : 8,4 MN/m2"
So far I have managed to create a successful workflow for CASE 1:
[PDF Parser] --> [Document Data Extractor] --> [Sentence Extractor] --> [Regex Split]
Regex Split uses the following input (with multiline and ignore case activated):
.*\s*Ev[d: ]*\s*(\d+[,. ]+\d+).*\s*[Tt!I1]+[:. ]*(\d+[,. ]+\d+).*
which gives good results. The alternatives in the character classes compensate for bad text recognition, where a "T" gets recognized as a "t", "!", "I" or "1".
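To make the behaviour of this pattern easier to check outside the workflow, here is a minimal Python sketch that applies the same regex to the four CASE 1 sample lines. The helper names `to_float` and `parse_case1` are made up for this illustration, and Python's `re` module is only a stand-in for whatever regex engine the Regex Split node uses (presumably Java's); the pattern itself uses only syntax that both engines share.

```python
import re

# The CASE 1 pattern from the post, unchanged. IGNORECASE mirrors the
# "ignore case" option; the pattern has no ^ or $, so multiline is not needed here.
CASE1 = re.compile(
    r".*\s*Ev[d: ]*\s*(\d+[,. ]+\d+).*\s*[Tt!I1]+[:. ]*(\d+[,. ]+\d+).*",
    re.IGNORECASE,
)

def to_float(raw):
    """Hypothetical helper: turn an OCR'd number like '0, 9' or '1.0' into a float."""
    return float(raw.replace(" ", "").replace(",", "."))

def parse_case1(line):
    """Return (Evd, T) if the whole line matches the CASE 1 pattern, else None."""
    m = CASE1.fullmatch(line)
    if m is None:
        return None
    return to_float(m.group(1)), to_float(m.group(2))

samples = [
    "Evd: 10,9 MN/rn2 (T.: 0,9 m u. SOK}",
    "Evd: 13,3 MN/m2 (t: 1,2 m u. SOK)",
    "Ev: 17,9 MN/m2 (T.:0, 9 m. u. SOK)",
    "Ev d: 18,6MNlm'(t:1.0 mu.SOK)",
]

for line in samples:
    print(parse_case1(line))
```

A CASE 2 line such as "(T.: 0,6 m u. SOK): Ev : 36,8 MN/m²" returns None here, because the pattern expects the T value to come after the Evd value.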
Now here's the challenge:
I have not managed to create a regular expression for CASE 2. In the original documents, the T value and the Evd value are on separate lines. The [Sentence Extractor] turns the two values I want to search for into two different sentences, so the [Regex Split] cannot match them together. I have tried adding a [String Replacer] that replaces all matches of "\r\n|\n|\r" with "", but that doesn't help.
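To illustrate what CASE 2 would need, here is a rough Python sketch that searches a whole text (not single sentences) for pairs in either order. It is only a sketch under assumptions: the names `NUM`, `EVD`, `T`, `GAP`, `PAIR` and `find_pairs` are made up, the limit of 40 characters between the two values is a guess, and the T pattern assumes a colon always follows the letter, as in the samples above. I have not verified how this would map onto the Regex Split node.

```python
import re

NUM = r"\d+\s*[.,]\s*\d+"                        # e.g. "0,9", "1.35", "0, 9"
EVD = rf"Ev\s*d?\s*:?\s*({NUM})"                 # "Evd:", "Ev :", "Ev d:"
T = rf"[Tt!I1]\s*\.?\s*:\s*(?:ca\.\s*)?({NUM})"  # "T.:", "t :", "t : ca."
GAP = r".{0,40}?"                                # assumed maximum distance

# DOTALL lets "." cross line breaks, so a pair may span two lines.
# First alternative: T then Evd (CASE 2); second: Evd then T (CASE 1).
PAIR = re.compile(rf"{T}{GAP}{EVD}|{EVD}{GAP}{T}", re.DOTALL)

def to_float(raw):
    """Strip OCR whitespace and use '.' as the decimal separator."""
    return float(re.sub(r"\s+", "", raw).replace(",", "."))

def find_pairs(text):
    """Return a list of (Evd, T) tuples found anywhere in the text."""
    pairs = []
    for m in PAIR.finditer(text):
        if m.group(1) is not None:      # T came first: groups 1 (T), 2 (Evd)
            t, evd = m.group(1), m.group(2)
        else:                           # Evd came first: groups 3 (Evd), 4 (T)
            evd, t = m.group(3), m.group(4)
        pairs.append((to_float(evd), to_float(t)))
    return pairs

case2_text = (
    "(T.: 0,6 m u. SOK):\nEv : 36,8 MN/m²\n"
    "(t : ca. 0,9 m u. SOK):\nEvd: 17,10 MN/m2\n"
)
case1_text = (
    "Evd: 10,9 MN/rn2 (T.: 0,9 m u. SOK}\n"
    "Ev d: 18,6MNlm'(t:1.0 mu.SOK)\n"
)

print(find_pairs(case2_text))
print(find_pairs(case1_text))
```

Because `finditer` scans from left to right and its matches do not overlap, each value is used in at most one pair, which keeps an Evd value from being paired with the T value of the following line in the sample texts.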
If I try to use the [Regex Split] directly on the Document Body text from the [Document Data Extractor], using the CASE 1 regex above, I get the warning:
WARN Regex Split 2:37 1731 input string(s) did not match the pattern or contained more groups than expected
Is it possible to apply the regex to the whole text of the document? Then a regex like this could work:
Any help is highly appreciated!