Document split into sub-documents based on regular expression

b0raas · November 9, 2021, 4:53pm

I have an example document which contains roughly 40 sub-documents to extract.

Each of these sub-documents starts with text matching a regular expression such as “"Der Standard" vom \d\d.\d\d.\d\d\d\d” and ends with text matching an expression such as “von \d\d.\d\d.\d\d bis \d\d.\d\d.\d\d”.

I have started with the following workflow: PDF Parser → REGEX Split.

The pattern I am using is: "Der Standard" vom \d\d.\d\d.\d\d\d\d(.*)von \d\d.\d\d.\d\d

However, the split contains multiple results. The warning is: “input string(s) did not match the pattern or contained more groups than expected”

elsamuel · November 9, 2021, 9:38pm

My view is that the RegEx Split node is best used with simple strings. How did you construct and check your RegEx? You’re trying to handle a multiline string, and the (.*) in your RegEx does not match newline characters by default.
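To illustrate the newline point, here is a minimal sketch in Python. Python's re module is only a stand-in here for the Java regex engine that KNIME builds on; both treat . as "any character except a line break" unless a dot-all flag is set. The sample text is made up, since the real PDF content is not shown in this thread.

```python
import re

# Hypothetical sample text; the real document is not available.
text = '"Der Standard" vom 09.11.2021\nArticle body\nvon 01.11.21 bis 09.11.21'

# The pattern from the original question, unchanged.
pattern = r'"Der Standard" vom \d\d.\d\d.\d\d\d\d(.*)von \d\d.\d\d.\d\d'

# Without a dot-all flag, (.*) cannot cross the line breaks, so nothing matches.
no_flag = re.search(pattern, text)

# With DOTALL, . also matches line breaks and the group captures the body.
with_dotall = re.search(pattern, text, flags=re.DOTALL)

print(no_flag)                      # None
print(repr(with_dotall.group(1)))   # '\nArticle body\n'
```

Whether the same flag is available in a given KNIME node depends on that node's configuration options, so check the node dialog rather than relying on this sketch.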

I think a more user-friendly way to approach this would be to use the Tika Parser node to parse the PDF, then filter out all unnecessary columns.

At this point I’d use the Regex Extractor node to do the splitting, using this expression:

"Der Standard" vom \d\d.\d\d.\d\d\d\d[\s\S]*?von \d\d.\d\d.\d\d

The [\s\S]*? part matches any character, including newlines, and because it is lazy it stops at the first ending expression it finds.
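The behaviour of that expression can be checked outside KNIME with a short Python sketch. This is not what the Regex Extractor node does internally; it only applies the same expression to an invented two-article sample, and the variable names are made up for the example.

```python
import re

# Hypothetical sample with two sub-documents; real content is not shown in the thread.
text = (
    'Header\n'
    '"Der Standard" vom 09.11.2021\nFirst article\nvon 01.11.21 bis 09.11.21\n'
    '"Der Standard" vom 08.11.2021\nSecond article\nvon 01.11.21 bis 08.11.21\n'
)

# Expression from the post above: [\s\S] matches any character, *? is lazy.
lazy = r'"Der Standard" vom \d\d.\d\d.\d\d\d\d[\s\S]*?von \d\d.\d\d.\d\d'

# Same expression with a greedy quantifier, for comparison only.
greedy = r'"Der Standard" vom \d\d.\d\d.\d\d\d\d[\s\S]*von \d\d.\d\d.\d\d'

sub_documents = re.findall(lazy, text)    # one match per sub-document
merged = re.findall(greedy, text)         # everything swallowed into one match

print(len(sub_documents))  # 2
print(len(merged))         # 1
```

Note that each match ends at the start date of the closing line ("von 01.11.21"), so the trailing "bis ..." part is not included unless the expression is extended. The unescaped . between the digits also matches any character; writing \. would be stricter.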

That gives the 20 sub-documents, each in its own row.

b0raas · November 10, 2021, 8:08am

I was not aware of the “Regex Extractor Node”. Which extension is it? I could not find it with my query on KNIME hub.

qqilihq · November 10, 2021, 8:45am

“I could not find it with my query on KNIME hub.”

You’ll find it with NodePit: https://nodepit.com/?q=regex+extractor

It’s part of the Palladian Extension.

–Philipp
