Document split into sub-documents based on regular expression

I have an example document which contains roughly 40 sub-documents to extract.

Each sub-document begins with text matching a pattern such as “Der Standard” vom \d\d.\d\d.\d\d\d\d and ends with text matching a pattern such as von \d\d.\d\d.\d\d bis \d\d.\d\d.\d\d.

I have started with the following workflow: PDF Parser → REGEX Split.

The Pattern I am using is: “Der Standard” vom \d\d.\d\d.\d\d\d\d(.*)von \d\d.\d\d.\d\d

However, the split contains multiple results. The warning is: “input string(s) did not match the pattern or contained more groups than expected”

My view is that the RegEx Split node is best used with simple strings. How did you construct and check your RegEx? You’re trying to handle a multiline string, and the (.*) in your pattern does not match across line breaks.
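To illustrate the newline problem outside of KNIME, here is a minimal Python sketch. The sample text is made up, the dots in the dates are escaped (the original pattern leaves them unescaped), and Python's re module stands in for the Java regex engine that KNIME uses; both treat . the same way by default.

```python
import re

# Hypothetical sample standing in for text parsed from the PDF.
text = '"Der Standard" vom 01.02.2020\nBody text\nvon 01.02.20 bis 02.02.20'

# The original pattern, with the literal dots escaped.
pattern = r'"Der Standard" vom \d\d\.\d\d\.\d\d\d\d(.*)von \d\d\.\d\d\.\d\d'

# By default "." does not match a newline, so the pattern cannot span lines.
no_match = re.search(pattern, text)

# With DOTALL, "." also matches newlines and the group captures the body.
dotall_match = re.search(pattern, text, flags=re.DOTALL)

print(no_match)
print(repr(dotall_match.group(1)))
```

The first search finds nothing because the body between the two markers contains line breaks.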

I think a more user-friendly way to approach this would be to use the Tika Parser node to parse the pdf then filter out all unnecessary columns.

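At this point I’d use the Regex Extractor node to do the splitting using the expression: "Der Standard" vom \d\d.\d\d.\d\d\d\d[\s\S]*?von \d\d.\d\d.\d\d. The [\s\S]*? part also matches newline characters, and because it is lazy it stops at the first closing expression instead of running on to the last one.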
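As a rough check of that expression outside of KNIME, the sketch below runs it over a made-up text with two sub-documents. The sample data and variable names are assumptions for illustration, the dots are escaped, and Python's findall only approximates what the Regex Extractor node does.

```python
import re

# Hypothetical text containing two sub-documents.
text = (
    '"Der Standard" vom 01.02.2020\nFirst article\nvon 01.02.20 bis 02.02.20\n'
    '"Der Standard" vom 03.02.2020\nSecond article\nvon 03.02.20 bis 04.02.20\n'
)

# [\s\S] matches any character including newlines; *? makes the match lazy,
# so each match ends at the nearest closing "von dd.dd.dd".
pattern = re.compile(
    r'"Der Standard" vom \d\d\.\d\d\.\d\d\d\d[\s\S]*?von \d\d\.\d\d\.\d\d'
)

sub_documents = pattern.findall(text)

for number, sub_document in enumerate(sub_documents, start=1):
    print(number, repr(sub_document))
```

With a greedy [\s\S]* instead, the whole sample would come back as a single match running from the first header to the last closing date.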

And here are the 20 sub documents, each in its own row:

I was not aware of the “Regex Extractor Node”. Which extension is it? I could not find it with my query on KNIME hub.

You’ll find it with NodePit: https://nodepit.com/?q=regex+extractor

It’s part of the Palladian extension.

–Philipp
