Analyse PDFs

@rvissers,

I think the best course of action here is Regex, because this is a complex problem that String Matcher may not be suited for.

You can use this workflow as an example of PDF → Regex → Extraction.

Also, please go through the links I sent above. This is quite a complex case, and building the extraction rules will take a good amount of time and effort.

For instance, to extract ISAE XXXX type N from each of your PDFs, try using the regex:
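
As a rough sketch of what such a pattern could look like (not necessarily the exact one to use), assume the parsed text contains strings such as "ISAE 3402 Type II" or "ISAE 3000 type 1". The Python snippet below is only an illustration: `ISAE_PATTERN`, `extract_isae`, and the sample sentence are made-up names and data, and the pattern would likely need adjusting to your actual PDFs before you use it in a KNIME regex node.

```python
import re

# Hypothetical pattern: "ISAE", a 4-digit standard number, the word "type",
# then a type given as a Roman numeral (I / II) or a digit (1 / 2).
# \s also matches line breaks, which often appear in text parsed from PDFs.
ISAE_PATTERN = re.compile(r"ISAE\s*(\d{4})\s+type\s+(II|I|[12])\b", re.IGNORECASE)


def extract_isae(text):
    """Return a list of (standard_number, type) tuples found in the text."""
    return [(m.group(1), m.group(2).upper()) for m in ISAE_PATTERN.finditer(text)]


# Made-up sample text standing in for the output of a PDF parser.
sample = "This report follows ISAE 3402 Type II. An earlier ISAE 3000 type 1 report exists."
print(extract_isae(sample))
```

The two capture groups keep the standard number and the type separate, so each can end up in its own column after extraction.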

Here are general rules for Regex as well:

And of course, please see the examples of regex I posted above.

If you find one particular element is hard to extract, let us know and we can provide some expertise there as well. Many people ask about Regex because it is such a powerful tool for extraction within KNIME for exactly these kinds of problems.

Just to give you an idea of how complex PDF extraction can be, have a look at a Just KNIME It challenge we did:

And see community solutions as well.
