Analyse PDFs

Hello @rvissers

Welcome to the KNIME forum!

Do you happen to have an example PDF? The actual one, a dummy, an anonymized version, etc. You mention quite a few issues that are tricky to judge purely from the symptoms you briefly describe, without a feel for what you are dealing with.

What does your workflow look like? Which nodes are you using for each step, and with which settings?

To illustrate, you mention the desire to extract date ranges which could be of different formats. A small test bench:

I load two dummy PDFs through the PDF Parser, as you do, each containing a date range in the text. They differ in both the date format (M vs MM) and the range descriptor (until vs to). By applying a regex that can capture both versions, I’m able to extract this information.

Again, whether this is a feasible solution really depends on the specifics of your use case.
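
For readers who want to try the idea outside KNIME, here is a minimal sketch in Python. The two sample sentences and the day/month/year layout with slashes are assumptions made for illustration; the actual dummy PDFs and the regex used in the test bench are not shown in this thread.

```python
import re

# Hypothetical sentences standing in for the text of the two dummy PDFs.
texts = [
    "The audit period runs from 1/1/2022 until 31/12/2022.",
    "The audit period runs from 01/01/2022 to 31/12/2022.",
]

# Assumed date layout: one or two digits for day and month, four for the year.
DATE = r"\d{1,2}/\d{1,2}/\d{4}"

# Accept either range descriptor ("until" or "to") between the two dates.
DATE_RANGE = re.compile(rf"({DATE})\s+(?:until|to)\s+({DATE})")


def extract_range(text):
    """Return (start, end) as strings, or None when no date range is found."""
    match = DATE_RANGE.search(text)
    return match.groups() if match else None


for text in texts:
    print(extract_range(text))
```

The single-digit and two-digit variants are both covered by `\d{1,2}`, and the alternation `(?:until|to)` covers both descriptors, so one pattern handles both documents.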


Hello, we recently had a PDF extraction event via Data Connect. The slides can be found here.

Also, PDFs are notoriously difficult (hence the event), so we would need sample PDFs and your workflow to diagnose any issues.

For regex, I have created the following: Various Examples of Regex in KNIME – KNIME Hub

For PDFs, you may also find that the Tika Parser is better for extraction (but it depends on how/what you want to extract).

Please follow up with very specific examples and screenshots and sample PDFs, so we can provide the best possible answers. Thank you!


Hi Victor,

Thanks for your reply. I attached a dummy document (note that I can’t upload PDFs, but PDFs are what I have to analyse).
In addition, I added an Excel overview of our analyses of different terms across multiple documents, which lists the search terms. In the report itself (Word doc) I recreated the structure we normally see when receiving this kind of report.

As we are new to KNIME, we find it very difficult to determine which logic we need, which nodes are relevant, etc. We have looked at quite a few tutorials but can’t find the proper solution. I also included a screenshot of the workflow we have created so far. Not much progress, as you can see.

Is there anyone who can help us out?

Analysis SOC Reports _Dummy data and expected outcomes.xlsx (15.7 KB)
DUMMY Document PDF.docx (22.3 KB)

Hi Arjen,

Just replied to Victor. Perhaps you also might have an idea?

Thanks.

@victor_palacios

I just sent a response to your reply. I am not sure whether it was visible to you, so here is a notification.

@rvissers ,

I think the best course of action here is Regex because this is a complex problem which String Matcher may not be suited for.

You can use this workflow as an example of PDF → Regex → Extraction.

Also, please go through the links I sent above. This is quite a complex case and will require a good amount of time and effort to build rules for extraction.

For instance, to extract ISAE XXXX type N from each of your PDFs, try using the regex:

Here are general rules for Regex as well:

And of course, please see the examples of regex I posted above.

If you find one particular element is hard to extract, let us know and we can provide some expertise there as well. Many people ask about Regex because it is such a powerful tool for extraction within KNIME for exactly these kinds of problems.

Just to give you an idea of how complex PDF extraction can be, have a look at a Just KNIME It challenge we did:

And see community solutions as well.


@victor_palacios,

Thanks for sharing. I will have a look!
I already checked the regex you provided. However, something does not seem to work: it only replaces isae 3402 type 2. None of the other variants are matched.

Formula:
regexReplace($lowercase_Searchterms$,"(.)(isae\s?[0-9]+\stype\s[A-Z0-9])(.)","123")

Output (as you can see, only one item has changed to 123):

That’s correct. The regex I provided can’t capture those examples.

My intention was to show you a starting point and then you modify it to your needs.

This regex:

regexReplace($lowercase_Searchterms$,"(.)(isae\s?[0-9]+\stype\s[A-Z0-9])(.)","123")

can’t capture the other instances because it requires the word “type” to occur and allows only a single character (an uppercase letter or a digit) after “type”.

In this case, you need the ? quantifier, which means “may or may not occur”, and you will also need [a-z0-9]{1,2}, which matches one or two lowercase letters or digits.

regexReplace($lowercase_Searchterms$,"(isae\\s?[0-9a-z]+(\\stype\\s[a-z0-9]{1,2})?)", "123")

Please note: in your regexReplace, use 2 backslashes (not 1 or 3).
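
To see the effect of the optional group outside KNIME, here is a small sketch in Python using the same pattern. The list of search terms is hypothetical, since the real column values are only visible in the screenshot. Python raw strings take a single backslash where the String Manipulation expression above uses two.

```python
import re

# Same pattern as in the regexReplace call above, written as a Python raw string.
# The trailing "?" makes the whole " type X" group optional.
PATTERN = re.compile(r"isae\s?[0-9a-z]+(\stype\s[a-z0-9]{1,2})?")

# Hypothetical lowercase search terms, for illustration only.
terms = [
    "isae 3402 type 2",   # has "type" and a single digit
    "isae 3000",          # no "type" part at all
    "isae3402 type ii",   # no space after "isae", two letters after "type"
    "soc 2 type 2",       # not an ISAE reference, should stay unchanged
]

replaced = [PATTERN.sub("123", term) for term in terms]

for before, after in zip(terms, replaced):
    print(f"{before!r} -> {after!r}")
```

The first three terms are replaced as a whole, with or without the “type” suffix, while the last one is left alone because it never contains “isae”.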


@victor_palacios thanks for sharing! It works indeed!

I am already able to collect and visualize quite a lot of information.
The main challenge remains extracting the values from the tables within the reports.
I used the examples you provided, but it seems that KNIME reads the data from left to right. Example: if a table contains three columns, KNIME does not detect “relevant exceptions noted” as one sentence but splits it onto a new row, which makes it impossible to detect whether the phrase “relevant exceptions noted” exists in the table. Also, the Tika Parser / PDF Parser does not always recognize the text within the tables.

Column1: Column 2: Column 3
Test1 Test 2 Relevant
ab exceptions noted
abc

It may not directly solve your task, but it may help to identify surrounding terms.

Use the Column Combiner node to deal with this.

What text within the tables? If you’re using image-based PDFs, text within tables is very hard to read even with paid software.

@victor_palacios they are text-based PDFs. Each document contains tables. However, these all differ in content and size: some have three columns, others two or even four. Also, in most cases there are merged cells before the description, test activities, and test results.

So the issue is not KNIME itself but, more generally, reading the PDF tables correctly?
br

I don’t know whether it is purely PDF-related or due to the way KNIME handles PDFs.

We are able to search for and extract all other relevant data that is not stored within tables.

Anything stored within tables gets mixed up, since not all relevant sentences are written on one line.

In the document I uploaded earlier you will find an example (Word version). If the sentence “no relevant exceptions noted” is separated because of the cell width, KNIME won’t recognize it when searching for it.

Also, this is just an example table; the tables may differ each time in content (apart from some standard terms) and layout.
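
One possible way to search for a phrase that has been broken up by row-wise table extraction is a tolerant pattern that allows line breaks and a few stray tokens between the words. The sketch below is in Python and uses the example rows posted earlier in this thread; the helper name `phrase_pattern` and the `max_gap` limit are made up for illustration, and a tolerant match like this can produce false positives on real reports.

```python
import re


def phrase_pattern(phrase, max_gap=2):
    """Build a regex that matches the words of `phrase` in order, allowing
    any whitespace (including line breaks) and up to `max_gap` unrelated
    tokens between consecutive words."""
    words = [re.escape(word) for word in phrase.split()]
    # Lazily skip 0..max_gap tokens, then require whitespace before the next word.
    gap = rf"(?:\s+\S+){{0,{max_gap}}}?\s+"
    return re.compile(gap.join(words), re.IGNORECASE)


# Text as it comes out of the parser in the example above: the cell content
# of column 3 is interleaved with content from column 1.
extracted = (
    "Column1: Column 2: Column 3\n"
    "Test1 Test 2 Relevant\n"
    "ab exceptions noted\n"
    "abc"
)

pattern = phrase_pattern("relevant exceptions noted")
found = pattern.search(extracted) is not None
print(found)
```

A plain substring search fails here because “ab” from the first column sits between “Relevant” and “exceptions”. The same idea could be tried in a KNIME regex, but that is untested here.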

Have you tried other PDF Parser/Extractors yet?

I have only used the Tika Parser and the PDF Parser so far.


Can someone help me remove the line breaks after a., b. and c.?
I tried to do so with both the Cell Splitter and String Manipulation.
Cell Splitter: delimiter: "\n"
String Manipulation: regexReplace($Text$,"(.)\n(.)","")
Note that I did use two backslashes each time.

Hi again,

Could you let me know if String Manipulation - the strip() function - works for you?

Perhaps I am not using it correctly, but what should the expression look like when using the Split node and regexReplace?

strip(regexReplace($Text$,"(.)\n(.)",""))
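
As a rough illustration of the goal, here is a Python sketch that joins a list marker such as “a.” with the line that follows it. The sample text is hypothetical, since the real data is only visible in the screenshot, and the pattern assumes each marker is a single lowercase letter plus a period at the start of a line. Note that a pattern which captures the characters on both sides of the line break, as written above, and replaces the whole match with an empty string would also delete those characters; putting the captured group back in the replacement avoids that.

```python
import re

# Hypothetical text with a line break after each list marker.
sample = "a.\nFirst item\nb.\nSecond item\nc.\nThird item"


def join_markers(text):
    """Replace the line break after a marker like 'a.' with a single space.

    (?m) lets ^ match at the start of every line; the marker is captured
    and written back via \\1 so it is not lost in the replacement.
    """
    joined = re.sub(r"(?m)^([a-z]\.)[ \t]*\r?\n[ \t]*", r"\1 ", text)
    # Trim leading/trailing whitespace, similar in spirit to strip() above.
    return joined.strip()


print(join_markers(sample))
```

Line breaks between the items themselves are kept, so each marker ends up on the same line as its text.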