Can this be done?

ovivojh · January 5, 2017, 5:54pm

KNIME newbie here. Is the following possible to do?

I have a large number of PDF documents that are all written in a similar way. Each document spans multiple pages, is broken up into sections, and has generally static section identifiers. (PDF Parser works well to get them into KNIME.)

Ex: "Facility Information". I would like to be able to extract the body of text in that section. I have tried Bag of Words to get it into a word-by-word form, but I am struggling with how to say: capture all words from rows #-#, beginning where Row = "Facility Information" or "Facility Description" and stopping before a specific marker is reached (Ex: a section number "#.#" or, if possible, a larger-font/bolded word signaling a new section).
I also need to extract an ID number from each PDF that is unique to that facility, "XX#######" (RegEx will pull that).

As the next step, I would like to provide a dictionary (a list of words) and count the occurrences of each word within the extracted "Facility Information" section. This would put a count under each word's column.

I could then take that data and populate a database in which each ID has the occurrence counts for those words, and store the extracted body of text from "Facility Information" in that same database, associated with that same XX###### number.

Possible to do? Complicated request? Can a newbie be taught to do something amazing? :)

Thank you, everyone, and I look forward to your reply.

Signed,

John

RolandBurger · February 24, 2017, 10:59am

Hi ovivojh,

This is not easy, since working with PDFs can be tricky due to a lack of standardized formatting. This means that, depending on the specific PDF you are working with, it might not be possible to identify sections.
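
To make the section problem concrete, here is a minimal Python sketch of the kind of rule you would need, outside of KNIME. It assumes the parsed PDF text keeps one heading per line and that a new section starts with a numbered heading such as "1.2 Permits"; both are assumptions about your documents, and the function name `extract_section` is made up for illustration. Font size and bold formatting are not available in plain text, so they are not used here.

```python
import re

# Heading names taken from the question; adjust to the real documents.
START_HEADINGS = ("Facility Information", "Facility Description")

# Assumption: a new section starts with "#.#" followed by a capitalized title,
# e.g. "1.2 Permits". A body line such as "2.5 acres" does not match.
NEXT_SECTION = re.compile(r"^\d+\.\d+\s+[A-Z]")


def extract_section(text, start_headings=START_HEADINGS):
    """Return the text between a start heading and the next numbered heading."""
    collected = []
    inside = False
    for line in text.splitlines():
        stripped = line.strip()
        if not inside:
            # Begin collecting on the line after a matching heading.
            if any(stripped.endswith(h) for h in start_headings):
                inside = True
            continue
        if NEXT_SECTION.match(stripped):
            break  # the next section begins, stop collecting
        if stripped:
            collected.append(stripped)
    return " ".join(collected)


sample = (
    "1.1 Facility Information\n"
    "The plant has two boilers.\n"
    "It covers 2.5 acres.\n"
    "1.2 Permits\n"
    "Permit text.\n"
)
print(extract_section(sample))
```

If the headings in your PDFs do not come out on their own lines after parsing, a rule like this will not work and the sections may not be recoverable.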

If you manage to identify the sections, you can count the terms as follows: convert the terms with a Term to String node, use a Reference Row Filter to keep only the rows containing the terms you are interested in, and then count them with a GroupBy node. You can then join these counts with your original table.
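
The logic of those steps (filter to dictionary terms, count per term, join back to the document) can be sketched in plain Python as below. This is only an illustration of the idea, not KNIME code; the function names `count_terms` and `build_row`, the column names, and the sample ID are hypothetical.

```python
import re
from collections import Counter


def count_terms(section_text, dictionary):
    """Count how often each dictionary word occurs in the section text."""
    # Lowercase and split into simple alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", section_text.lower())
    counts = Counter(tokens)
    # Keep only dictionary words; words that never occur get 0.
    return {word: counts[word.lower()] for word in dictionary}


def build_row(facility_id, section_text, dictionary):
    """Combine ID, extracted text, and one count column per dictionary word."""
    row = {"facility_id": facility_id, "facility_information": section_text}
    row.update(count_terms(section_text, dictionary))
    return row


text = "The boiler feeds a second boiler and a tank."
# "XX0000001" is a placeholder ID, not a real facility number.
row = build_row("XX0000001", text, ["boiler", "tank", "pump"])
print(row)
```

Each resulting row has one column per dictionary word, which matches the database layout described in the question. Note that this counts exact single-word matches only, so plurals and multi-word terms would need extra handling.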

Cheers,

Roland