I'm new to KNIME. Is the following possible?
I have a large number of PDF documents that are all written in a similar way. Each one spans multiple pages and is broken up into sections with generally static section headings. (PDF Parser works well for getting them into KNIME.)
- The body of text in a given section, for example "Facility Information". I have tried Bag of Words to get the document into a one-word-per-row form, but I am struggling with how to say: capture all words from rows # to #, beginning where the row equals "Facility Information" or "Facility Description" and stopping before a specific marker is reached (for example a section number of the form "#.#" or, if possible, a larger-font or bolded word signaling a new section).
- An ID number in each PDF that is unique to that facility, in the form "XX#######" (a RegEx will pull that).
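To make the two extractions concrete, here is roughly the logic I have in mind, sketched in Python rather than as KNIME nodes. Everything in it is an assumption for illustration: the sample text, the function names, the idea that a new section always starts with a "#.#" number at the beginning of a line, and the reading of "XX#######" as two capital letters followed by seven digits.

```python
import re

# Hypothetical sample standing in for the text PDF Parser returns.
SAMPLE = """1.0 Introduction
Some opening text.
Facility ID: AB1234567
2.0 Facility Information
The facility has two boilers and one cooling tower.
It operates year round.
3.0 Permits
Permit text here.
"""

# Heading line that opens the section (an optional "#.#" prefix is allowed).
START = re.compile(
    r"^\s*(?:\d+\.\d+\s+)?Facility (?:Information|Description)\s*$",
    re.IGNORECASE,
)
# Assumed end marker: any later line that begins with a "#.#" section number.
# A body line that happens to start with something like "1.5 tons" would
# also stop the capture, so this is only a first approximation.
NEXT_SECTION = re.compile(r"^\s*\d+\.\d+\s+\S")
# Assumed ID format: two capital letters followed by seven digits.
FACILITY_ID = re.compile(r"\b[A-Z]{2}\d{7}\b")


def extract_section(text):
    """Return the text between the start heading and the next section heading."""
    body = []
    inside = False
    for line in text.splitlines():
        if not inside:
            if START.match(line):
                inside = True
            continue
        if NEXT_SECTION.match(line):
            break
        body.append(line.strip())
    return " ".join(part for part in body if part)


def extract_id(text):
    """Return the first facility ID found in the text, or None."""
    match = FACILITY_ID.search(text)
    return match.group(0) if match else None
```

Font size and bolding are not available in plain extracted text, so this sketch relies only on the section numbers.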
As the next step, I would like to provide a list of dictionary words and count how many times each of them occurs in the extracted "Facility Information" section. The result would have one column per dictionary word, holding the number of times that word was found.
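As a rough illustration of the counting step, again in Python and not KNIME, the sketch below counts whole-word, case-insensitive matches. The function name, the sample sentence, and the dictionary words are all made up, and it only handles single-word dictionary entries.

```python
import re
from collections import Counter


def count_dictionary_words(section_text, dictionary):
    """Return {dictionary word: number of whole-word occurrences}."""
    # Lowercase and split into simple word tokens.
    tokens = re.findall(r"[a-z0-9']+", section_text.lower())
    counts = Counter(tokens)
    # Words that never occur get a count of 0 instead of being left out.
    return {word: counts[word.lower()] for word in dictionary}


# Hypothetical section text and dictionary.
section = "The facility has two boilers. One boiler is idle; the other boilers run daily."
row = count_dictionary_words(section, ["boilers", "boiler", "turbine"])
```

Each key of `row` would become a column, with the count as its value for that document.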
I could then take that data and populate a database, so that each ID is stored with its word-occurrence counts, and the extracted "Facility Information" text is stored in that same database, associated with that same ID number.
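For the database step, this is the kind of layout I am picturing, sketched with Python's built-in sqlite3 module. The table names, column names, and sample values are hypothetical, and I used one row per ID and word instead of one column per word only because it keeps the sketch short.

```python
import sqlite3


def save_facility(conn, facility_id, section_text, word_counts):
    """Store the extracted text and the word counts under one facility ID."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facility ("
        "facility_id TEXT PRIMARY KEY, info_text TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_count ("
        "facility_id TEXT, word TEXT, occurrences INTEGER, "
        "PRIMARY KEY (facility_id, word))"
    )
    # INSERT OR REPLACE lets a document be re-processed without duplicates.
    conn.execute(
        "INSERT OR REPLACE INTO facility VALUES (?, ?)",
        (facility_id, section_text),
    )
    conn.executemany(
        "INSERT OR REPLACE INTO word_count VALUES (?, ?, ?)",
        [(facility_id, word, n) for word, n in word_counts.items()],
    )
    conn.commit()


# Hypothetical usage with an in-memory database.
conn = sqlite3.connect(":memory:")
save_facility(conn, "AB1234567", "The facility has two boilers.",
              {"boilers": 1, "turbine": 0})
rows = conn.execute(
    "SELECT word, occurrences FROM word_count "
    "WHERE facility_id = ? ORDER BY word",
    ("AB1234567",),
).fetchall()
```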
Possible to do? Complicated request? Can a newbie be taught to do something amazing? :)
Thank you, everyone, and I look forward to your reply.
Signed,
John