Entity and event extraction from PDF

Hi,

I am new to NLP and the KNIME text processing extension. My current project requires me to extract information from raw text files. Specifically, I need to identify and extract entities, dates, and the relationships between entities from unstructured text. So far I have been able to read in the data using the PDF Parser node, followed by an OpenNLP tagger node to tag the documents. It would be great if somebody could point me to the relevant nodes for this task.

 

Thanks.

Hi,

using the Tika Parser node is the better option for reading PDF files. For plain text files you can also use the Text File Parser.

Named entity recognition can be done with the NER tagger nodes; the text processing extension, for example, bundles some of the Stanford NER tagger models. Relationships cannot be extracted with a dedicated tagger node, so you would need to build your own workflow for that. For example, you could extract all named entities in a sentence together with the verb of that sentence. Parts of speech can be tagged using the Stanford Tagger node.
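Outside of KNIME, the idea of pairing the entities in a sentence with the verb between them could be sketched roughly like this in Python. This is only an illustration under assumptions: the `Token` type, the function names, and the hand-tagged example sentence are all hypothetical, and the tags merely follow the usual Penn Treebank (POS) and Stanford-style (NER) label conventions. It is not what any KNIME node does internally.

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    """Hypothetical container for one already-tagged token."""
    text: str
    pos: str             # Penn Treebank style POS tag, e.g. "NNP", "VBD"
    ner: Optional[str]   # entity label such as "PERSON", or None


def group_entities(tokens: List[Token]) -> List[Tuple[int, int, str, str]]:
    """Merge consecutive tokens sharing an NER label into (start, end, text, label)."""
    entities = []
    i = 0
    while i < len(tokens):
        label = tokens[i].ner
        if label is None:
            i += 1
            continue
        j = i
        while j + 1 < len(tokens) and tokens[j + 1].ner == label:
            j += 1
        text = " ".join(t.text for t in tokens[i:j + 1])
        entities.append((i, j, text, label))
        i = j + 1
    return entities


def candidate_relations(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (entity, verb, entity) triples for neighbouring entities
    that have at least one verb between them in the same sentence."""
    entities = group_entities(tokens)
    triples = []
    for (_, end1, text1, _), (start2, _, text2, _) in zip(entities, entities[1:]):
        verbs = [t.text for t in tokens[end1 + 1:start2] if t.pos.startswith("VB")]
        if verbs:
            triples.append((text1, " ".join(verbs), text2))
    return triples


# Made-up, hand-tagged sentence; in practice the tags would come from a tagger.
sentence = [
    Token("Alice", "NNP", "PERSON"),
    Token("Smith", "NNP", "PERSON"),
    Token("joined", "VBD", None),
    Token("Acme", "NNP", "ORGANIZATION"),
    Token("in", "IN", None),
    Token("2015", "CD", "DATE"),
    Token(".", ".", None),
]

print(candidate_relations(sentence))
```

Such triples are only candidates: a verb sitting between two entities does not guarantee a real relationship, so some filtering or manual review would still be needed.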

I hope this helps.

Cheers, Kilian

Thank you for the suggestion.

Just a small query: is it possible to read a PDF file page by page using the Tika Parser? I installed it and read in a PDF file, but I could not find any option or output that indicates the page number within the PDF.

Thanks.

Hi,

the Tika nodes read the whole PDF; it is not possible to read it page by page. All formatting information of the PDF is also lost during parsing.

Cheers, Kilian

Hi,

one more option is to first use the "External Tool" node (or a Python Script node) to split the PDF file into single-page files. Then you can read in all files (= single pages) from that folder using "List Files" and process them individually. To force "List Files" to execute after the "External Tool" node, you can simply draw a flow variable connection from the "External Tool" node's flow variable output to the "List Files" flow variable input.

I hope this helps,

Christian

PS: you can try to run tools such as PDFTK or PDFSplitter to split the PDF.
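For the Python Script route, a minimal sketch of the splitting step could look like the following. It assumes that pdftk is installed and on the PATH and uses its `burst` operation to write one file per page. The function names, the output folder, and the `page_%04d.pdf` naming pattern are just illustrative choices, not anything prescribed by KNIME.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_burst_command(pdf_path: str, out_dir: str) -> List[str]:
    """Build the pdftk call that writes one PDF per page into out_dir."""
    # pdftk fills in the printf-style page counter (%04d) itself.
    pattern = str(Path(out_dir) / "page_%04d.pdf")
    return ["pdftk", pdf_path, "burst", "output", pattern]


def split_pdf(pdf_path: str, out_dir: str) -> List[Path]:
    """Split pdf_path into single-page files and return them in page order."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk was not found on the PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_burst_command(pdf_path, out_dir), check=True)
    # Zero-padded names sort correctly as plain strings.
    return sorted(Path(out_dir).glob("page_*.pdf"))
```

Calling `split_pdf("report.pdf", "pages")` (both names are placeholders) should leave one file per page in the `pages` folder, which "List Files" can then pick up. Because the page number ends up in the file name, it can be recovered later from the file path.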

Hi Kilian,

can you possibly name some advantages of the Tika Parser over the PDF Parser node?

When using the Tika Parser, I’m no longer able to use, e.g., “Stop Word Filter” or “Case Converter” on its output (data type: String).

From my point of view, the Tika Parser seems more like an alternative to the “Document Data Extractor”, as it extracts some interesting metadata, but I’m not really sure about this classification.

Additional opinions for clarification are highly welcome.

Thank you in advance.

Best regards,
Paul

Hello @Monkeyschool -

I’m no Kilian, but I can offer a suggestion. If you use the Strings to Document node after importing data with the Tika Parser, you’ll find you can use the preprocessing nodes in the usual way.

Hi @ScottF

thank you very much for the feedback.
This is a good approach for further processing.

Are there any reasons for choosing the Tika Parser over the PDF Parser for parsing PDF files?

Best regards,
Paul