PDF Parser - Patent Documents

michael19602016 · November 16, 2016, 10:54pm

Dear All,

I'm absolutely new to KNIME and I'm desperately trying to extract some information from patent PDF files downloaded from the DPMA homepage. The PDF Parser works and generates a list of documents, with the first column being the row number and the second column the path to the document(s). But up to now, I haven't been able to extract any information from the document text itself (for example, a BoW). The Document Viewer node also works but displays only the information mentioned above... Could anybody give me a hint on how to configure the PDF Parser properly and how to proceed from its output port?

Best regards

Michael

rs1 · November 17, 2016, 10:07am

You should have some documents in the output table of the PDF Parser node. Here is a getting-started document about text analytics:

https://www.knime.org/files/knime_text_processing_introduction_technical_report_120515.pdf

I hope it helps

-- Rosaria

michael19602016 · November 17, 2016, 10:48pm

Dear Rosaria,

Thank you very much for your quick reply. The document was very helpful, and now I have an impression of the working steps. I could put all the nodes into operation and they worked. Nevertheless, in the end, again only the "text" of the path to the directory and the names of the document files were processed, not the content of the documents themselves.

What is processed is, for example:

C:\Data\Documents\Patents\DE 2954309.pdf.

This appears in the document table and is further processed by the succeeding nodes, but the information in the document is not accessed...

I'm sure that I'm doing something wrong on an elementary level, but I don't know what it is :-(

Best regards

Michael

rs1 · November 25, 2016, 12:36pm

When you visualize the results of the node, you only see the title of the document.

Use the Document Viewer node to see what is underneath.
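
To illustrate the difference between a document's title and its content, here is a minimal plain-Python sketch. It is not KNIME's API: the `Document` class, the `bag_of_words` helper, and the body text are hypothetical and made up for illustration; only the file path comes from this thread. The point is that a table preview may show just the title, while a bag of words should be built from the body text stored underneath.

```python
from collections import Counter
from dataclasses import dataclass
import re


@dataclass
class Document:
    """Hypothetical stand-in for a parsed document; not a KNIME class."""
    title: str  # what a table preview would show (here: the file path)
    text: str   # the parsed body text, stored underneath the title


def bag_of_words(doc: Document) -> Counter:
    """Count lowercase word tokens in the document body, not in its title."""
    tokens = re.findall(r"[a-zäöüß]+", doc.text.lower())
    return Counter(tokens)


doc = Document(
    title=r"C:\Data\Documents\Patents\DE 2954309.pdf",
    # Made-up sample text, standing in for the real patent content.
    text="A valve comprising a housing. The housing holds the valve seat.",
)

print(doc.title)           # the only thing a title-only preview reveals
bow = bag_of_words(doc)
print(bow.most_common(3))  # word counts taken from the body text
```

If the counts contained only words such as "patents" or "pdf", that would suggest the title (the path) was processed instead of the content.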

If that does not answer your question, can you maybe share the workflow?

-- Rosaria
