KNIME has recently received considerable attention from my professor for its capability of determining keywords and clustering text based upon those keywords. The workflow we use is pretty straightforward (if needed, I'll post it): (1) PDF parsing, (2) some preprocessing, (3) keyword extraction, followed by (4) hierarchical clustering of documents based upon the keyword scoring.
However, there seems to be one practical flaw: when evaluating the results, we have some trouble identifying the documents in a cluster. According to the source code, the PDF Parser node assigns the title of a document from (1) the meta information, (2) the first sentence, and (3) the file name, in that order. Only if the extraction of meta information fails and no first sentence can be identified is a document referred to by its file name.
Is there any way to exclusively use the document's file name in the "Document" column without creating my own parser node?
There is no direct way to do this. I will put this issue on the feature request list, but I cannot promise it for 2.8.
A workaround would be to extract the parsed text from the documents (Document Data Extractor), filter the file name out of the full text (e.g. via a Java Snippet node), and use the file name (which the Document Data Extractor can extract as well) as the title column for the Strings To Document node [PDF Parser -> Document Data Extractor -> e.g. Java Snippet -> Strings To Document].
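To make the Java Snippet step of this workaround concrete, here is a minimal, self-contained sketch of the string manipulation it would perform. The class and method names are my own for illustration; inside an actual Java Snippet node you would operate directly on the input column you mapped in the dialog.

```java
// Hypothetical standalone version of the snippet logic: strip the directory
// part and the file extension from a full path, leaving only the base name.
public class FilenameFromPath {

    static String baseName(String path) {
        // Handle both Windows ("\\") and Unix ("/") separators.
        int sep = Math.max(path.lastIndexOf('\\'), path.lastIndexOf('/'));
        String name = path.substring(sep + 1);
        int dot = name.lastIndexOf('.');
        return dot > 0 ? name.substring(0, dot) : name;
    }

    public static void main(String[] args) {
        // Example from the thread: a Windows-style path to a PDF.
        System.out.println(baseName("\\bla\\location\\here\\and\\there\\document.pdf"));
    }
}
```

The resulting string can then feed the title column of the Strings To Document node.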
Thanks for your response. I guess the workaround you mentioned will work perfectly.
Edit: I totally forgot. In case anyone has a similar issue and is not familiar with the Java Snippet node: the substring method is most likely what you'll want to use:
filename = filename.substring(filename.lastIndexOf("\\") + 1 , filename.length() - 4);
This reduces the string "\bla\location\here\and\there\document.pdf" to "document".
Best regards and have a nice day!
I'm trying to use the PDF Parser node from the KNIME Labs library, and for some unknown reason, when browsing to my documents folder, the node doesn't recognize any PDF files, although there are plenty of them.
You need to place your PDFs in one directory and specify this directory in the dialog of the PDF Parser node. The node will read all files ending with .pdf or .PDF. Of course, these PDFs need to contain text; the node cannot apply OCR.
Do you get one document in the nodes output table for each PDF file in the directory?
8mm: if you use File.separator instead of \ then your code will also work on Linux and Mac installations.
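Applied to the earlier snippet, that tip looks roughly like the sketch below. The method name is my own; the logic is the same substring trick, just with the platform-dependent separator from java.io.File instead of a hard-coded backslash.

```java
import java.io.File;

// Portable variant of the earlier one-liner: File.separator is "\\" on
// Windows and "/" on Linux/macOS, so the same code runs on both.
public class PortableBaseName {

    static String stripDirAndExtension(String filename) {
        return filename.substring(
            filename.lastIndexOf(File.separator) + 1,
            filename.length() - 4); // drop the ".pdf" suffix
    }
}
```

Note that `length() - 4` assumes every input ends in a four-character extension like ".pdf", just as in the original snippet.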
Thanks Kilian, it works.
Yes, one doc for each PDF.
Now I can start playing around with Text Analytics.