KNIME has recently received considerable attention from my professor for its capability of determining keywords and clustering text based upon those keywords. The workflow we use is pretty straightforward (if needed, I'll post it): (1) PDF parsing, (2) some preprocessing, (3) keyword extraction, followed by (4) hierarchical clustering of documents based upon the keyword scoring.
However, there seems to be one practical flaw: when evaluating the results, we have some trouble identifying the documents in a cluster. According to the source code, the PDF Parser node assigns the title of a document from (1) the meta information, (2) the first sentence, and (3) the file name, in that order. Only if the extraction of meta information fails and no first sentence can be identified is a document referred to by its file name.
Is there any way to exclusively use the document's file name in the "Document" column without creating my own parser node?
There is no direct way to do this. I will put this issue on the feature request list, but I cannot promise it for 2.8.
A workaround would be to extract the parsed text from the documents (Document Data Extractor), filter the file name out of the full text (e.g. via a Java Snippet node), and use the file name (which the Document Data Extractor can extract as well) as the title column for the Strings To Document node [PDF Parser -> Document Data Extractor -> e.g. Java Snippet -> Strings To Document].
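To make the Java Snippet step of this workaround concrete, here is a minimal, self-contained sketch of the string manipulation it would perform. The class and method names are my own for illustration; inside an actual Java Snippet node you would operate directly on the input column you mapped in the dialog.

```java
// Hypothetical standalone version of the snippet logic: strip the directory
// part and the file extension from a full path, leaving only the base name.
public class FilenameFromPath {

    static String baseName(String path) {
        // Handle both Windows ("\\") and Unix ("/") separators.
        int sep = Math.max(path.lastIndexOf('\\'), path.lastIndexOf('/'));
        String name = path.substring(sep + 1);
        int dot = name.lastIndexOf('.');
        return dot > 0 ? name.substring(0, dot) : name;
    }

    public static void main(String[] args) {
        // Example from the thread: a Windows-style path to a PDF.
        System.out.println(baseName("\\bla\\location\\here\\and\\there\\document.pdf"));
    }
}
```

The resulting string can then feed the title column of the Strings To Document node.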
Thanks for your response. I guess the workaround you mentioned will work perfectly.
Edit: I totally forgot. In case anyone has a similar issue and is not familiar with the Java Snippet node: the substring method is most likely what you'll want to use:
filename = filename.substring(filename.lastIndexOf("\\") + 1 , filename.length() - 4);
This reduces the string "\bla\location\here\and\there\document.pdf" to "document".
Best regards and have a nice day!
I'm trying to use the PDF Parser node from the KNIME Labs library, and for some unknown reason, when browsing to my documents folder, the node doesn't recognize any PDF files, although there are plenty of them.
You need to place your PDFs in one directory and specify this directory in the dialog of the PDF Parser node. The node will read all files ending with .pdf or .PDF. Of course, these PDFs need to contain text; the node cannot apply OCR.
Do you get one document in the nodes output table for each PDF file in the directory?
8mm: if you use File.separator instead of \ then your code will also work on Linux and Mac installations.
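Applied to the earlier snippet, that tip looks roughly like the sketch below. The method name is my own; the logic is the same substring trick, just with the platform-dependent separator from java.io.File instead of a hard-coded backslash.

```java
import java.io.File;

// Portable variant of the earlier one-liner: File.separator is "\\" on
// Windows and "/" on Linux/macOS, so the same code runs on both.
public class PortableBaseName {

    static String stripDirAndExtension(String filename) {
        return filename.substring(
            filename.lastIndexOf(File.separator) + 1,
            filename.length() - 4); // drop the ".pdf" suffix
    }
}
```

Note that `length() - 4` assumes every input ends in a four-character extension like ".pdf", just as in the original snippet.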
Thanks Kilian, it works.
Yes, one doc for each PDF.
Now I can start playing around with Text Analytics.