PDF parser

Dear Kilian:

The PDF Parser, like the Word Parser, automatically takes the first sentence as the title of the document. This sometimes results in very long titles, taken from long sentences in the original PDF document, or even in the entire document ending up in a single document cell. You can imagine what a pain it is to have an entire PDF document of 10 to 15 pages in one cell.

I am not sure if others have experienced this.  I could provide an example, if required.

Cheers.

Hi Sridhar,

I definitely see your point here. This issue is already on the to-do list and has now just got a +1. What would you suggest as an alternative title? In the node dialog, the user could choose between the first sentence and the file name, for example. Any other suggestions?
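To make the two options concrete, here is a minimal Python sketch of such a title choice. It is not KNIME code: the function `choose_title`, its `mode` values, and the `max_len` cap are all hypothetical names made up for illustration, and the sentence split is deliberately naive.

```python
import re
from pathlib import Path


def choose_title(text: str, file_path: str,
                 mode: str = "first_sentence", max_len: int = 80) -> str:
    """Pick a document title (hypothetical helper, not a KNIME API)."""
    if mode == "file_name":
        # File name without directory and extension.
        return Path(file_path).stem
    if mode == "first_sentence":
        stripped = text.strip()
        # Naive split: first '.', '!' or '?' followed by whitespace or end of text.
        match = re.search(r"[.!?](\s|$)", stripped)
        sentence = stripped[: match.start() + 1] if match else stripped
        # Assumption: cap the length so that a missing sentence break
        # cannot turn the whole document into the title.
        if len(sentence) > max_len:
            sentence = sentence[:max_len].rstrip() + "..."
        return sentence
    raise ValueError(f"unknown mode: {mode}")
```

With `mode="file_name"` the title of `/tmp/report_2020.pdf` would be `report_2020`. With `mode="first_sentence"` a text without any sentence break is cut off at `max_len` characters instead of being used whole.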

Cheers, Kilian

Dear Kilian:

Happy to know that KNIME will work towards correcting this.  My simple response to your query is that anything would be better than a long string taken from the first sentence.  Having just dealt with a 400+ character title, I am aware that I am biased.

The file name is an excellent option: we could go back to the good old naming conventions limited to 8 characters, etc.

In Zotero, the metadata is automatically captured into the citation database.  I am sure something similar could be attempted here.

In the worst case, an option to provide details in the Strings to Document node would help.

You may want to consider a similar change in the Word Parser as well.

Cheers.

What I’m missing is the option to select an empty string as the title, as in the Strings to Document node. The reason is that the Dictionary Tagger also tags the text in the title, which results in terms being counted twice in later statistics or, even worse, being counted because they appear in the file path when they do not appear in the document… To avoid this, I currently have to extract the text from the document and convert it back to a document with the Strings to Document node, which is quite inefficient.
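To illustrate the counting problem, here is a small plain-Python sketch. It does not use the Dictionary Tagger or any KNIME API: `count_terms`, the dictionary, and the sample title, path, and body strings are all made up for the example.

```python
import re
from collections import Counter


def count_terms(text, dictionary):
    """Count occurrences of dictionary terms in text (case-insensitive)."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tok for tok in tokens if tok in dictionary)


dictionary = {"aspirin", "fever"}

# Case 1: the title is the first sentence, so it repeats the body text.
title = "Aspirin reduces fever in adults."
body = "Aspirin reduces fever in adults. The trial enrolled 200 patients."
with_title = count_terms(title + " " + body, dictionary)  # aspirin: 2, fever: 2
body_only = count_terms(body, dictionary)                 # aspirin: 1, fever: 1

# Case 2: the title is a file path containing a term that the body lacks.
path_title = "reports/aspirin/study01.pdf"
other_body = "The trial enrolled 200 patients."
path_counts = count_terms(path_title + " " + other_body, dictionary)  # aspirin: 1
other_body_only = count_terms(other_body, dictionary)                 # no matches
```

In both cases an empty title would give the same counts as the body alone, which is why that option would remove the need for the extract-and-convert workaround.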

Hey @mpenalver,

Yes, this is true. Tagging affects all parts of the document, and currently there is no other solution than to extract the document body text and convert it back into a document. If you want, you can update to v4.2, which includes some performance improvements regarding document conversion. So at least this workaround should be faster now. :wink:

Cheers,
Julian

I did the moment it came out. :relieved:
