PDF Reader vs Java Snippet

symonsjo · June 30, 2018, 4:11pm

A few things I found when using PDF Reader vs using PDFBOX with a Java Snippet.

The option to use no title (as you can in Strings To Document) should be available in PDF Reader. It’s not clear what PDF Reader uses as the title by default if you don’t select the file as title, but the choice seems to have a large impact on the document content (as seen in the attached comparison workflow). See this thread for a reason why stripping titles is tedious.
Punctuation erasure is inefficient when it comes to PDF content and there are some characters which may be worth adding to PDF erasure (see the replacement node in the attached comparison workflow).
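
As a rough illustration of the kind of character replacement meant here, the sketch below cleans text that has already been extracted from a PDF before counting words. It is not the replacement node from the attached workflow: the class name `PdfTextCleaner`, its methods, and the particular set of characters (ligatures, soft hyphens, typographic quotes, bullets, dashes) are assumptions chosen for illustration, and it uses only the Java standard library.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of KNIME or PDFBOX.
final class PdfTextCleaner {

    // Assumed set of typographic characters that plain ASCII punctuation
    // removal would miss: curly quotes, bullet, en/em dash, ellipsis.
    private static final String EXTRA_PUNCTUATION =
            "[\\u2018\\u2019\\u201C\\u201D\\u2022\\u2013\\u2014\\u2026]";

    static String clean(String raw) {
        // NFKC folds compatibility characters such as the "fi" ligature
        // (U+FB01) and the no-break space (U+00A0) into plain equivalents.
        String text = Normalizer.normalize(raw, Normalizer.Form.NFKC);
        // Drop soft hyphens (U+00AD) so hyphenated words are rejoined.
        text = text.replace("\u00AD", "");
        // Replace typographic punctuation, then ASCII punctuation, with spaces.
        text = text.replaceAll(EXTRA_PUNCTUATION, " ");
        text = text.replaceAll("\\p{Punct}", " ");
        // Collapse runs of whitespace left behind by the replacements.
        return text.replaceAll("\\s+", " ").trim();
    }

    static int wordCount(String raw) {
        String cleaned = clean(raw);
        return cleaned.isEmpty() ? 0 : cleaned.split(" ").length;
    }
}
```

In a Java Snippet, the input string would be whatever text was extracted from the PDF; which characters actually need replacing depends on the documents, so the set above should be adjusted after inspecting real output.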

Overall, even though the Java Snippet with PDFBOX is 3x slower than PDF Reader, I would still prefer to use it directly, as it’s unclear what impact PDF Reader’s default creation of Documents has on NLP. It may be worth giving the option to output strings rather than documents, but it is certainly worth offering the same title options as Strings To Document when documents are created automatically.

PDF_WORD_COUNT_COMPARISON.knwf (67.0 KB)

marten_kose · December 7, 2018, 10:08am

Hi @symonsjo,

First of all, sorry for the late response. I’ve addressed your concerns in an enhancement ticket for the PDF Reader node and will keep you posted on its progress.

Cheers,
Marten