Extracting Comments section from PDF and page numbers

Macca · June 28, 2017, 4:47pm

I have a PDF that is annotated with comments. Using KNIME, I would like to extract only the comments from the PDF, along with the page number each comment comes from.

The Tika parser can capture all the contents of the PDF, but the comments are not tagged in its output, so there is too much text to pick out the relevant ones.

It would be nice to hear from anyone in the KNIME community who has solved this problem.

Thanks

kilian.thiel · July 5, 2017, 10:49am

Hi Macca,

Specific sections of a PDF file cannot be identified and extracted separately. The Tika Parser node can parse text from a PDF file, but it takes all of the text, and formatting information is lost.
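
Outside the Tika parser, one possible workaround is to read the PDF annotation objects directly, for instance from a Python scripting node. The following is only a minimal sketch, not a KNIME or Tika feature: it assumes the pypdf library is installed, that the comments are stored as standard PDF annotations with a /Contents entry, and the function names are made up for illustration.

```python
def extract_comments(pages):
    """Return (page_number, comment_text) pairs for annotations that carry text.

    `pages` is any sequence of dict-like page objects,
    e.g. `PdfReader(path).pages` from pypdf.
    """
    rows = []
    for page_number, page in enumerate(pages, start=1):
        if "/Annots" not in page:
            continue  # page has no annotations at all
        for annot in page["/Annots"]:
            # pypdf returns indirect references that must be resolved;
            # plain dicts (used for a quick check) have no get_object().
            obj = annot.get_object() if hasattr(annot, "get_object") else annot
            if "/Contents" not in obj:
                continue  # e.g. link or popup annotations without text
            text = str(obj["/Contents"]).strip()
            if text:
                rows.append((page_number, text))
    return rows


def extract_comments_from_file(path):
    # Assumption: pypdf is installed. It is imported here so that
    # extract_comments() itself has no third-party dependency.
    from pypdf import PdfReader

    return extract_comments(PdfReader(path).pages)
```

Calling `extract_comments_from_file("annotated.pdf")` (a placeholder file name) would give a list of page number and comment text pairs, which could then be turned into a table. Whether every comment is found depends on how the annotating tool stored it.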

Cheers, Kilian