Can't read PDF files with either the PDF Parser or the Tika Parser node

tma · October 8, 2020, 3:56pm

Hi all,

I am stuck at reading PDF files. I first used the PDF Parser node, but when I viewed the parsed document with the Document Viewer node I got only the title and section headings, not the text of each section and subsection of the PDF file. Then I tried the Tika Parser node and got the same output. I am still very new to KNIME. I read related topics but couldn’t find an exact solution.

Can anyone tell me how I can parse the whole content of a PDF file? What I am trying to do is read a PDF, perform some data cleaning (using the Number Filter, Punctuation Erasure, String Replacer nodes etc.) and store the document as CSV.

Maybe I could get a link to a somewhat similar workflow?
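
To make the intended cleaning concrete, here is a minimal sketch in plain Python of what I mean. It is only a rough analogue of the Number Filter and Punctuation Erasure steps, not a reimplementation of the KNIME nodes; the function names and the `title`/`text` column names are my own assumptions.

```python
import csv
import re
import string


def clean_text(text: str) -> str:
    """Remove digits and ASCII punctuation, then collapse whitespace."""
    # Rough stand-in for a number filter: replace every run of digits with a space.
    text = re.sub(r"\d+", " ", text)
    # Rough stand-in for punctuation erasure: delete ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse the leftover runs of whitespace into single spaces.
    return " ".join(text.split())


def write_csv(rows, path):
    """Write (title, text) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["title", "text"])  # hypothetical column names
        writer.writerows(rows)
```

For example, `clean_text("Section 2.1: Results, 2020!")` gives `"Section Results"`.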

Looking forward to your responses.

Thanks in advance!

izaychik63 · October 8, 2020, 4:44pm

Take a look at the PDF file's version. For version 1.4 or lower the parser may not work.
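
One quick way to check the version outside KNIME is to read the file header, since a PDF normally starts with a `%PDF-x.y` marker. A minimal sketch in Python, with function names that are just illustrative; note that from PDF 1.4 on a `/Version` entry in the document catalog can override the header, and this sketch ignores that.

```python
import re
from typing import Optional


def pdf_header_version(data: bytes) -> Optional[str]:
    """Return the version from a '%PDF-x.y' header, or None if none is found."""
    # The marker is normally at byte 0; search the first 1024 bytes to
    # tolerate files with a little leading junk.
    match = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    return match.group(1).decode("ascii") if match else None


def read_pdf_version(path: str) -> Optional[str]:
    """Read the start of the file at `path` and return its header version."""
    with open(path, "rb") as fh:
        return pdf_header_version(fh.read(1024))
```

Calling `read_pdf_version("myfile.pdf")` (the path is a placeholder) returns a string such as `"1.4"`, or `None` if the file has no PDF header.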

ScottF · October 8, 2020, 5:23pm

There are several workflows dealing with PDF files on the Hub that may be of use:

That said, as @izaychik63 suggested, it might have something to do with your specific file. If you have a non-confidential, non-working PDF file you can post here, maybe we can check on that for you.

izaychik63 · October 12, 2020, 8:59pm

Everything is good with your document. After the Tika Parser I used the Strings to Document node, with RowID as the title, and opened the document in the Document Viewer.

tma · October 13, 2020, 9:50am

@izaychik63 Thank you, it worked.
I can view the whole text now. But there is still a problem: when I view the document it shows two copies of the same text, one in bold and the other in normal formatting, as in your solution above. Do you know what the reason for this could be?
