PDF Parser sometimes gives strange results

When I view a PDF file, I can see the lines and tell which line comes below which; that is what makes the text readable. I used the PDF Parser node, which sometimes works very well but sometimes gives strange results. In my case I focused on page 5 of the PDF file.

The strange thing is that rows 262 and 263 (consecutive in the parser output) are not consecutive in the PDF file but are separated by 14 lines. With such a result it is difficult to analyze the PDF document!

I attached

- the PDF file I used

- the workflow that shows the strange result

- a snapshot showing the output of the PDF Parser node next to page 5 of the PDF. The words that were compared are highlighted.

I hope the attached files make the problem clearer.

Is there something I did wrong in my workflow?

Thank you in advance for any help on this topic.

 

Hi tconvard,

that is really strange. Underneath, the node uses the Apache POI lib to parse the PDF. With 3.3.0 we have released new Tika Parser nodes. These nodes can also parse PDF files (and many other formats). Could you try it with the Tika nodes and check whether the order is still incorrect?

Cheers, Kilian
 

Hi Kilian,

thank you for replying to my post.

I checked the Tika node and got a similar error in the order of the lines.

Please find some additional information below:

The help text of the PDF Parser node points to the website http://pdfbox.apache.org/ for details.

There I found the following information concerning the issue I am facing:

"Why does the extracted text appear in the wrong sequence?

By default, text extraction is done in the same sequence as the text in the PDF page content stream. PDF is a graphic format, not a text format, and unlike HTML, it has no requirements that text on one page be rendered in a certain order. The order is the one that was determined by the software that created the PDF. To get text sorted from left to right and top to bottom, use setSortByPosition(true)."
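
To illustrate what the FAQ describes, here is a minimal plain-Java sketch. It is not PDFBox's actual implementation: the `Fragment` class, the coordinates and the sample data are hypothetical, and I assume y is measured from the top of the page. It only shows how text taken in content-stream order can differ from text sorted by position, which is the effect I would expect from setSortByPosition(true).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {

    // Hypothetical text fragment: a piece of text and where it is drawn on the page.
    // Assumption: y is measured from the top of the page, x from the left edge.
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Returns the texts in the order the fragments appear in the list,
    // i.e. the order in which they were written to the content stream.
    static List<String> inStreamOrder(List<Fragment> fragments) {
        List<String> result = new ArrayList<>();
        for (Fragment f : fragments) {
            result.add(f.text);
        }
        return result;
    }

    // Returns the texts sorted top to bottom, then left to right.
    // The input list is left unchanged.
    static List<String> sortedByPosition(List<Fragment> fragments) {
        List<Fragment> copy = new ArrayList<>(fragments);
        copy.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return inStreamOrder(copy);
    }

    public static void main(String[] args) {
        // The producing software wrote the fragments in this (arbitrary) order.
        List<Fragment> page = Arrays.asList(
                new Fragment("third line", 50, 300),
                new Fragment("first line", 50, 100),
                new Fragment("second line, right part", 200, 200),
                new Fragment("second line, left part", 50, 200));

        System.out.println(inStreamOrder(page));
        System.out.println(sortedByPosition(page));
    }
}
```

The first printed list follows the stream order, so neighbouring entries are not neighbours on the page; the second list follows the visual reading order. This looks like what happens with rows 262 and 263 in my table.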

 

Is there any way in the KNIME PDF Parser node to set

setSortByPosition(true)?

 

Thank you in advance for your help.

Best regards.

Thierry
