Why does the extracted text from a PDF appear in the wrong sequence?

tconvard · January 10, 2017, 6:30pm

The help information on the PDF Parser node points to the web site http://pdfbox.apache.org/ for details.

There I found the following information concerning the issue I am facing:

"Why does the extracted text appear in the wrong sequence?

By default, text extraction is done in the same sequence as the text in the PDF page content stream. PDF is a graphic format, not a text format, and unlike HTML, it has no requirement that text on one page be rendered in a certain order. The order is the one that was determined by the software that created the PDF. To get text sorted from left to right and top to bottom, use setSortByPosition(true)."
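To make the quoted explanation concrete, here is a minimal sketch of what "sorting by position" means. It does not use PDFBox itself: the `Fragment` class, the coordinates, and the sample texts are all hypothetical, and it assumes that y is measured from the top of the page and that fragments on the same line share exactly the same y value. The real option in PDFBox is `PDFTextStripper.setSortByPosition(true)`, whose internal ordering is more tolerant than this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {

    /** Hypothetical text fragment: x from the left edge, y from the top edge of the page. */
    public static final class Fragment {
        final String text;
        final double x;
        final double y;

        public Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the fragment texts ordered top to bottom, then left to right. */
    public static List<String> sortByPosition(List<Fragment> streamOrder) {
        List<Fragment> sorted = new ArrayList<>(streamOrder);
        // Primary key: vertical position; secondary key: horizontal position.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        List<String> texts = new ArrayList<>();
        for (Fragment f : sorted) {
            texts.add(f.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        // Content-stream order as some PDF producer might have written it:
        // the footer first, the heading last.
        List<Fragment> streamOrder = List.of(
                new Fragment("Page 1", 250, 780),
                new Fragment("Date:", 50, 100),
                new Fragment("May 18, 2015", 120, 100),
                new Fragment("Policy", 50, 40));

        // Without sorting, the text would come out in stream order.
        System.out.println(String.join(" | ", sortByPosition(streamOrder)));
    }
}
```

The extracted text only changes order; nothing is added or removed, which is why the same PDF can give different-looking results depending on whether the option is set.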

Is there any way to enable the following option in the KNIME PDF Parser node?

setSortByPosition(true)

Thank you in advance for your help.

Best regards.

kilian.thiel · January 17, 2017, 5:55pm

Hi,

thank you for the hint. We will check if this option helps and fix the node. Have you tried the new TIKA Parser nodes? They can also parse PDF files.

Cheers, Kilian

tconvard · January 20, 2017, 3:15pm

Hi Kilian,

Yes, I tried the TIKA parser and got the same problem.

Maybe it's a question of which version of the Java class is used. As a test, I downloaded the jar file from the web site http://pdfbox.apache.org/ and ran it from the command line in a DOS window. With that, I was able to parse the PDF file with a correct result.

The jar file I downloaded is pdfbox-app-2.0.4.jar; it seems that the version used in KNIME is 2.0.0. Maybe this information will help you.

Thank you in advance for your help.

Regards.

Thierry

izaychik63 · March 3, 2017, 8:20pm

Hello there.

I have a task to extract PDF file names and, as a second field, each document's effective date.

Documents are in a folder.

The date is part of the text, as in the line below:

Policy Effective Date: May 18, 2015

I started from the PDF Parser. It generated document names from the documents.

I'd like to have a file name.

Please advise how to get the date from the PDF text.

Thank you, Igor
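
One possible approach to the date question, sketched in plain Java outside of KNIME: once the document text is available as a string, a regular expression can pick out the value after the "Policy Effective Date:" label. The class name, method name, and the sample file path are hypothetical, and the sketch assumes the label and the English "Month d, yyyy" date format appear exactly as in the example line above.

```java
import java.nio.file.Paths;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EffectiveDateSketch {

    // Matches e.g. "Policy Effective Date: May 18, 2015" and captures the date part.
    private static final Pattern EFFECTIVE_DATE = Pattern.compile(
            "Policy Effective Date:\\s*([A-Za-z]+ \\d{1,2}, \\d{4})");

    // Assumes English month names written in full.
    private static final DateTimeFormatter DATE_FORMAT =
            DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH);

    /** Returns the effective date found in the text, or empty if none can be parsed. */
    public static Optional<LocalDate> extractEffectiveDate(String text) {
        Matcher m = EFFECTIVE_DATE.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(LocalDate.parse(m.group(1), DATE_FORMAT));
        } catch (DateTimeParseException e) {
            // The label matched, but the captured text is not a valid date.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Hypothetical file path and extracted text, for illustration only.
        String path = "policies/policy_0001.pdf";
        String text = "Some header text\nPolicy Effective Date: May 18, 2015\nMore text";

        String fileName = Paths.get(path).getFileName().toString();
        System.out.println(fileName + " -> " + extractEffectiveDate(text));
    }
}
```

Returning an `Optional` keeps documents without a recognizable date from breaking a batch run over a whole folder; those rows can simply be left empty and checked by hand.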