Issue with PDF Parser Node - Combining Two Columns of Text into One Line

Hello KNIME Community,

I’m encountering an issue with the PDF Parser node in KNIME. When I use the node to extract text from a PDF document that has two columns of text, it combines both columns into a single line in the output. This is causing the extracted text to lose its original formatting and structure.

I’ve attached two screenshots to illustrate the problem:

  • The first shows the result from the PDF Parser node, where the two columns are combined into one row.

  • The second shows the original PDF with its two distinct columns.

Additionally, I’ve attached the PDF article for reference.
ijmsv11p1185.pdf (1.1 MB)

Is there a way to configure the PDF Parser node to maintain the original column structure and extract the text as it appears in the source PDF? Or is there a workaround or setting I might be missing?

Any help or guidance on how to address this issue would be greatly appreciated.

Thank you in advance!

Is there a way to configure the PDF Parser node to maintain the original column structure and extract the text as it appears in the source PDF?

Well, from the node description:

The full text of the PDF is extracted, the structure of the PDF is not taken into account.

The Tika Parser is no different, so there isn’t a good way to do this with built-in nodes. I’ve seen the R tabulizer package mentioned as an alternative, but I have no experience with it.
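If you are open to going outside the built-in nodes, one possible workaround is to extract word positions yourself and rebuild the reading order per column, for example inside a Python Script node. The snippet below is only a rough sketch of that idea, not something the PDF Parser node does: the function name `words_to_column_text` and the `line_tol` parameter are made up for illustration, and it assumes a simple two-column page split at the horizontal midpoint. It expects word boxes as dicts with `text`, `x0`, `x1` and `top` keys, which is the shape that a library such as pdfplumber returns from `page.extract_words()` (with the width taken from `page.width`).

```python
def words_to_column_text(words, page_width, line_tol=3.0):
    """Rebuild reading order for a two-column page (left column, then right).

    words:      list of dicts with keys "text", "x0", "x1", "top"
                (assumed input shape, e.g. from pdfplumber's extract_words()).
    page_width: page width in the same units as the x coordinates.
    line_tol:   hypothetical tolerance; words whose "top" differs by at most
                this much from the first word of a line share that line.
    """
    midpoint = page_width / 2.0

    # Assign each word to the left (0) or right (1) column by its center.
    columns = {0: [], 1: []}
    for word in words:
        center = (word["x0"] + word["x1"]) / 2.0
        columns[0 if center < midpoint else 1].append(word)

    def flush(line_words, out):
        # Emit one text line with its words ordered left to right.
        ordered = sorted(line_words, key=lambda w: w["x0"])
        out.append(" ".join(w["text"] for w in ordered))

    lines = []
    for col in (0, 1):
        current, current_top = [], None
        # Walk the column top to bottom, grouping words into lines.
        for word in sorted(columns[col], key=lambda w: (w["top"], w["x0"])):
            if current and abs(word["top"] - current_top) > line_tol:
                flush(current, lines)
                current = []
            if not current:
                current_top = word["top"]
            current.append(word)
        if current:
            flush(current, lines)

    return "\n".join(lines)


# Tiny hand-made example: two lines in each column of a 600-unit-wide page.
sample_words = [
    {"text": "Left", "x0": 50, "x1": 80, "top": 100},
    {"text": "one", "x0": 85, "x1": 110, "top": 100},
    {"text": "Right", "x0": 320, "x1": 360, "top": 100},
    {"text": "one", "x0": 365, "x1": 390, "top": 100},
    {"text": "Left", "x0": 50, "x1": 80, "top": 115},
    {"text": "two", "x0": 85, "x1": 110, "top": 115},
    {"text": "Right", "x0": 320, "x1": 360, "top": 115},
    {"text": "two", "x0": 365, "x1": 390, "top": 115},
]
print(words_to_column_text(sample_words, page_width=600))
```

Keep in mind that anything spanning the full page width (the title, the abstract, wide tables or figure captions) will be split incorrectly by a plain midpoint rule, so an article like the attached one would probably need extra handling for the first page.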

Other people have had the same question, and there’s an existing ticket for this (AP-14318), but as of March there had been no movement on it.
