Issue with PDF Parser Node - Combining Two Columns of Text into One Line

elsamuel · September 2, 2023, 2:39pm

Is there a way to configure the PDF Parser node to maintain the original column structure and extract the text as it appears in the source PDF?

Well, from the node description:

"The full text of the PDF is extracted, the structure of the PDF is not taken into account."

The Tika Parser is no different, so there isn’t a good way to do this with built-in nodes. I’ve seen the R tabulizer package mentioned as an alternative, but I have no experience with it.

Other people have had the same question, and there’s an existing ticket for this (AP-14318), but as of March there was no movement on it.
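
If you are open to a scripting workaround outside the built-in nodes, here is a rough sketch of the general idea rather than a tested solution. It assumes you can already get each word together with its position on the page, as `(x0, top, text)` tuples, from some library that reports coordinates (pdfplumber's `extract_words()` is one option I am aware of). The function name `words_to_columns`, the equal-width column split, and the `line_tolerance` value are all my own assumptions for illustration; real PDFs with uneven columns or full-width headings would need more care.

```python
from collections import defaultdict


def words_to_columns(words, page_width, n_columns=2, line_tolerance=3.0):
    """Group positioned words into columns and rebuild each column's text.

    words: iterable of (x0, top, text) tuples (hypothetical input format),
           where x0 is the left edge and top is the distance from the page top.
    Returns a list with one text string per column, left to right.
    """
    # Assumption: columns have equal width and split the page evenly.
    col_width = page_width / n_columns
    columns = defaultdict(list)
    for x0, top, text in words:
        idx = int(x0 // col_width)
        idx = max(0, min(idx, n_columns - 1))  # clamp stray coordinates
        columns[idx].append((top, x0, text))

    result = []
    for idx in range(n_columns):
        lines = []  # each entry: (top of first word, [(x0, text), ...])
        for top, x0, text in sorted(columns[idx]):
            # Words whose vertical position is close enough share a line.
            if lines and abs(top - lines[-1][0]) <= line_tolerance:
                lines[-1][1].append((x0, text))
            else:
                lines.append((top, [(x0, text)]))
        # Within a line, order words left to right.
        result.append(
            "\n".join(" ".join(t for _, t in sorted(ws)) for _, ws in lines)
        )
    return result


# Made-up sample data: two lines of text in each of two columns.
sample_words = [
    (50, 100, "Left"), (90, 100, "one"), (320, 100, "Right"), (370, 100, "one"),
    (50, 115, "Left"), (90, 115, "two"), (320, 115, "Right"), (370, 115, "two"),
]
left, right = words_to_columns(sample_words, page_width=600)
```

The point is only that the left column is read top to bottom before the right column, instead of each visual row being merged into one line. Something like this could probably be run from a Python scripting node, but I have not tried that.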