Parsing two-column PDF files

Many PDF articles are formatted in two columns, which neither PDF Parser nor Tika Parser interprets properly: a sentence is treated as spanning across columns. This invalidates any subsequent sentence-based analysis. Is there any suggested workaround?

Hi @mpenalver -

Unfortunately, there is not currently a simple solution for parsing data formatted in this way. See the recent discussion in the thread below, which made use of the tabulizer package in R:


Thank you, @ScottF. Not good news here, as this is the way most scientific papers are formatted.

I think we have a ticket in the system to improve this behavior - I’ll double check. If not, I’ll create one. Sorry for the trouble.

EDIT: Existing ticket is AP-14318
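
In the meantime, one possible workaround is to reorder the text by column before any sentence splitting. Below is a minimal sketch in Python, not something the parser nodes do themselves. It assumes you can get word bounding boxes from a layout-aware extractor outside the nodes, and that every page has exactly two columns split at the page midline. The `Word` class and both function names are hypothetical, made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float   # left edge of the word
    x1: float   # right edge of the word
    top: float  # distance from the top of the page


def column_text(words, line_tol=2.0):
    """Join the words of one column, top to bottom and left to right."""
    lines = []  # each entry: [reference_top, [words on that line]]
    for w in sorted(words, key=lambda w: w.top):
        # Words whose tops are within line_tol belong to the same line.
        if lines and abs(w.top - lines[-1][0]) <= line_tol:
            lines[-1][1].append(w)
        else:
            lines.append([w.top, [w]])
    return " ".join(
        w.text for _, line in lines for w in sorted(line, key=lambda w: w.x0)
    )


def reorder_two_columns(words, page_width):
    """Return the left column's text followed by the right column's text."""
    mid = page_width / 2
    # Assign each word to a column by the horizontal center of its box.
    left = [w for w in words if (w.x0 + w.x1) / 2 < mid]
    right = [w for w in words if (w.x0 + w.x1) / 2 >= mid]
    return " ".join(t for t in (column_text(left), column_text(right)) if t)


# Words as a naive row-by-row reader would see them: "The cat on the sat mat."
page = [
    Word("The", 10, 30, 10), Word("cat", 35, 55, 10),
    Word("on", 310, 325, 10), Word("the", 330, 350, 10),
    Word("sat", 10, 30, 20), Word("mat.", 310, 340, 20),
]
print(reorder_two_columns(page, page_width=600))
```

Full-width elements such as titles, figures, or tables, and words hyphenated across lines, are not handled here, so real papers would need more care than this.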


I’m aware of how difficult it is to properly handle PDF files formatted in different ways, but it is good to read that KNIME is looking into possible improvements.

Thanks again.
