Parsing two-column PDF files

Many PDF articles are formatted in two columns, which neither PDF Parser nor Tika Parser interprets properly: a sentence is treated as running across both columns. This invalidates any subsequent sentence-based analysis. Is there a suggested workaround?

Hi @mpenalver -

Unfortunately, there is not currently a simple solution for parsing data formatted in this way. See the recent discussion in the thread below, which made use of the tabulizer package in R:

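In the meantime, one general idea (not a KNIME feature, and not taken from the thread mentioned above) is to reorder the extracted words column by column before any sentence splitting. Below is a minimal Python sketch under strong assumptions: word positions are already available from some layout-aware extractor as (x0, top, text) tuples, the page has exactly two columns, and the gutter sits at the horizontal midpoint. The function name and input format are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical input format: (x0, top, text), where x0 is the left edge of
# the word and top is its distance from the top of the page.
Word = Tuple[float, float, str]


def two_column_reading_order(words: List[Word], page_width: float) -> str:
    """Join words in reading order, assuming two columns split at mid-page."""
    mid = page_width / 2.0
    # Assumption: a word belongs to the column its left edge falls in.
    left = [w for w in words if w[0] < mid]
    right = [w for w in words if w[0] >= mid]

    ordered: List[Word] = []
    for column in (left, right):
        # Within a column, read top to bottom, then left to right.
        ordered.extend(sorted(column, key=lambda w: (w[1], w[0])))
    return " ".join(w[2] for w in ordered)


words = [
    (50.0, 100.0, "The"), (90.0, 100.0, "cat"),
    (320.0, 100.0, "on"), (360.0, 100.0, "mats."),
    (50.0, 120.0, "sat"),
]
print(two_column_reading_order(words, page_width=600.0))
# prints "The cat sat on mats."
```

Reading the same words strictly line by line across the page would give "The cat on mats. sat", which is the cross-column mixing described in the question. Real papers would also need handling for full-width titles, figures, footnotes, and words on the same line whose top coordinates differ slightly.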

Thank you, @ScottF. Not good news here, as this is the way most scientific papers are formatted.

I think we have a ticket in the system to improve this behavior - I’ll double check. If not, I’ll create one. Sorry for the trouble.

EDIT: Existing ticket is AP-14318


I’m aware of how difficult it is to properly handle PDF files formatted in different ways, but it is good to read that KNIME is looking into possible improvements.

Thanks again.

