PDF and Tika Parser

Hi,
Scientific pdf files are usually composed of 2 or 3 columns in one page. When I use the PDF or Tika parser and I check the text content output afterwards, these columns are combined (attached picture).
Do you know whether would be possible to avoid such a thing and read the columns separately?
Thank you in advance!
Cheers,
Nazareno.

Hey @Nazareno,

unfortunately I don’t see any (easy) possibility to work around this issue right now.
I will create a ticket in order to fix this.

Thank you for reporting.
Cheers,
Julian

3 Likes

Hi Julian,
Thank you for the answer. I hope you can solve it soon.
Best,
Nazareno.

hi all - any update on this? maybe a number of characters by row? or some kind of find and replace?

Hi @cscheeser -

I checked the ticket (AP-14318) and there’s nothing new to report unfortunately. But I will add a +1 from you on it.

Hi Scott thx for the reply. I can envision some kind of grid system (maybe in 0.1" increments) where text plucked from the section of x-y space would be imported as contiguous. Other text ignored.

Related if you have any clever regex or similar hacks, would love to learn more and see what might possible with existing nodes…

Our expert on text extraction from PDFs is @victor_palacios - I tag him here in case he has some clever tricks to share in this case.

Hello @cscheeser ,

This is an image segmentation problem and not strictly a PDF-parsing problem. You can see here my response to something very similar:

Being able to do this kind of segmentation is actually state of the art, so it would require the advanced techniques shown in the link above.

1 Like

Thank you Victor - i’ve seen another video of yours, good stuff!

I’ve had good luck with opportunistic parsing downstream of the Tika Parser. One area I can’t resolve - wrapping text in columns that falls to the next row, see pic:

upon parsing, i see the below. row 1 and row 2 both fine. row 3 though is too anonymous. Any ideas welcome.

DC WHSE NO. DATE NO. NO. NO. DESCRIPTION DESCRIPTION QTY PRICE AMOUNT
01/10/2023 0 0 FOF Oral Care FOF Oral Care 1.00 16,154.00000 16,154.00
Incremental Incremental SKU funding

Hello @cscheeser , could you give me a little more detail? I’m not sure what you mean by “anonymous”.

Hi - per the pic, text wrapping occurs in Column 8 and Column 9. However the post-parsing data suggests row 2 of text wrapping occurs in Column 1. No spaces, no tabs, no delimiters suggest the wrong column.

How to get the column 8’s “Incremental” to line up with column 8? “Incremental Sku Funding” in column 9?

another pic of the data post parser
image
again, row1 and 2 and manageable, lots of delimiters & patterns baked in. but row three isn’t able to attributed to column 8 or column 9. relationship seems lost…

@cscheeser

Now I understand: when text appears within in a box on 2 different lines like “Incremental” it actually gets assigned a new row and that new row’s column is incorrect. This seems to be an issue with parsing a table which is notoriously difficult. See the many discussions I’ve had with people about reading from tables:

TLDR; Not even state of the art models can read from tables with 100% accuracy without specific training, so this is one area where manual effort, clever strategies, or advanced models need to be used.

1 Like

Thx Victor - appreciat this reply. One idea: Adobe pro has a ‘batch’ exporter. If you batch export to excel, formatting seems retained. i’m playing w/ this now (of course you have to have a license). But maybe the solution is shared between adobe pro and knime?

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.