PDF and Tika Parser

Nazareno · May 20, 2020, 11:39am

Hi,
Scientific pdf files are usually composed of 2 or 3 columns in one page. When I use the PDF or Tika parser and I check the text content output afterwards, these columns are combined (attached picture).
Do you know whether it would be possible to avoid this and read the columns separately?
Thank you in advance!
Cheers,
Nazareno.

julian.bunzel · May 22, 2020, 12:57pm

Hey @Nazareno,

unfortunately I don’t see any (easy) possibility to work around this issue right now.
I will create a ticket in order to fix this.

Thank you for reporting.
Cheers,
Julian

Nazareno · May 25, 2020, 8:00am

Hi Julian,
Thank you for the answer. I hope you can solve it soon.
Best,
Nazareno.

cscheeser · March 23, 2023, 7:57pm

hi all - any update on this? maybe a number of characters by row? or some kind of find and replace?

ScottF · March 24, 2023, 4:02pm

Hi @cscheeser -

I checked the ticket (AP-14318) and there’s nothing new to report unfortunately. But I will add a +1 from you on it.

cscheeser · March 24, 2023, 6:29pm

Hi Scott thx for the reply. I can envision some kind of grid system (maybe in 0.1" increments) where text plucked from the section of x-y space would be imported as contiguous. Other text ignored.
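
A minimal sketch of that region idea, assuming word positions are available from some PDF library (the plain text output discussed in this thread does not carry them). The `Word` record, the `text_in_region` helper, and all coordinates are hypothetical and only illustrate keeping the text inside one x-y rectangle and ignoring the rest:

```python
from typing import NamedTuple

class Word(NamedTuple):
    text: str
    x0: float   # left edge of the word
    x1: float   # right edge of the word
    top: float  # distance from the top of the page

def text_in_region(words, x_min, x_max, y_min, y_max, line_tol=3.0):
    """Return the text of the words that lie fully inside one rectangle."""
    inside = [w for w in words
              if x_min <= w.x0 and w.x1 <= x_max and y_min <= w.top <= y_max]
    lines = []  # each entry: (top of the first word on the line, [words])
    for w in sorted(inside, key=lambda w: w.top):
        if lines and abs(w.top - lines[-1][0]) <= line_tol:
            lines[-1][1].append(w)   # same visual line
        else:
            lines.append((w.top, [w]))  # start a new line
    # Within each line, read the words left to right.
    return "\n".join(
        " ".join(w.text for w in sorted(ws, key=lambda w: w.x0))
        for _, ws in lines
    )

# Made-up two-column page: the left column ends near x=110, the right starts at x=300.
page = [
    Word("Alpha", 50, 80, 100), Word("beta", 85, 110, 100),
    Word("Delta", 300, 340, 100),
    Word("gamma", 50, 90, 115),
    Word("epsilon", 300, 350, 115),
]
left_column = text_in_region(page, 0, 200, 0, 800)
right_column = text_in_region(page, 200, 400, 0, 800)
```

Calling the helper once per column rectangle gives each column as contiguous text. The rectangles still have to be chosen per layout, which is the hard part for arbitrary PDFs.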

Related: if you have any clever regex or similar hacks, I would love to learn more and see what might be possible with existing nodes…

ScottF · March 24, 2023, 7:38pm

Our expert on text extraction from PDFs is @victor_palacios - I tag him here in case he has some clever tricks to share in this case.

victor_palacios · March 24, 2023, 10:16pm

Hello @cscheeser ,

This is an image segmentation problem and not strictly a PDF-parsing problem. You can see here my response to something very similar:

Being able to do this kind of segmentation is actually state of the art, so it would require the advanced techniques shown in the link above.

cscheeser · March 28, 2023, 2:56pm

Thank you Victor - i’ve seen another video of yours, good stuff!

I’ve had good luck with opportunistic parsing downstream of the Tika Parser. One area I can’t resolve - wrapping text in columns that falls to the next row, see pic:

upon parsing, i see the below. row 1 and row 2 both fine. row 3 though is too anonymous. Any ideas welcome.

DC WHSE NO. DATE NO. NO. NO. DESCRIPTION DESCRIPTION QTY PRICE AMOUNT
01/10/2023 0 0 FOF Oral Care FOF Oral Care 1.00 16,154.00000 16,154.00
Incremental Incremental SKU funding
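
One downstream heuristic for output like this, sketched under the assumption that every real data row starts with an MM/DD/YYYY date as in the sample above: any line that does not start with a date is treated as a continuation of the previous row. The `group_rows` helper is hypothetical, and it only attaches the wrapped line to its row; it cannot tell which column the wrapped text came from.

```python
import re

# Assumption: a genuine data row begins with a date such as 01/10/2023.
DATE_ROW = re.compile(r"^\d{2}/\d{2}/\d{4}\b")

def group_rows(lines):
    """Attach lines without a leading date to the preceding data row."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if DATE_ROW.match(line):
            records.append({"row": line, "continuation": []})
        elif records:
            records[-1]["continuation"].append(line)
        # Lines before the first data row (such as the header) are skipped.
    return records

parsed = [
    "DC WHSE NO. DATE NO. NO. NO. DESCRIPTION DESCRIPTION QTY PRICE AMOUNT",
    "01/10/2023 0 0 FOF Oral Care FOF Oral Care 1.00 16,154.00000 16,154.00",
    "Incremental Incremental SKU funding",
]
records = group_rows(parsed)
```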

victor_palacios · March 28, 2023, 3:49pm

Hello @cscheeser , could you give me a little more detail? I’m not sure what you mean by “anonymous”.

cscheeser · March 28, 2023, 5:07pm

Hi - per the pic, text wrapping occurs in Column 8 and Column 9. However the post-parsing data suggests row 2 of text wrapping occurs in Column 1. No spaces, no tabs, no delimiters suggest the wrong column.

How can I get column 8’s “Incremental” to line up with column 8, and “Incremental SKU funding” with column 9?

cscheeser · March 28, 2023, 8:27pm

another pic of the data post parser

again, rows 1 and 2 are manageable, with lots of delimiters & patterns baked in. but row three can’t be attributed to column 8 or column 9. the relationship seems lost…

victor_palacios · April 2, 2023, 5:42pm

@cscheeser

Now I understand: when text within a box wraps onto 2 different lines, like “Incremental”, the wrapped part gets assigned to a new row, and that new row’s column is incorrect. This seems to be an issue with parsing a table, which is notoriously difficult. See the many discussions I’ve had with people about reading from tables:

TL;DR: Not even state of the art models can read from tables with 100% accuracy without specific training, so this is one area where manual effort, clever strategies, or advanced models need to be used.
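
As one example of a "clever strategy" for the wrapped-cell case, here is a minimal sketch that assumes each word's horizontal position is known, which the flattened text output does not provide. The `Word` record, the `assign_columns` and `merge_wrapped` helpers, and the column start positions are all hypothetical. The idea is to map each fragment to the column whose left edge it falls under, then fold the continuation line back into the row above.

```python
from bisect import bisect_right
from typing import NamedTuple

class Word(NamedTuple):
    text: str
    x0: float  # left edge of the word

def assign_columns(words, column_starts, tol=1.0):
    """Group words by the last column whose left edge is <= the word's x0."""
    cells = {}
    for w in sorted(words, key=lambda w: w.x0):
        # tol absorbs small jitter in the reported positions.
        col = max(bisect_right(column_starts, w.x0 + tol) - 1, 0)
        cells.setdefault(col, []).append(w.text)
    return {col: " ".join(parts) for col, parts in cells.items()}

def merge_wrapped(row_cells, continuation_cells):
    """Append each continuation fragment to the same column of the row above."""
    merged = dict(row_cells)
    for col, text in continuation_cells.items():
        merged[col] = f"{merged[col]} {text}" if col in merged else text
    return merged

# Made-up left edges for three columns, e.g. taken from the header words.
column_starts = [0.0, 100.0, 200.0]
row = [Word("01/10/2023", 0), Word("FOF", 100), Word("Oral", 120),
       Word("Care", 145), Word("FOF", 200), Word("Oral", 220), Word("Care", 245)]
wrapped = [Word("Incremental", 100), Word("Incremental", 200),
           Word("SKU", 260), Word("funding", 285)]

merged = merge_wrapped(assign_columns(row, column_starts),
                       assign_columns(wrapped, column_starts))
```

This only works when the column edges are stable and the positions survive extraction; it is not a general table parser.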

cscheeser · April 3, 2023, 4:42pm

Thx Victor - appreciate this reply. One idea: Adobe Pro has a ‘batch’ exporter. If you batch export to Excel, the formatting seems to be retained. I’m playing w/ this now (of course you have to have a license). But maybe the solution is shared between Adobe Pro and KNIME?

system · July 2, 2023, 4:43pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.