Hi,
Scientific pdf files are usually composed of 2 or 3 columns in one page. When I use the PDF or Tika parser and I check the text content output afterwards, these columns are combined (attached picture).
Do you know whether would be possible to avoid such a thing and read the columns separately?
Thank you in advance!
Cheers,
Nazareno.
Hey @Nazareno,
unfortunately I donât see any (easy) possibility to work around this issue right now.
I will create a ticket in order to fix this.
Thank you for reporting.
Cheers,
Julian
Hi Julian,
Thank you for the answer. I hope you can solve it soon.
Best,
Nazareno.
hi all - any update on this? maybe a number of characters by row? or some kind of find and replace?
Hi @cscheeser -
I checked the ticket (AP-14318) and thereâs nothing new to report unfortunately. But I will add a +1 from you on it.
Hi Scott thx for the reply. I can envision some kind of grid system (maybe in 0.1" increments) where text plucked from the section of x-y space would be imported as contiguous. Other text ignored.
Related if you have any clever regex or similar hacks, would love to learn more and see what might possible with existing nodesâŚ
Our expert on text extraction from PDFs is @victor_palacios - I tag him here in case he has some clever tricks to share in this case.
Hello @cscheeser ,
This is an image segmentation problem and not strictly a PDF-parsing problem. You can see here my response to something very similar:
Being able to do this kind of segmentation is actually state of the art, so it would require the advanced techniques shown in the link above.
Thank you Victor - iâve seen another video of yours, good stuff!
Iâve had good luck with opportunistic parsing downstream of the Tika Parser. One area I canât resolve - wrapping text in columns that falls to the next row, see pic:
upon parsing, i see the below. row 1 and row 2 both fine. row 3 though is too anonymous. Any ideas welcome.
DC WHSE NO. DATE NO. NO. NO. DESCRIPTION DESCRIPTION QTY PRICE AMOUNT
01/10/2023 0 0 FOF Oral Care FOF Oral Care 1.00 16,154.00000 16,154.00
Incremental Incremental SKU funding
Hello @cscheeser , could you give me a little more detail? Iâm not sure what you mean by âanonymousâ.
Hi - per the pic, text wrapping occurs in Column 8 and Column 9. However the post-parsing data suggests row 2 of text wrapping occurs in Column 1. No spaces, no tabs, no delimiters suggest the wrong column.
How to get the column 8âs âIncrementalâ to line up with column 8? âIncremental Sku Fundingâ in column 9?
another pic of the data post parser
again, row1 and 2 and manageable, lots of delimiters & patterns baked in. but row three isnât able to attributed to column 8 or column 9. relationship seems lostâŚ
Now I understand: when text appears within in a box on 2 different lines like âIncrementalâ it actually gets assigned a new row and that new rowâs column is incorrect. This seems to be an issue with parsing a table which is notoriously difficult. See the many discussions Iâve had with people about reading from tables:
TLDR; Not even state of the art models can read from tables with 100% accuracy without specific training, so this is one area where manual effort, clever strategies, or advanced models need to be used.
Thx Victor - appreciat this reply. One idea: Adobe pro has a âbatchâ exporter. If you batch export to excel, formatting seems retained. iâm playing w/ this now (of course you have to have a license). But maybe the solution is shared between adobe pro and knime?
This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.