Export data from a specific point of a PDF

Hello Everyone and sorry for the newbie question.

I am trying to export some data from a PDF document to a table. The issue is that the PDF I am using as a source has the pre-printed form on it as an image, and the values filling the blanks are written as figures (I mean you can copy those). Another problem is that with Tika or PDF Parser I get only the figures printed in the blanks and not the cell names or codes.

I also tried to convert the PDF to an image and use Tesseract to OCR the document data, but that was unsuccessful too.

What I am looking to do is this: if there is a figure in the PDF cell with number 301, extract it to the relevant table column; if cell 301 is blank and cell 302 has a figure, I would like to extract a zero or blank value for cell 301 and the figure for cell 302.
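To make the rule above concrete, here is a minimal sketch in plain Python (not a KNIME node). It assumes some earlier step has already produced a mapping from cell code to figure; the function name `build_row` and the sample values are made up for illustration.

```python
from typing import Dict, Iterable


def build_row(extracted: Dict[str, str],
              expected_codes: Iterable[str],
              blank: str = "") -> Dict[str, str]:
    """Return one table row with a column for every expected cell code.

    Cells that were empty on the form are missing from `extracted`,
    so they are filled with `blank` (for example "" or "0").
    """
    return {code: extracted.get(code, blank) for code in expected_codes}


# Hypothetical example: cell 301 was left empty, cell 302 was filled in.
row = build_row({"302": "1500.00"}, ["301", "302"], blank="0")
print(row)  # {'301': '0', '302': '1500.00'}
```

The hard part is still getting from the PDF to the code-to-figure mapping; this only shows the shape of the result I am after.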

Do you think this is possible?

Below you may find the form I am looking to parse but without any figures.
Thank you in advance

Hi @thanos_agr,
The cell names are probably not stored as text but as images within the PDF, which is why it does not work. Would you be open to using a service like Azure Form Recognizer to do the work? I’ve had pretty good results with that. I have uploaded a component for using the service to KNIME Community Hub here. If this is something you could imagine doing yourself later, let me know and I can try it out for you real quick, as I have the Form Recognizer set up already in our Azure account.
Kind regards,
Alexander

Hello Alexander, and thank you for your reply. I will try to fetch a document without personal information for a check. The reason I am asking is that I am looking to set up a repository: I will drop my files in a directory and would finally export them to a structured table. I am not quite sure if this is possible. Maybe I could transform the PDF into an image and then use OCR to export the data? Do you think that this is possible?

Hi,
Azure Form Recognizer can deal with PDFs as well as image files. You can drop them all in one folder, then use the List Files/Folders node together with a loop, such as the Table Row To Variable Loop Start, to go through the documents one by one, send them to Azure Form Recognizer, parse the result, and then collect the data from all files in one table.
Kind regards,
Alexander
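
For anyone who prefers to see the loop spelled out, here is a rough sketch of the same idea in plain Python rather than as a KNIME workflow. Everything named here is an assumption for illustration: `extract_cells` is a hypothetical placeholder for whatever call sends one file to the recognition service and returns a mapping from cell code to figure, and the list of accepted file extensions is just an example.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List

# Example set of file types to pick up; adjust as needed (assumption).
ACCEPTED = {".pdf", ".png", ".jpg", ".jpeg"}


def collect_folder(folder: Path,
                   extract_cells: Callable[[Path], Dict[str, str]],
                   expected_codes: List[str]) -> List[Dict[str, str]]:
    """Go through the folder file by file and build one row per document."""
    rows = []
    for path in sorted(folder.iterdir()):
        if path.suffix.lower() not in ACCEPTED:
            continue  # skip anything that is not a PDF or image
        cells = extract_cells(path)  # hypothetical service call
        row = {"file": path.name}
        # Cells missing from the result become blank columns.
        row.update({code: cells.get(code, "") for code in expected_codes})
        rows.append(row)
    return rows


def write_table(rows: List[Dict[str, str]], out_path: Path,
                expected_codes: List[str]) -> None:
    """Write all collected rows to a single CSV table."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file", *expected_codes])
        writer.writeheader()
        writer.writerows(rows)
```

In KNIME the folder listing, the loop, and the final collection are handled by the nodes mentioned above, so none of this code is needed there; it only illustrates the flow.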
