Image PDF to text

Could you please show an example of converting an image PDF to text?

Please, be more specific. Do you need to recognize text from PDF? If yes, then use

Thank you for the quick reply. Can I use this workflow for Optical Character Recognition (OCR) on a PDF file whose pages are images containing text?

For OCR look here

Thank you for the quick response.

Hi colleagues, I am using Tess4J for OCR. According to your instructions, I can only use PNG or SVG files for that. I can convert PDF to PNG using the Tika Parser. Unfortunately, it gives me the inline images as TIF files instead of PNG ones. Please see the fragment of the scan.

For other PDFs it sometimes gives me PNG files.
The Tika Parser Help contains no explanation of the option "Extract inline images from PDFs".
What should I do? Thank you in advance.

Dear colleagues, please help with the Tess4J component. Instead of text, I received a set of symbols like the one below. I used PNG files and kept the settings from your example OCR_meets_SemanticWeb.

I’m trying to replicate your flow, but it is impossible to tell what is inside the Regex flow.

I’m sorry, I had to remove the original message to protect the privacy of the pictures.

The Tika parser doesn’t parse PDF files with embedded TIF pictures.
Instead, I now use an External Tool (Labs) node to run a free, open-source converter (from the poppler distribution) that converts each page of a PDF file to a PNG picture. After that, all the pictures are converted to text with the Tess4J node (OCR).
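
For anyone who wants to try the same two steps outside of KNIME, here is a rough Python sketch. It rests on several assumptions that are not stated in this thread: that the poppler converter is `pdftoppm`, that the Tesseract command-line tool stands in for the Tess4J node, and that both are installed and on the PATH. The function names and the `page` file prefix are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build the poppler command that renders every PDF page to a PNG file.

    Assumption: pdftoppm is the converter; -png selects the output format
    and -r sets the resolution in DPI.
    """
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def tesseract_command(image_path, out_base, lang="eng"):
    """Build the Tesseract CLI command; it writes its result to <out_base>.txt."""
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def pdf_to_text(pdf_path, work_dir, dpi=300, lang="eng"):
    """Render each page to PNG, OCR each PNG, and return the page texts in order."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")

    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    prefix = work_dir / "page"  # hypothetical prefix for the generated images

    # Step 1: one PNG per page, written as page-<number>.png
    subprocess.run(pdftoppm_command(pdf_path, prefix, dpi), check=True)

    # Sort by the numeric page suffix so the order holds with or without zero padding.
    pages = sorted(
        work_dir.glob("page-*.png"),
        key=lambda p: int(p.stem.rsplit("-", 1)[1]),
    )

    # Step 2: OCR each page image and collect the recognized text.
    texts = []
    for png in pages:
        out_base = png.with_suffix("")
        subprocess.run(tesseract_command(png, out_base, lang), check=True)
        texts.append(out_base.with_suffix(".txt").read_text(encoding="utf-8"))
    return texts
```

Calling `pdf_to_text("scan.pdf", "ocr_work")` would return one string per page. Because the pages are rendered rather than extracted as embedded images, the embedded TIF format should not matter in this approach.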

The Regex MetaNode only extracts some data from the text files generated by Tess4J and is not related to the original question.

Best regards.