Image PDF to text

Vladimir_Savin · September 10, 2019, 1:16pm

Could you please show an example of converting an image PDF to text?

izaychik63 · September 10, 2019, 3:08pm

Please be more specific. Do you need to recognize text from a PDF? If yes, then use

or

Vladimir_Savin · September 11, 2019, 7:29am

Thank you for the quick reply. Can I use this workflow for Optical Character Recognition (OCR) on a PDF file whose pages are images containing text?

izaychik63 · September 11, 2019, 11:34am

For OCR look here
https://www.knime.com/book/knime-image-processing-tesseract-ocr-extension

Vladimir_Savin · September 11, 2019, 11:49am

Thank you for the quick response.

Vladimir_Savin · September 12, 2019, 1:00pm

Hi colleagues, I am using Tess4J for OCR. According to your instructions, I can only use PNG or SVG files for that. I can convert PDF to PNG using Tika Parser. Unfortunately, it gives me the inline images as TIF files instead of PNG files. Please see the fragment of the scan.

For other PDFs it sometimes gives me PNG files.
The Tika Parser help does not explain the option “Extract inline images from PDFs”.
What should I do? Thank you in advance.

Vladimir_Savin · September 12, 2019, 1:28pm

Vladimir_Savin · September 13, 2019, 7:28am

Dear colleagues, please help with the Tess4J component. Instead of text I received a set of symbols like the one below. I used PNG files and kept the settings from your example OCR_meets_SemanticWeb.

KP11 · September 30, 2021, 7:02am

Hi,
I’m trying to replicate your flow, but it is impossible to tell what is inside the Regex flow.

PBJ · September 30, 2021, 12:35pm

I’m sorry, I had to remove the original message to protect the privacy of the pictures.

The Tika parser doesn’t parse PDF files with embedded TIF pictures.
Instead, I now use an External Tool (Labs) node to run a free, open-source converter (from the poppler distribution) that converts each page of a PDF file to a PNG picture. Afterwards, all the pictures are converted to text with the Tess4J node (OCR).
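
For anyone who wants to try the same two steps outside KNIME, here is a minimal Python sketch. It is not the workflow itself. It assumes the poppler tool `pdftoppm` and the `tesseract` command-line program (the engine that Tess4J wraps) are installed and on the PATH. The function names, the `page` file prefix, the 300 dpi default and the `eng` language are my own choices for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def pdftoppm_cmd(pdf, prefix, dpi=300):
    # pdftoppm (poppler) renders each page to its own PNG: <prefix>-<page>.png
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(prefix)]


def tesseract_cmd(png, lang="eng"):
    # Using "stdout" as the output base makes tesseract print the text
    return ["tesseract", str(png), "stdout", "-l", lang]


def page_number(png):
    # "page-03.png" -> 3, so pages sort numerically even without zero padding
    return int(Path(png).stem.rsplit("-", 1)[1])


def pdf_to_text(pdf, workdir, dpi=300, lang="eng"):
    """Render every page of `pdf` to PNG, then OCR the pages in order."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_cmd(pdf, workdir / "page", dpi), check=True)
    texts = []
    for png in sorted(workdir.glob("page-*.png"), key=page_number):
        result = subprocess.run(
            tesseract_cmd(png, lang), check=True, capture_output=True, text=True
        )
        texts.append(result.stdout)
    return texts
```

Calling `pdf_to_text("scan.pdf", "pages")` would return one string per page. Because every page is rendered as a whole, it should not matter whether the images embedded in the PDF are TIF or PNG.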

The Regex MetaNode only extracts some data from the text files generated by Tess4J and is not related to the original question.

Best regards.
