Image PDF to text

#1

Could you please show an example of converting an image PDF to text?

0 Likes

#2

Please be more specific. Do you need to recognize text from a PDF? If so, use


or

1 Like

#3

Thank you for the quick reply. Can I use this workflow for Optical Character Recognition (OCR) on images in a PDF file that contain text?

0 Likes

#4

For OCR, look here:
https://www.knime.com/book/knime-image-processing-tesseract-ocr-extension

1 Like

#5

Thank you for the quick response.

1 Like

#6

Hi colleagues, I am using Tess4J for OCR. According to your instructions, I can only use PNG or SVG files for that. I can convert PDF to PNG with Tika Parser, but unfortunately it gives me the inline images as TIF files instead of PNG. Please see this fragment of the scan:


For other PDFs it sometimes gives me PNG.
The Tika Parser help does not explain the option "Extract inline images from PDFs".
What should I do? Thank you in advance.
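One way to handle the mixed TIF/PNG output is to check each extracted file and convert only the ones that are not already PNG. The following is a minimal sketch done outside KNIME, not a description of what Tika Parser or Tess4J do internally. The function names `sniff_image_format` and `to_png` are hypothetical, and the conversion step assumes the third-party Pillow library is installed.

```python
from pathlib import Path

# File signatures ("magic bytes") of the formats mentioned in this thread.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")  # little-endian and big-endian TIFF
JPEG_MAGIC = b"\xff\xd8\xff"


def sniff_image_format(data: bytes) -> str:
    """Guess the image format from the first bytes, ignoring the file extension."""
    if data.startswith(PNG_MAGIC):
        return "png"
    if data.startswith(TIFF_MAGICS):
        return "tiff"
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    return "unknown"


def to_png(src: Path, dst_dir: Path) -> Path:
    """Return a PNG version of src, converting with Pillow only when needed.

    Assumption: Pillow is installed. For a multi-page TIFF only the first
    frame is written.
    """
    with open(src, "rb") as fh:
        header = fh.read(8)
    if sniff_image_format(header) == "png":
        return src  # already PNG, nothing to do

    from PIL import Image  # imported lazily so the sniffing works without Pillow

    dst = dst_dir / (src.stem + ".png")
    with Image.open(src) as img:
        # PNG cannot store every mode (e.g. CMYK), so fall back to RGB.
        if img.mode not in ("1", "L", "P", "RGB", "RGBA"):
            img = img.convert("RGB")
        img.save(dst, format="PNG")
    return dst
```

The resulting PNG files could then be read back with Image Reader and passed on to Tess4J.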

0 Likes

#7
1 Like

#8

Dear colleagues, please help with the Tess4J component. Instead of text, I get a set of symbols like the one below. I used PNG files and kept the settings from your example OCR_meets_SemanticWeb.
image

0 Likes

#9

I use Tika Parser to extract images (embedded JPEG) from a PDF file. Each PDF file contains a series of invoice scans. During extraction, Tika Parser seems to take only the first page of each invoice.
image
The attachments table shows the embedded files. Strangely, each image is the first page of an invoice, which suggests that Tika recognizes the structure of the invoices (a break at each invoice header). Is it possible to get all the scanned images?


Image Reader reads the saved images (only the first image of each invoice). The image is then passed to Normalizer.

The image is then passed to Tess4J.

Only the Eng - Deskew - Autopage Seg and OSD seems OK.
image

| Settings | Result |
|---|---|
| Eng - Deskew - Full Autopage Seg - Default - Append | OK |
| Eng - Deskew - Autopage Seg and OSD - Default - Append | Not OK |
| Fr - Deskew - Full Autopage Seg - Default - Append | Not OK |
| Eng - Deskew - Full Autopage Seg - Cube only - Append | Runs, but the OCR output is not correct |
| Eng - Deskew - Full Autopage Seg - Default - New Table | OK |

It is not possible to use French, OSD or Cube because KNIME quits.
Is it possible to use other languages, OSD and Cube for better recognition?
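
To find out whether the failures come from Tesseract itself or from the KNIME node, the same language and page segmentation combinations could be tried outside KNIME. The sketch below is only an illustration under several assumptions: the `tesseract` command-line tool (version 4 or later) is installed along with the `eng`, `fra` and `osd` traineddata files, and the helper names `build_cmd` and `run_grid` are hypothetical. It presumes that "Full Autopage Seg" corresponds to Tesseract's `--psm 3` (fully automatic page segmentation, no OSD) and "Autopage Seg and OSD" to `--psm 1`. The Cube engine was removed in Tesseract 4, so it is not covered.

```python
import subprocess
from itertools import product

# Assumed mapping from the labels used in this post to Tesseract's
# page segmentation modes.
PSM = {
    "Full Autopage Seg": 3,     # fully automatic page segmentation, no OSD
    "Autopage Seg and OSD": 1,  # automatic page segmentation with OSD
}


def build_cmd(image, lang, psm_label):
    """Build the tesseract command line for one language/segmentation combination."""
    return [
        "tesseract", str(image), "stdout",  # "stdout" prints the text instead of writing a file
        "-l", lang,
        "--psm", str(PSM[psm_label]),
    ]


def run_grid(image, langs=("eng", "fra")):
    """Run every language/segmentation combination and collect the outcome.

    Returns {(lang, psm_label): (return_code, text, error_output)}.
    Requires the tesseract executable on PATH.
    """
    results = {}
    for lang, label in product(langs, PSM):
        proc = subprocess.run(
            build_cmd(image, lang, label),
            capture_output=True,
            text=True,
        )
        results[(lang, label)] = (proc.returncode, proc.stdout, proc.stderr)
    return results
```

If a combination fails here as well, the error output (for example, a message about a missing traineddata file) may indicate what is wrong; if it works here but KNIME still quits, the problem is more likely in the node or its bundled libraries.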
0 Likes