OCR with Tika Parser, Image Reader Table and Tess4J

Hi Dear,
I am wondering how I can add the Norwegian language to Tess4J.
How can I change the encoding in XML that comes from the Image Reader Table?
Thanks
Best
Reza

Hi @RezaBaharmand!

I am wondering how I can add the Norwegian language to Tess4J.
I’ve never worked with Tess4J myself, but you’d need to provide the tessdata yourself. If the Norwegian language is included there, it will show up in the language drop-down menu.

How can I change the encoding in XML that comes from the Image Reader Table?
Right now it’s only possible to read OME-XML metadata with the image readers. What would be your use case? Maybe it’d be possible to extract the metadata with another node and work from there. Do you have an example image you could send?

Kind regards,
Lukas

Hi @LukasS,
Thanks for your answer.
I’m not sure I understand the Tessdata Path exactly.
The image contains sensitive information, so sorry, I cannot share it, but it would be helpful if you could give me the names of the nodes you mentioned.
Best
Reza

Hi @RezaBaharmand ,

Sure, this would be the Tess4J node. You can download the traineddata from the internet (for example, this page provides trained data for the Norwegian language) to a folder, which you then provide in the node configuration. Once it’s in there, you should see the language “nor” pop up.

Hope that helps! Best Regards,
Lukas

@LukasS, thanks a lot.
Would you please tell me why I get an “Invalid memory access” error whenever I choose anything for OCR Engine Mode and Page Segmentation Mode?
Best
Reza

Hi @RezaBaharmand ,

After playing around a bit, I can now tell you 🙂 Two things:

  1. Apparently, the Tess4J node expects the traineddata to live in a folder called “tessdata”.
  2. Since KNIME uses version 3.* of Tesseract, the traineddata for 4.* will not work. You can find the traineddata for version 3.* here. (Thanks to stelfrich, post #10 in the thread “Tess4j for chinese Execute failed: Invalid memory access”.)
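
In case it helps to double-check the folder before pointing the node at it, here is a minimal sketch in plain Java. `TessdataCheck` and `listLanguages` are hypothetical names made up for illustration, not part of KNIME or Tess4J. It only assumes what is described above: the folder is called “tessdata” and each language is a file such as nor.traineddata.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for illustration; not part of KNIME or Tess4J. */
final class TessdataCheck {

    private static final String SUFFIX = ".traineddata";

    /**
     * Returns the sorted language codes found in the given folder,
     * e.g. "nor" for a file called nor.traineddata.
     */
    static List<String> listLanguages(Path tessdataDir) throws IOException {
        // Per the thread, the folder itself has to be named "tessdata".
        Path name = tessdataDir.getFileName();
        if (name == null || !name.toString().equals("tessdata")) {
            throw new IllegalArgumentException(
                "Expected a folder named 'tessdata', got: " + tessdataDir);
        }
        List<String> languages = new ArrayList<>();
        // Only files ending in .traineddata count as language files.
        try (DirectoryStream<Path> files =
                 Files.newDirectoryStream(tessdataDir, "*" + SUFFIX)) {
            for (Path file : files) {
                String fileName = file.getFileName().toString();
                languages.add(
                    fileName.substring(0, fileName.length() - SUFFIX.length()));
            }
        }
        Collections.sort(languages);
        return languages;
    }
}
```

If `listLanguages` returns a list containing “nor”, the folder layout should be fine. Note that this check only looks at names; it cannot tell whether the files were trained for Tesseract 3.* or 4.*.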

I also uploaded a little example workflow:

Please excuse my norsk, I have no idea what is written there 😉 But the transcript seems pretty similar to the original text to me.

Cheers,
Lukas

Hi @LukasS,
Many thanks for doing this.
I still get the same error as before with some of the options I choose, which is strange. 🧐🤨
It works, but the result is far from reality 😁 and I do not know why.
Thanks a lot for your kind help, I appreciate it.
I have a Java program that works much better than these nodes. How can I bring it into KNIME as a node and use it in an automated process?
Best
Reza

Hey Reza (@RezaBaharmand),

You are very welcome! In my experience, OCR depends a lot on the contrast of the input image. With the nodes from the Image Processing Extension (given you have the image reader, you already have them installed) you could try some preprocessing to increase contrast etc., which should improve your results.
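
To give an idea of what such a contrast step does, here is a rough sketch in plain Java using only `java.awt.image`. It is not the KNIME node implementation; `ContrastStretch` and `stretch` are hypothetical names, and a simple linear stretch is just one of several possible approaches.

```java
import java.awt.image.BufferedImage;

/** Hypothetical preprocessing helper for illustration; not a KNIME node. */
final class ContrastStretch {

    /**
     * Converts the image to grayscale and linearly stretches the values
     * so that the darkest pixel becomes 0 and the brightest becomes 255.
     */
    static BufferedImage stretch(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        int[] gray = new int[w * h];
        int min = 255;
        int max = 0;

        // First pass: compute grayscale values and find the value range.
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the usual luma weights.
                int v = (299 * r + 587 * g + 114 * b) / 1000;
                gray[y * w + x] = v;
                min = Math.min(min, v);
                max = Math.max(max, v);
            }
        }

        // Second pass: rescale to the full 0..255 range.
        // A completely flat image (max == min) ends up all black.
        int range = Math.max(1, max - min);
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int v = (gray[y * w + x] - min) * 255 / range;
                out.getRaster().setSample(x, y, 0, v);
            }
        }
        return out;
    }
}
```

A washed-out scan whose gray values only span, say, 100 to 150 would be spread over the full range this way. Whether that actually helps the OCR depends on the images, so it is worth comparing results with and without it.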

But if you already have working Java code, it makes much more sense to use that. The fastest way would be to use the Java Snippet node. Writing your own node would be possible as well, but it seems a bit excessive. If you need assistance, I’m happy to guide you, but I’d recommend trying the Java Snippet first.

Cheers,
Lukas

Thanks a lot for your kind guidance.
I will try it and let you know.
Many thanks.
Best
Reza

@RezaBaharmand, were you able to get your OCR Java implementation to work? I was wondering if it was flexible enough to allow other languages (like Japanese) and if you could share your workflow? Thank you!

Hi @victor_palacios, I did the OCR outside of KNIME. As I mentioned before, I used our Java application, which was trained on Norwegian text, specifically medical notes. After that I used KNIME for the ETL.
I can ask for permission to share it if you are interested.
Best
Reza

Yes, I would be interested in your Java implementation. I’ve made a Python implementation, but it’s quite complex. Looking forward to it. Thank you!
