KNIME Read Image or PDF Scanned

Hello, is there a way to extract text from a scanned PDF, such as an invoice? I know we have Image Reader followed by Tess4J, but that is limited in what it captures and it requires an image as input. My scanned file is a PDF, which rules out the Image Reader node.

Is there a way to get the full text, for example by reading the PDF with the Tika Parser? I tried the Tika Parser, but no text is showing.

@takbb

Hey,

not 100% clear on your requirement here.

Tika Parser etc. will require that your PDF is “machine-generated”, e.g. a Word file saved as PDF. If it is a PDF based on a scan of a document, it will not work, because the pages are just images with no text to extract.

If it is a scanned PDF, then Tess4J may be an option. Alternatively, vision LLMs have become incredibly good at text recognition.
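A rough way to tell the two cases apart is to look at how much text the parser actually returns per page. Here is a minimal Python sketch of that check. The function name `needs_ocr` and the threshold of 20 characters are my own assumptions for illustration, not anything KNIME or Tika provides, so tune them to your documents.

```python
from typing import Iterable


def needs_ocr(page_texts: Iterable[str], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF is a scan, based on the text a parser extracted per page.

    page_texts: one string per page, e.g. the parser output split by page.
    min_chars_per_page: hypothetical threshold; pages with fewer non-whitespace
    characters are treated as having no usable text.
    """
    pages = list(page_texts)
    if not pages:
        # Nothing was extracted at all, so OCR is the only option left.
        return True

    # Count pages whose non-whitespace text falls below the threshold.
    sparse_pages = sum(
        1 for text in pages if len("".join(text.split())) < min_chars_per_page
    )

    # Treat the document as scanned when most pages carry (almost) no text.
    return sparse_pages > len(pages) / 2
```

If this returns `True`, the parser route is unlikely to help and the OCR or vision model route is the one to try.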

In an act of shameless self promotion I can recommend my article on this on Medium:

(use case 3)

And my Extension that contains a vision model prompter node:

Or alternatively this example workflow on how to prompt vision models with only the “normal” KNIME nodes:

Apologies, @MartinDDDD. I’ve rephrased my original post above.

Hello @MartinDDDD, yes, this is a scanned PDF. I checked your post, but I am not sure exactly which node I should use before Tess4J to read the PDF file.

I just tried the following:

Use the Tika Parser node, check the boxes as shown in the screenshot, and select a directory:

This will save the images that are in the scanned PDF to the selected folder.

From there you can then read them with Image Reader and continue from there.

I tried this with PDFs that I generated by taking pictures with the OneDrive app on my mobile phone, and one image per page of the file was saved.
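If you would rather script the step after the export outside KNIME, the same idea can be sketched in Python: collect the saved images in page order and pass each one to an OCR function. This is only a sketch under assumptions. The helper names are hypothetical, the list of image extensions is a guess at what gets exported, and `ocr` is a placeholder for whatever OCR call you use.

```python
import re
from pathlib import Path
from typing import Callable, List

# Assumed set of image types; adjust to whatever the export actually produces.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def natural_key(path: Path) -> list:
    # Split "image10.png" into ["image", 10, ".png"] so page 2 sorts before page 10.
    return [
        int(part) if part.isdigit() else part.lower()
        for part in re.split(r"(\d+)", path.name)
    ]


def collect_page_images(folder: str) -> List[Path]:
    # Keep only image files and order them by the numbers in their names.
    images = (
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    return sorted(images, key=natural_key)


def ocr_folder(folder: str, ocr: Callable[[str], str]) -> str:
    # Run the supplied OCR function on each page image and join the page texts.
    pages = [ocr(str(path)) for path in collect_page_images(folder)]
    return "\n\n".join(pages)
```

For example, with the `pytesseract` package and the Tesseract engine installed (an assumption on my side), `ocr_folder("exported_images", pytesseract.image_to_string)` would return the text of all pages in order, where `exported_images` stands for the directory you selected in the node.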

Amazing! Let me try this 🙂

May I know the configuration inside Tess4J, @MartinDDDD? It is showing blank output for the images.

I have not used Tess4J yet, but there is an example workflow you can download and inspect here:

Thank you, @MartinDDDD! I really appreciate you replying to my questions 🙂 Cheers!
