Tika PDF Parser does not extract all images?

jansiewert · September 4, 2020, 6:23pm

Hi everyone,

I am using Tika PDF Parser to extract images from scanned PDFs. I have encountered the following problem:

I use two test files. File 1 is a scan directly from my printer. It does not contain any text, only images. File 2 is identical; I just added a “confidential” stamp to file 2 to test if that interferes with the process.
Tika PDF Parser extracts the images from file 2 (with the stamp) as expected, giving me PNGs that I can work with. From file 1, however, it only extracts TIF files that are empty (0 bytes). I use the same node for both files, so the settings are exactly the same. Outside KNIME, I have no problems extracting the images from either PDF file, e.g., with Ubuntu’s document viewer.

Has this occurred before? Or do you know if Tika Parser (or its KNIME implementation) has any limitations regarding the layout of the PDF? Does it not work with PDFs that exclusively contain images, or anything like that?

Thank you for your help,
Jan

izaychik63 · September 4, 2020, 6:31pm

In my experience, the Tika/PDF parsers are sensitive to the PDF version. For example, v1.4 is not recognized, while v1.6 works fine.
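
To check whether the two test files actually differ in PDF version, the version can be read from the `%PDF-x.y` header at the start of each file. The following is only a minimal sketch: the function names `pdf_header_version` and `pdf_file_version` are made up for illustration, and the header value can be overridden by a `/Version` entry in the document catalog, which this sketch does not inspect.

```python
import re
from typing import Optional

# A PDF file starts with a header such as "%PDF-1.4".
_HEADER = re.compile(rb"%PDF-(\d+\.\d+)")


def pdf_header_version(data: bytes) -> Optional[str]:
    """Return the version from the %PDF-x.y header, or None if no header is found."""
    # The header should be near the start of the file, so only scan the first 1 KiB.
    match = _HEADER.search(data[:1024])
    return match.group(1).decode("ascii") if match else None


def pdf_file_version(path: str) -> Optional[str]:
    """Read the first 1 KiB of a file and return its PDF header version."""
    with open(path, "rb") as fh:
        return pdf_header_version(fh.read(1024))
```

Calling `pdf_file_version` on file 1 and file 2 would show whether adding the stamp also changed the version the file declares.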

Andrew_Steel · September 5, 2020, 7:16am

Hi @jansiewert,

I had the same problem and tested several solutions with and without the Tika parser. The best solution I have found for extracting images from hundreds of PDF files from different sources is pdfimages, which is part of poppler (https://anaconda.org/conda-forge/poppler/files).
[Screenshot from 2020-09-05 09-10-56]

I am using Linux and the script is a bash script, but I think it is easy to port to a Windows batch script.

I hope it works for you.

Best regards
Andrew
pdf2images.knwf (20.7 KB)

jansiewert · September 5, 2020, 10:33am

Hi Andrew,

Thank you for your solution! I think I can work with that.
In your experience, what kinds of documents worked well with KNIME’s Tika Parser, and which documents caused problems? You seem to have tried out a lot of different approaches.

Best,
Jan

Andrew_Steel · September 6, 2020, 3:16pm

Hi Jan,
I worked with the Tika parser, ImageMagick convert, and pdfimages with different options. Inside my PDF files I found a mixture of JPEG images in RGB color and CCITT images in gray.
The Tika parser extracted the JPEG images to jpg files and the CCITT images to empty tif files.
ImageMagick convert created its own images and did not extract the embedded images.
pdfimages with -j for JPEG and -png as the fallback for all other formats was the best solution for me.
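
For readers who prefer to call pdfimages from Python instead of a bash script, the option combination described above could be wrapped roughly as follows. This is a sketch under assumptions, not the script from the attached workflow: the helper names `build_pdfimages_cmd` and `extract_images` are hypothetical, and it assumes that poppler's `pdfimages` is on the PATH and writes its output as `<prefix>-NNN.<ext>`.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_pdfimages_cmd(pdf_path: str, out_prefix: str) -> List[str]:
    """Build the pdfimages call: -j keeps JPEG images as JPEG, -png is used for the rest."""
    return ["pdfimages", "-j", "-png", pdf_path, out_prefix]


def extract_images(pdf_path: str, out_dir: str) -> List[Path]:
    """Extract embedded images from one PDF and return the non-empty output files."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages (poppler) was not found on PATH")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Use the PDF's base name as the output prefix, e.g. out/scan-000.png.
    prefix = out / Path(pdf_path).stem
    subprocess.run(build_pdfimages_cmd(str(pdf_path), str(prefix)), check=True)
    # Skip zero-byte files so that empty extractions are easy to notice.
    return sorted(p for p in out.glob(f"{prefix.name}-*") if p.stat().st_size > 0)
```

Calling `extract_images("scan.pdf", "out")` would then return the extracted image files; an empty list hints that nothing usable was extracted from that PDF.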

Best Regards
Andrew

jansiewert · September 7, 2020, 7:48am

Thank you Andrew! Much appreciated.
Best,
Jan