Tika parser output image format

Hello everyone.

I'm seeing some odd behaviour with the Tika Parser. I'm trying to extract tables as images from a PDF, but the first extracted image is a .jpg and all the others are .tif.
The TIFF files aren't very useful for Tess4J afterwards: the quality is quite poor and the output is not accurate.

Could you help me understand how to extract the best-quality images from a PDF in KNIME, please?

Thanks a lot

Have a great evening.

A

Hi @aurel44 and welcome to the forum.

Without having the actual PDF to look at it's hard to say for certain, but it's likely that each image simply comes out in the format it was embedded in within the PDF, which is not something the Tika Parser has any control over.
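If you want to check what the extracted files really are, independent of their extensions, something like the snippet below might help. It is only a rough sketch in Python (for example in a Python Script node or outside KNIME), and the function name `sniff_image_format` is made up for illustration. It just looks at the leading signature bytes of JPEG, TIFF and PNG files.

```python
def sniff_image_format(data: bytes) -> str:
    """Guess an image format from its leading signature bytes.

    Hypothetical helper for illustration; covers only JPEG, TIFF and PNG.
    """
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    # TIFF has two byte orders: little-endian ("II") and big-endian ("MM").
    if data.startswith(b"II*\x00") or data.startswith(b"MM\x00*"):
        return "tiff"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    return "unknown"


def sniff_image_file(path: str) -> str:
    """Read the first few bytes of a file and guess its image format."""
    with open(path, "rb") as handle:
        return sniff_image_format(handle.read(8))
```

Calling `sniff_image_file` on each extracted file would tell you whether the .tif files are genuinely TIFF data, which would suggest the format comes from the PDF itself.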

Do you have a sample (non-confidential!) file you could post, along with your workflow in progress? Then maybe someone could take a look and see if other approaches could help.
