Parsing PDF problem - Similar files with different results

lelloba · May 30, 2023, 3:00pm

Hello,

I have several PDFs to parse in KNIME. Here are two of them: one with 2022 data and one with 2023 data.
The two files have the same structure, but while the 2022 file is read by the Tika Parser node without issue, the 2023 file seems to be unreadable. Why is that happening, and how can I make it work again?

Here is a small workflow to reproduce the error:

Thank you,
Raffaello Barri

armingrudd · May 31, 2023, 10:52am

Hi @lelloba,

Could you share the PDF related to 2023 data? Can you check if the content of the PDF (2023) is in text format not image format?
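
One rough way to run that check outside KNIME is to look for font resources in the file: a page with real text normally references at least one font, while a scanned page is usually just an image. The snippet below is a minimal sketch of that heuristic using only the Python standard library. The function names are made up for illustration, and the test can be fooled (encrypted files, unusual stream filters, scans that already carry a hidden OCR text layer), so treat the result as a hint rather than proof.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of a PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def _inflated_streams(raw: bytes):
    """Yield the contents of every stream that inflates as Flate/zlib data."""
    for match in STREAM_RE.finditer(raw):
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example a JPEG image); skip it


def looks_image_only(raw: bytes) -> bool:
    """Heuristic: True if no font resource is referenced anywhere in the file."""
    if b"/Font" in raw:
        return False
    # PDF 1.5+ can hide dictionaries inside compressed object streams,
    # so search the inflated streams as well.
    return not any(b"/Font" in chunk for chunk in _inflated_streams(raw))
```

To try it on a real file, read the PDF as bytes (for example with `pathlib.Path(...).read_bytes()`) and pass the result to `looks_image_only`. If it returns True, the file most likely needs OCR before Tika can extract anything.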

lelloba · May 31, 2023, 11:00am

Hello @armingrudd ,

You are right, the 2023 data is not in text format, so Tika can’t read it. Download zipped PDF folder here
I don’t know what the website changed to make this happen.

What do you recommend? Using a PDF parser node (not very successful in my attempts)? Or using an external tool (Text recognition via OCR - easy, online, free - PDF24 Tools) to convert the PDF to text format and then running the workflow as it is?

Thank you,

Raffaello

armingrudd · May 31, 2023, 11:25am

I’d recommend using OCR Space APIs in KNIME.
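
For anyone who wants to see what such a call might look like, here is a minimal Python sketch using only the standard library. The endpoint, the form fields (`apikey`, `url`, `language`, `filetype`) and the response keys (`ParsedResults`, `ParsedText`, `IsErroredOnProcessing`, `ErrorMessage`) reflect my understanding of the OCR.space API and should be checked against its documentation. The function names and the default language are assumptions made for illustration, and the same request could equally be sent from a KNIME REST node.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the OCR.space parse API; verify against the official docs.
OCR_ENDPOINT = "https://api.ocr.space/parse/image"


def build_ocr_request(pdf_url: str, api_key: str, language: str = "eng") -> urllib.request.Request:
    """Build a POST request asking the service to OCR a PDF reachable at pdf_url."""
    form = {
        "apikey": api_key,      # your own key
        "url": pdf_url,         # publicly reachable location of the PDF
        "language": language,   # assumption: pick the code matching the document
        "filetype": "PDF",
    }
    body = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(OCR_ENDPOINT, data=body, method="POST")


def extract_text(payload: dict) -> str:
    """Join the recognized text of all pages, or raise if the service reported an error."""
    if payload.get("IsErroredOnProcessing"):
        raise RuntimeError(f"OCR failed: {payload.get('ErrorMessage')}")
    pages = payload.get("ParsedResults") or []
    return "\n".join(page.get("ParsedText", "") for page in pages)


def ocr_pdf(pdf_url: str, api_key: str) -> str:
    """Send the request and return the recognized text (needs network access)."""
    request = build_ocr_request(pdf_url, api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_text(json.load(response))
```

The text returned by `ocr_pdf` could then go through the same downstream steps that already work for the 2022 file.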

lelloba · May 31, 2023, 12:28pm

I didn’t know about this, thank you for sharing!

Have a nice day,
Raffaello
