PDF in KNIME

Hi everyone, I need some help with KNIME regarding PDF processing.

I’m trying to load a folder that contains several PDFs (a company’s Annual Reports, one for each year) along with an Excel file containing a list of keywords.
My goal is to have KNIME read each individual PDF, correctly associate it with its corresponding year, and search for the keywords within it.

The issues I’m facing are:

  1. I can’t manage to assign each PDF the correct title/year, so the output isn’t clean.

  2. The number of resulting rows is much larger than expected: I would expect 58 keywords × 14 reports = 812 rows, but I’m getting more than 11,000.
    What could be causing this?

Thanks in advance to anyone who can help!

If you can, you can also contact me via email: roberto.cirillo@unicampania.it

Hi @RobertoCirillo ,

Welcome to the KNIME forum.

Can you please let us know which node you are using to read the PDF(s)?

This won’t do exactly what you want, but it might give you some ideas. It searches for individual terms in multiple PDFs.
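
To make the idea concrete outside of KNIME, here is a minimal Python sketch of the same logic. It is not the workflow itself, just an illustration under some assumptions: the text of each PDF has already been extracted, the file names contain the year (the names like `AnnualReport_2015.pdf` are made up), and the helper names `year_from_filename` and `keyword_counts` are hypothetical. It produces exactly one row per (report, keyword) pair, which is the 58 × 14 = 812 shape you are expecting.

```python
import re
from itertools import product


def year_from_filename(name):
    """Return the first standalone 4-digit year (19xx or 20xx) in a file name, or None."""
    match = re.search(r"(?<!\d)(19|20)\d{2}(?!\d)", name)
    return int(match.group(0)) if match else None


def keyword_counts(documents, keywords):
    """documents maps file name -> extracted text. Returns one row per (file, keyword)."""
    rows = []
    for (name, text), keyword in product(documents.items(), keywords):
        # Whole-word, case-insensitive match; the keyword is escaped so it is taken literally.
        pattern = r"\b" + re.escape(keyword) + r"\b"
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        rows.append(
            {"file": name, "year": year_from_filename(name), "keyword": keyword, "count": hits}
        )
    return rows


# Hypothetical sample data standing in for the extracted PDF text and the Excel keyword list.
documents = {
    "AnnualReport_2015.pdf": "Revenue grew. Sustainability remains a focus. Revenue outlook is stable.",
    "AnnualReport_2016.pdf": "Risk increased while revenue declined.",
}
keywords = ["revenue", "risk", "sustainability"]

rows = keyword_counts(documents, keywords)
print(len(rows))  # number of documents times number of keywords
for row in rows:
    print(row)
```

The point of the design is that the row count is fixed by the two input lists, and the number of hits goes into a `count` column instead of creating extra rows.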

Hi,
it would be good to have a look at your workflow so we can help more efficiently.

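Regarding point 2: I guess the 11,000 rows appear after a joining node? Do you have duplicates in both of your tables? Duplicate join keys usually lead to a much larger row count than expected, because every matching row on one side is paired with every matching row on the other.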
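
Here is a small Python sketch of that effect, in case it helps to see it with numbers. The data is invented and `inner_join` is a hypothetical helper, not a KNIME function. It assumes the match table has one row per sentence (or page) containing a keyword, which is a second common reason for getting more rows than keywords × reports.

```python
from collections import Counter


def inner_join(left, right, key):
    """Naive inner join: each left row is paired with every right row that shares the key."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


keywords_clean = [{"keyword": "risk"}, {"keyword": "revenue"}]
keywords_dupes = keywords_clean + [{"keyword": "risk"}]  # "risk" appears twice in the list

# Invented match table: one row per sentence containing the keyword, not one per report.
matches = [
    {"keyword": "risk", "year": 2015, "sentence": 1},
    {"keyword": "risk", "year": 2015, "sentence": 7},
    {"keyword": "revenue", "year": 2015, "sentence": 3},
]

print(len(inner_join(keywords_clean, matches, "keyword")))  # 3
print(len(inner_join(keywords_dupes, matches, "keyword")))  # 5: the duplicated key doubles its matches

# Fix: deduplicate the keyword list, then aggregate to one row per (keyword, year).
unique_keywords = [{"keyword": k} for k in dict.fromkeys(d["keyword"] for d in keywords_dupes)]
joined = inner_join(unique_keywords, matches, "keyword")
per_report = Counter((row["keyword"], row["year"]) for row in joined)
print(dict(per_report))
```

In KNIME terms, the same two steps would roughly correspond to a Duplicate Row Filter on the keyword table before the join and a GroupBy on keyword and year after it, but without seeing the workflow that is only a guess.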