Reading from text and scanned PDFs, then performing a search on them

Rajib_Bhattacha · June 8, 2025, 6:09am

Hi,
I was just introduced to the KNIME world and have been mesmerized by it.
Now I am planning to build an analytics workflow with a couple of PDF files, some of which are scanned PDFs. I need to do the following:

Search for a text string in those PDF files.
Filter the PDFs that contain that string.
Highlight the related section or text that contains the string.
Could you kindly help me build a model to do this?
Thanks,
Rajib

rfeigel · June 9, 2025, 3:37am

If your PDFs aren’t proprietary, it would be very helpful if you could share some examples. A few technical points: if your PDFs are scanned, they’ll need to be OCRed to create readable text. You can OCR them before feeding them to KNIME or use the Tess4J node; you’ll need to install the Image Processing extension to access it. Depending on the format of your PDFs, OCR can be pretty unreliable. If they’re straight text, it should work pretty well. If they contain tables and images, it can be a mess. Are you using the same search string for all the PDFs? If not, how do you propose to match different search strings with the appropriate PDFs?
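Since only the scanned PDFs need OCR, one way to split the work is to check whether a normal text extraction returned anything usable for each page. The following is a minimal Python sketch of that check, not a KNIME node: the function name `needs_ocr` and the `min_chars` threshold are assumptions made for illustration, and the page texts are assumed to come from whatever text extraction step runs first.

```python
def needs_ocr(page_texts, min_chars=20):
    """Return the indices of pages whose extracted text layer looks empty.

    page_texts: list of strings, one per page, from a plain text extraction.
    min_chars:  hypothetical threshold; pages with fewer letters/digits than
                this are treated as scanned images that still need OCR.
    """
    flagged = []
    for index, text in enumerate(page_texts):
        # Count only letters and digits so whitespace and form feeds
        # left behind by the extractor do not count as real text.
        visible = sum(1 for ch in text if ch.isalnum())
        if visible < min_chars:
            flagged.append(index)
    return flagged
```

Pages flagged this way would go through OCR, while the rest can be searched directly. The threshold is a guess and would need tuning against real files.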

mlauber71 · June 10, 2025, 11:30am

@Rajib_Bhattacha there are some examples on the forum about using OCR to extract information from a PDF.

I have a more general article about the extraction of data from PDF with several examples.

As @rfeigel has said: it would be best if you could provide an example and explain what information you would like to extract.
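Until sample files are available, the three steps from the original question (search, filter, highlight) can be sketched on text that has already been extracted or OCRed. This is a minimal Python sketch under stated assumptions: the function name `search_documents`, the input layout (a dict mapping file name to a list of page strings), and the `[[...]]` markers standing in for highlighting are all hypothetical, and it marks matches in a text snippet rather than in the PDF itself.

```python
import re


def search_documents(documents, query, context=30):
    """Search page texts for a string and keep only the files that contain it.

    documents: dict mapping a file name to a list of page texts.
    query:     literal search string, matched case-insensitively.
    context:   number of characters kept on each side of a match.
    Returns {file name: [(page number, snippet), ...]} for matching files only.
    """
    # re.escape treats the query as literal text, not as a regular expression.
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    results = {}
    for filename, pages in documents.items():
        hits = []
        for page_number, text in enumerate(pages, start=1):
            for match in pattern.finditer(text):
                start = max(0, match.start() - context)
                end = min(len(text), match.end() + context)
                # Wrap the matched text in [[...]] as a stand-in for highlighting.
                snippet = (
                    text[start:match.start()]
                    + "[[" + match.group(0) + "]]"
                    + text[match.end():end]
                )
                hits.append((page_number, snippet))
        if hits:
            # Files without any hit are left out, which is the filter step.
            results[filename] = hits
    return results
```

The keys of the returned dict are the filtered PDFs, and each snippet shows where the string occurs. If different files need different search strings, the same function could be called once per file with its own query.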