How to extract the content of a PDF with text and tables, divide it into rows, and then search for a row by a word it contains

Hello Community! I’m new to KNIME, using version 5.1.1 on a Mac. I have a PDF with text (and some tables) and am trying to extract all the content as text, divide it into rows, and search for a particular row in the resulting table. I don’t know the row’s name or number, but I know some of the words it contains. Can you please help?
Thank you!

Hello @Ami1 and welcome to the KNIME community

Your request is very generic. Please check the following workflow; it can be a starting point for you:

To get the right support from the community, it would be very useful if you could provide some more details:

  • What your input looks like, ideally with a sample
  • A description of the expected output
  • What you have already tested and the results

BR


@gonhaddock Thanks very much. I’ll try that. I am attaching a sample input file and an image. I’d like to be able to get the highlighted figures and words (in the image) from the PDF. They are in the table on page 47. In the other PDFs I’ll be using this for, I’ll know the words I’m looking for, but not where they are in the PDF, which row they are in, or which page the table is on.
I tried starting with the Tika Parser / PDF Parser followed by various row/column filtering nodes, and then nodes suggested by the Workflow Coach, but didn’t get the right result. I’m quite new to KNIME and really appreciate your help and any help from the forum. Thanks!

BR,
Ami


Interim-Report-as-of-June-30-2023.pdf (2.0 MB)

Hello @Ami1
Thanks for the example. I can point you to another example workflow that shows how to extract table data from a PDF file.

Unfortunately, I won’t have too much free KNIME time these days to work on customized forum solutions. I am sure that other fellow KNIMErs on the forum can help you move forward if needed.

Happy KNIM(E)ing


@Ami1
Just one question. Many companies also provide their financial statements as an Excel download, or you can use financial websites for that. Have you tried that already?
br

Hi @gonhaddock, Thanks, will try this out.
Much appreciated!

Hi @Daniel_Weikert,
Thanks for the question. In this instance I’m trying to provide a service to an entity that only has this type of PDF document available. I’m looking for a solution that lets me perform this kind of action on this kind of document on an ongoing basis.
Best,
Ami

@Ami1 the solutions to the KNIME challenge 015 might indeed be a starting point. Then there is an approach using an R package:

You can search for a word in a PDF file and extract the location:
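
If you would rather stay in Python (for example in a Python Script node), the same idea could be sketched roughly like this. This is not the R approach referenced above, just a minimal illustration. It assumes the third-party package pypdf is installed for reading the PDF; `read_pdf_pages` and `find_word` are hypothetical helper names, and the "location" is simply the page number and the line number within the extracted page text.

```python
import re
from typing import Iterable, List, Tuple


def read_pdf_pages(path: str) -> List[str]:
    """Return the extracted text of each page (assumes pypdf is installed)."""
    # Imported lazily so find_word() can be used on any text without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


def find_word(pages: Iterable[str], word: str) -> List[Tuple[int, int, str]]:
    """Return (page_number, line_number, line_text) for each line containing word.

    Page and line numbers are 1-based; matching is case-insensitive.
    """
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    hits = []
    for page_no, text in enumerate(pages, start=1):
        for line_no, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((page_no, line_no, line.strip()))
    return hits
```

You would call it as `find_word(read_pdf_pages("your_file.pdf"), "some phrase")`. Each hit is one text line, so this also covers the "divide into rows and search by word" part, although how well table rows survive as single lines depends on how the PDF was produced.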

To extract tables, there is an option to use tabulizer.

Or you can try the Python package camelot.
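
A minimal sketch of the camelot route, under some assumptions: camelot is installed, the tables are detected with the `stream` flavor (tables with ruling lines may work better with `lattice`), and `read_tables` and `rows_with_keywords` are hypothetical helper names chosen for illustration. The row filter is plain Python, so it can be tried on any list of rows.

```python
from typing import List, Sequence


def read_tables(path: str, pages: str = "all", flavor: str = "stream") -> List[List[List[str]]]:
    """Parse tables with camelot; return each table as a list of rows of cell strings."""
    # Imported lazily so rows_with_keywords() can be used without camelot.
    import camelot

    tables = camelot.read_pdf(path, pages=pages, flavor=flavor)
    # Each camelot table exposes its cells as a pandas DataFrame in .df
    return [table.df.astype(str).values.tolist() for table in tables]


def rows_with_keywords(rows: Sequence[Sequence[str]], keywords: Sequence[str]) -> List[List[str]]:
    """Keep the rows whose cells, joined together, contain every keyword (case-insensitive)."""
    wanted = [keyword.lower() for keyword in keywords]
    matches = []
    for row in rows:
        joined = " ".join(str(cell) for cell in row).lower()
        if all(keyword in joined for keyword in wanted):
            matches.append(list(row))
    return matches
```

Since you do not know the page in advance, you could parse with `pages="all"` and run `rows_with_keywords` over every returned table. That is slower on large reports, so narrowing the pages first with a word search may be worth it.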


@mlauber71 thank you so much for all this great information! Will try it!
BR,
Ami
