How to extract the content of a PDF with text and tables, divide it into rows, and then search for a row by a word it contains

Ami1 · October 2, 2023, 8:36am

Hello Community! I’m new to KNIME, using version 5.1.1 on a Mac. I have a PDF with text (and some tables) and am trying to extract all the content as text, divide it into rows, and search for a particular row in the resulting table. I don’t know the row’s name or number, but I know some of the words it contains. Can you please help?
Thank you!
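
For illustration, here is a minimal Python sketch of the task described above, outside KNIME. It assumes the PDF content has already been extracted into a plain string (for example by a PDF parser); the sample text and the names `text_to_rows` and `find_rows` are hypothetical, not taken from the thread.

```python
import re


def text_to_rows(text):
    """Split extracted text into non-empty rows with normalised whitespace."""
    rows = []
    for line in text.splitlines():
        cleaned = re.sub(r"\s+", " ", line).strip()
        if cleaned:
            rows.append(cleaned)
    return rows


def find_rows(rows, keywords):
    """Return (row_index, row) pairs whose text contains every keyword.

    Matching is a case-insensitive substring test, so the row's name or
    number does not need to be known in advance.
    """
    wanted = [k.lower() for k in keywords]
    return [
        (i, row)
        for i, row in enumerate(rows)
        if all(k in row.lower() for k in wanted)
    ]


# Hypothetical sample standing in for text extracted from a PDF.
sample_text = """Interim report

Item A    10   20
Item B    30   40
"""

rows = text_to_rows(sample_text)
matches = find_rows(rows, ["item b"])
print(matches)
```

In KNIME the same idea would roughly correspond to splitting the parsed text on line breaks and then filtering rows with a pattern match, but the exact nodes depend on the workflow.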

gonhaddock · October 2, 2023, 10:55am

Hello @Ami1 and welcome to the KNIME community

Your request is very generic. Please check the following workflow; it can be a starting point for you:

To get the right support from the community, it would be very useful if you could provide some more details:

What your input looks like, even a sample if possible
A description of the expected output
What you have already tested, and the results

BR

Ami1 · October 2, 2023, 1:37pm

@gonhaddock Thanks very much. I’ll try that. I’m attaching a sample input file and an image. I’d like to be able to get the highlighted figures and words (in the image) from the PDF. They are in the table on page 47. In the other PDFs I’ll be using this for, I’ll know the words I’m looking for, but not where they are in the PDF, which row they are in, or which page the table is on.
I tried starting with the Tika Parser / PDF Parser followed by various row/column filtering nodes, and then nodes suggested by the Workflow Coach, but didn’t get the right result. I’m quite new to KNIME and really appreciate your help and any help from the forum. Thanks!

BR,
Ami

Interim-Report-as-of-June-30-2023.pdf (2.0 MB)

gonhaddock · October 3, 2023, 7:12am

Hello @Ami1
Thanks for the example you provided. I can offer you another example workflow that shows how to extract table data from a PDF file.

Unfortunately, I won’t have much free KNIME time these days to work on customized forum solutions. I am sure that other KNIME forum members can help you make progress if needed.

Happy KNIM(E)ing

Daniel_Weikert · October 3, 2023, 3:34pm

@Ami1
Just one question. Many companies also provide their financial statements as an Excel download, or you can use financial websites for that. Have you tried that already?
br

Ami1 · October 3, 2023, 4:57pm

Hi @gonhaddock, Thanks, will try this out.
Much appreciated!

Ami1 · October 3, 2023, 5:01pm

Hi @Daniel_Weikert,
Thanks for the question. In this instance I’m trying to provide a service to an entity that only has this type of PDF document available. I’m looking for a solution that lets me perform this kind of action on this kind of document on an ongoing basis.
Best,
Ami

mlauber71 · October 4, 2023, 5:42am

@Ami1 the solutions to KNIME challenge 015 might indeed be a starting point. Then there is an approach using an R package:

You can search for a word in a PDF file and extract the location:
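
As a rough Python illustration of searching for a word and reporting where it occurs, the sketch below assumes the PDF has already been split into one text string per page by whatever extractor is in use. The function name `locate_word` and the sample pages are hypothetical.

```python
def locate_word(pages, word):
    """Return (page_number, line_number, line) for each line containing `word`.

    `pages` is a list of strings, one per PDF page. Page and line numbers
    are 1-based; matching is a case-insensitive substring test.
    """
    needle = word.lower()
    hits = []
    for page_no, page_text in enumerate(pages, start=1):
        for line_no, line in enumerate(page_text.splitlines(), start=1):
            if needle in line.lower():
                hits.append((page_no, line_no, line.strip()))
    return hits


# Hypothetical per-page text standing in for real extractor output.
sample_pages = [
    "Cover page\nContents",
    "Notes\nTotal assets  100  90\nOther",
]

hits = locate_word(sample_pages, "total assets")
print(hits)
```

Because the page number comes back with every hit, the table’s page does not have to be known beforehand; it can be fed into a table extraction step afterwards.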

To extract tables, there is the option of using the R package tabulizer:

Or you can try the Python package camelot:
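
A hedged sketch of how camelot could be combined with a keyword search is shown below. It assumes camelot and its dependencies are installed and that the PDF is text-based rather than scanned; the helper names `rows_with_keyword` and `search_pdf_tables`, the choice of the `stream` flavor, and the demo rows are assumptions for illustration only.

```python
def rows_with_keyword(table_rows, keyword):
    """Keep the rows (lists of cells) in which any cell contains `keyword`."""
    needle = keyword.lower()
    return [
        row
        for row in table_rows
        if any(needle in str(cell).lower() for cell in row)
    ]


def search_pdf_tables(pdf_path, keyword, pages="all"):
    """Extract tables with camelot and return the rows mentioning `keyword`.

    camelot is imported lazily so the pure helper above can be used
    without it. flavor="stream" is an assumption that suits tables
    without ruling lines; "lattice" may work better for ruled tables.
    """
    import camelot  # third-party package, must be installed separately

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="stream")
    matches = []
    for table in tables:
        # table.df is a pandas DataFrame holding the extracted cells.
        matches.extend(rows_with_keyword(table.df.values.tolist(), keyword))
    return matches


# Hypothetical rows standing in for one extracted table.
demo_rows = [
    ["Item", "2023", "2022"],
    ["Revenue", "5", "4"],
    ["Costs", "3", "2"],
]
demo = rows_with_keyword(demo_rows, "revenue")
print(demo)
```

With a real file, a call such as `search_pdf_tables("report.pdf", "revenue", pages="47")` would restrict the search to one page; the file name here is a placeholder. Extraction quality varies a lot between PDFs, so the result should be checked against the source document.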

Ami1 · October 5, 2023, 1:22pm

@mlauber71 thank you so much for all this great information! Will try it!
BR,
Ami
