Extract information from multiple PDF files

Hello, I’m looking for help extracting information from PDF files. I have over 100 PDF files. The aim is to extract the necessary information and then group it according to the headers. The information in the PDFs is not in a tabular format. In the past, I extracted data from tables using the Tika Parser and PDF Parser, but the data in my current PDFs is structured differently, so that experience is not helping me work out how to do it. Each PDF relates to a single location and can range from one to three pages (i.e. 100 PDFs correspond to information for 100 locations, and the number of pages in each PDF varies). As I can’t upload a PDF, I am attaching the data in Word format. The attached output file shows roughly the result I am looking for. Please help.
Output_Test_SB.xlsx (9.4 KB)

Test_SB.docx (84.0 KB)

Hi @Sbhandary -

Data extraction from PDFs can definitely be a tricky task. Let me point you to some other threads where this type of analysis is addressed in more detail.

The Data Connect recording that Victor links in the second post might be of particular interest.
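For anyone who prefers a scripted route, here is a minimal Python sketch of the "group by headers" step. It assumes the text has already been pulled out of each PDF (for example with a Tika-based parser) and that each header sits at the start of its own line. The header names `Location`, `Description`, and `Remarks` and the function `group_by_headers` are hypothetical placeholders, not taken from the attached files, so swap in the headers that actually appear in your documents.

```python
import re

# Hypothetical header names; replace with the headers found in your PDFs.
HEADERS = ["Location", "Description", "Remarks"]


def group_by_headers(text, headers=HEADERS):
    """Return {header: text under that header} for one document's plain text.

    A line that starts with a known header opens a new section; the lines
    that follow are appended to that section until the next header appears.
    """
    # Longest headers first so a short header never shadows a longer one.
    ordered = sorted(headers, key=len, reverse=True)
    pattern = re.compile(
        r"^(%s)\s*:?\s*(.*)$" % "|".join(re.escape(h) for h in ordered),
        re.IGNORECASE,
    )
    canonical = {h.lower(): h for h in headers}
    record = {h: "" for h in headers}
    current = None

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        match = pattern.match(line)
        if match:
            current = canonical[match.group(1).lower()]
            line = match.group(2)  # text on the same line as the header
        if current is not None and line:
            record[current] = (record[current] + " " + line).strip()
    return record


if __name__ == "__main__":
    sample = "Location: Site A\nDescription\nFirst line\nsecond line\n\nRemarks: none\n"
    print(group_by_headers(sample))
```

Running this once per PDF gives one dictionary per location, and those dictionaries can then be written out as rows of a table with one column per header. A body line that happens to begin with a header word would be misread as a new section, so the pattern may need tightening for real documents.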

Hi Scott, thanks for directing me to the specific links to the workflow. I will look into it and get back to you if I have any questions.
