Hello, I’m looking for assistance in extracting information from PDF files. I have over 100 PDF files. The aim is to extract the necessary information and then group it according to the headers. The information in the PDFs is not in a tabular format. In the past, I extracted data from tables using the Tika parser and a PDF parser, but the data in my current PDFs is structured differently, so that experience is not helping me work out how to do it.

Each PDF relates to a single location and can range from one to three pages (i.e., 100 PDFs correspond to 100 locations, and the number of pages in each PDF varies). As I can’t upload a PDF, I am attaching the data in Word format. The attached output file is a rough version of the result I am looking for. Please help.

Output_Test_SB.xlsx (9.4 KB) Test_SB.docx (84.0 KB)

Extract information from a multiple pdfs PDF file

ScottF February 15, 2023, 9:18pm 2

Data extraction from PDFs can definitely be a tricky task. Let me point you to some other threads where this type of analysis is addressed in more detail.

The Data Connect recording that Victor links in the second post might be of particular interest.

Data Extraction From PDFs Containing OCR image.
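
For the grouping step described in the question, here is a minimal sketch in Python. It assumes the text of each PDF has already been extracted to a plain string (for example with the Tika parser mentioned above) and that every header sits at the start of a line, either alone or followed by a colon. The header names in `HEADERS` and the function names `group_by_headers` and `to_csv` are hypothetical, made up for illustration; replace the headers with the ones that actually appear in your files.

```python
import csv
import io
import re

# Hypothetical header names; replace with the headings found in your PDFs.
HEADERS = ["Location", "Description", "Contact"]


def group_by_headers(text, headers):
    """Collect the text that follows each header line into one string per header."""
    alternatives = "|".join(re.escape(h) for h in headers)
    # A header line is the header alone, or the header followed by ':' and content.
    pattern = re.compile(r"^\s*(" + alternatives + r")\s*(?::\s*(.*))?$", re.IGNORECASE)
    canonical = {h.lower(): h for h in headers}
    chunks = {h: [] for h in headers}
    current = None
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            # Start a new section; keep any content on the header line itself.
            current = canonical[match.group(1).lower()]
            rest = (match.group(2) or "").strip()
            if rest:
                chunks[current].append(rest)
        elif current is not None and line.strip():
            # Non-empty line belonging to the most recent header.
            chunks[current].append(line.strip())
    return {h: " ".join(parts) for h, parts in chunks.items()}


def to_csv(records, headers):
    """Build CSV text with one row per file; `records` maps file name to grouped dict."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["file"] + list(headers))
    writer.writeheader()
    for name, record in records.items():
        writer.writerow({"file": name, **record})
    return buffer.getvalue()


if __name__ == "__main__":
    # Stand-in for text extracted from one PDF (all pages concatenated).
    sample = "Location: Site A\nDescription:\nFirst line\nsecond line\nContact\nJane Doe\n"
    print(to_csv({"example.pdf": group_by_headers(sample, HEADERS)}, HEADERS))
```

Because each PDF covers one location, concatenating the text of all its pages before calling `group_by_headers` gives one row per file regardless of page count. The resulting CSV opens directly in Excel. Text before the first header is ignored, and a header repeated on a later page simply appends to the same column.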