Parsing PDF with irregular table

I would like to parse a PDF file that contains data in three columns. When I use the Cell Splitter node and split on each line, text that would be "wrapped" within a single cell in Excel gets broken into separate rows.

For example:
The PDF does not produce a clean table, so I cannot simply split on each new line, as seen below:
image

How do I split it so I can get:
image

I think ideally I would like to split by the section number.
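To make the idea concrete, here is a minimal Python sketch of a split by section number. It assumes the PDF text has already been extracted to a plain string, and that a section number looks like `1`, `1.2` or `1.2.3` at the start of a line; both the pattern and the function name `split_by_section` are illustrative assumptions, not something taken from the actual file. Any line that does not start with a section number is treated as wrapped text and appended to the previous record.

```python
import re

# Assumption: a section number such as "1", "1.2" or "1.2.3" (optionally
# followed by a dot) starts the line and is followed by whitespace.
SECTION_START = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(.*)$")


def split_by_section(text):
    """Return a list of (section_number, body) tuples.

    Lines without a leading section number are treated as wrapped text
    and joined onto the body of the most recent section.
    """
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = SECTION_START.match(line)
        if match:
            # A new section starts here.
            records.append([match.group(1), match.group(2).strip()])
        elif records:
            # Continuation of the previous section (wrapped line).
            records[-1][1] += " " + line.strip()
        # Lines before the first section number are ignored.
    return [tuple(record) for record in records]


sample = (
    "1.1 First item that\n"
    "wraps onto a second line\n"
    "1.2 Second item\n"
    "\n"
    "2. Third item\n"
    "continues here"
)
for number, body in split_by_section(sample):
    print(number, "|", body)
```

One caveat with this approach: a wrapped line that happens to begin with a number (for example a year) would be mistaken for a new section, so the pattern would likely need tightening against the real data.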

I appreciate any help.

Current methodology:

@mmays maybe you can take a look at this entry and links.

Also, could you upload an example to explore? I recently had some Python code that tried to extract tables, but I was not satisfied with the result. I might take another look. There are several Python packages that can extract tables from PDFs.
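As a rough sketch of the post-processing such packages usually need: table extractors tend to return each table as a list of rows, where a wrapped line shows up as an extra row with an empty first cell (I believe pdfplumber's `page.extract_tables()` returns data in roughly this shape, with `None` for empty cells, but treat that as an assumption). The helper below, with the hypothetical name `merge_wrapped_rows`, folds those continuation rows back into the row above. It assumes all rows have the same number of columns and that the key column (the section number) is only filled on the first line of each entry.

```python
def merge_wrapped_rows(rows, key_col=0):
    """Merge continuation rows into the preceding row.

    A row counts as a continuation when its key column is empty or None.
    Its non-empty cells are appended, space-separated, to the matching
    cells of the previous row.
    """
    merged = []
    for row in rows:
        # Normalise None to "" and trim surrounding whitespace.
        cells = [(cell or "").strip() for cell in row]
        if cells[key_col] or not merged:
            # Key column is filled (or nothing to merge into): new row.
            merged.append(cells)
        else:
            previous = merged[-1]
            for i, cell in enumerate(cells):
                if cell:
                    previous[i] = (previous[i] + " " + cell).strip()
    return merged


# Made-up example data in the assumed shape.
rows = [
    ["1.1", "Check valves", "Monthly"],
    [None, "and fittings", ""],
    ["1.2", "Inspect pump", "Weekly"],
    ["", "", "or after repair"],
]
for row in merge_wrapped_rows(rows):
    print(row)
```

The same logic could be moved into a Python Script node once the PDF text has been split into columns.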

I took a look at the links but I still cannot seem to get it to work. In the attached workflow I am close with the Regex split but I still run into the issue of the wrapped text.

AOOFUS.knwf (109.6 KB)

Any updates on this topic? I am also very close in Python, but still not getting quite the results I want.

@mmays not quite. Also, the example does not contain any PDF file.
