Parsing PDF with irregular table

mmays · January 13, 2023, 10:24pm

I would like to parse a PDF file that contains data in three columns. When I use the Cell Splitter node and split on each line break, text that would be "wrapped" within a single cell in Excel gets broken into separate rows.

For example:
The PDF does not produce a clean table, so I cannot simply split on each new line, as seen below.

How do I split it so I can get:

I think ideally I would like to split by the section number.

I appreciate any help.

Current methodology:

mlauber71 · January 14, 2023, 10:03am

@mmays maybe you can take a look at this entry and the links.

Also, could you upload an example to explore? I recently tried some Python code to extract tables but was not satisfied with the result. I might take another look. There are several Python packages that can extract tables from PDF files.

mmays · January 18, 2023, 8:10pm

I took a look at the links but I still cannot seem to get it to work. In the attached workflow I am close with the Regex split but I still run into the issue of the wrapped text.

AOOFUS.knwf (109.6 KB)
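
For the wrapped-text problem, one possible approach is to treat every line that starts with a section number as the beginning of a new record and to append every other line to the previous record. Below is a minimal Python sketch of that idea, using only the standard library. It assumes the text has already been extracted from the PDF, and that section numbers look like `1`, `1.2` or `1.2.3` at the start of a line; the actual numbering in the PDF is not shown in this thread, so the pattern, the function name `merge_wrapped_lines` and the sample text are all hypothetical and would need adjusting.

```python
import re

# Assumption: a section number is digits separated by dots ("1", "1.2", "1.2.3"),
# optionally followed by a trailing dot, at the start of a line.
SECTION_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(.*)$")


def merge_wrapped_lines(text):
    """Return (section_number, text) tuples, joining wrapped continuation lines."""
    records = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        match = SECTION_RE.match(stripped)
        if match:
            # A line starting with a section number opens a new record.
            records.append([match.group(1), match.group(2).strip()])
        elif records:
            # Any other line is treated as wrapped text of the previous record.
            records[-1][1] += " " + stripped
        # Lines before the first section number are ignored.
    return [tuple(record) for record in records]


# Hypothetical sample text, only to illustrate the behaviour.
sample = """1.1 Scope of the agreement that
wraps onto a second line
1.2 Term
2 Payment terms that also
wrap
"""

rows = merge_wrapped_lines(sample)
for number, body in rows:
    print(number, "|", body)
```

Note that a wrapped line which itself happens to begin with a number would be mistaken for a new section, so the pattern may need to be stricter for real data. The same pattern could probably also be reused in a regex-based split inside the workflow, but I have not tested that.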

mmays · January 26, 2023, 6:36pm

Any updates on this topic? I am also very close in Python, but the results are still not quite what I want.

mlauber71 · January 26, 2023, 7:00pm

@mmays not quite. Also, the example does not contain any PDF file.
