Parsing Sections of PDF File Separately

eroma934 · February 25, 2014, 10:35pm

All,

I was wondering if it is possible to parse sections of a PDF and not the whole document at once. I have a set of files that include sections for name, date, state, and then a narrative section. Is it possible to parse each section individually?

Thanks in advance for any suggestions!

kilian.thiel · February 27, 2014, 11:49am

Hi Eroma,

The PDF Parser node does not recognize sections. To extract only certain sections of a PDF, you could cut them out of the PDF manually (http://www.pdfsam.org/), which is of course impractical if you have a lot of PDF files, or try to recognize the section in the text (string) of the document. To do that, parse the complete PDF, extract the text, then identify and extract the section from the string. Attached is an example workflow showing how this could work.
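Outside of KNIME, the "identify and extract the section from the string" step could look roughly like the Python sketch below. It is not taken from the attached workflow. It assumes the text has already been extracted from the PDF and that each section starts on its own line with a label followed by a colon; the labels, the `split_sections` helper, and the sample text are all made up for illustration.

```python
import re

# Hypothetical section labels; adjust them to the headings in your own files.
LABELS = ["Name", "Date", "State", "Narrative"]


def split_sections(text, labels=LABELS):
    """Return {label: section text} for every label found in text."""
    # Match any of the labels at the start of a line, followed by a colon.
    pattern = re.compile(
        r"^(%s)\s*:\s*" % "|".join(re.escape(label) for label in labels),
        re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs from the end of its label to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.end():end].strip()
    return sections


# Made-up text standing in for the string extracted from one PDF.
sample = (
    "Name: Jane Doe\n"
    "Date: 2014-02-25\n"
    "State: NY\n"
    "Narrative: The first line of the narrative.\n"
    "It continues on a second line.\n"
)

sections = split_sections(sample)
print(sections["Narrative"])
```

Each section can then be processed on its own. Note that a narrative line that happens to start with one of the labels would be split off as a new section, so the pattern may need tightening for real documents.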

Cheers, Kilian

sectiondetection.zip

eroma934 · March 1, 2014, 9:54pm

Thank you! I will look into this.

madegomez · April 8, 2018, 7:41pm

Hello!
I am interested in extracting only certain parts of a PDF document … I followed the instructions given, but they do not solve my particular problem. I would like to analyze the attached document and extract only the sections of text that I highlighted in yellow.

I tried to use the Document Data Extractor and String Replacer nodes, but I think my problem is in configuring the String Replacer node. Could you help me configure this node to extract the information that I highlighted in the attached document?

Thanks in advance,

Manuela
Art.zip (755.9 KB)

kilian.thiel · April 10, 2018, 6:55am

Hi,

Extracting specific fields or sections from a PDF is not really possible, or only with workarounds, e.g. identifying the words before and after the section and extracting it based on these marker words. In your case I suggest converting the PDF into an HTML file and extracting the section you want from that. This is probably easier than trying to tease it out of the PDF. Free command-line tools are available to convert PDF to HTML. To read the HTML (or XML), I recommend the XML Reader node and XPath for field extraction.
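As a rough illustration of the XPath idea outside of KNIME, here is a small Python sketch using only the standard library. The markup and the `class="abstract"` attribute are invented; what a real PDF-to-HTML converter produces will look different, and the output has to be well-formed XML (XHTML) for this parser to accept it. `xml.etree.ElementTree` also supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Made-up converter output; real tools produce different markup.
html = """<html><body>
<h2>Abstract</h2>
<p class="abstract">This is the part we want.</p>
<h2>Introduction</h2>
<p>Not this one.</p>
</body></html>"""

root = ET.fromstring(html)

# Select every <p> element whose class attribute is exactly "abstract".
# itertext() also collects text nested inside inline tags such as <b> or <i>.
wanted = ["".join(p.itertext()) for p in root.findall(".//p[@class='abstract']")]
print(wanted)  # ['This is the part we want.']
```

If the converted file declares an XML namespace, as XHTML usually does, the element names in the path would need the namespace as well.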

Cheers, Kilian

madegomez · April 19, 2018, 12:44pm

Thanks for the clarification and your proposal!

Manuela
