I was wondering if it is possible to parse sections of a PDF and not the whole document at once. I have a set of files that include sections for name, date, state, and then a narrative section. Is it possible to parse each section individually?
The PDF Parser node does not recognize sections. To extract only certain sections of a PDF, you could cut them out of the file manually (http://www.pdfsam.org/), which is of course impractical if you have a lot of PDF files, or you could try to recognize the section in the text (string) of the document. For the second approach, parse the complete PDF, extract the text, then identify and extract the section from the string. Attached you will find an example workflow showing how this could work.
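Outside of KNIME, the "identify the section in the string" step could be sketched roughly like this in Python. This is only a minimal illustration, assuming the text has already been extracted from the PDF and that each section starts with its label (Name, Date, State, Narrative) in that spelling; the function name `extract_sections` and the sample text are made up for the example.

```python
import re

def extract_sections(text, labels):
    """Return {label: section text}; each section runs from its label to the next label found."""
    hits = []
    for label in labels:
        # First occurrence of the label as a whole word, with an optional colon after it.
        m = re.search(r"\b" + re.escape(label) + r"\b\s*:?", text, flags=re.IGNORECASE)
        if m:
            hits.append((m.start(), m.end(), label))
    hits.sort()  # order the labels by where they appear in the text

    sections = {}
    for i, (start, end, label) in enumerate(hits):
        # A section ends where the next label begins, or at the end of the text.
        stop = hits[i + 1][0] if i + 1 < len(hits) else len(text)
        sections[label] = text[end:stop].strip()
    return sections

# Hypothetical sample standing in for the text extracted from one PDF.
sample = (
    "Name: Jane Doe\n"
    "Date: 2015-03-01\n"
    "State: Ohio\n"
    "Narrative: The client arrived late.\n"
    "Follow-up was scheduled."
)
sections = extract_sections(sample, ["Name", "Date", "State", "Narrative"])
print(sections["Narrative"])
```

Note that this only looks at the first occurrence of each label, so it would break if a label word also shows up earlier in the document. In that case the markers would need to be more specific.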
Hello!
I am interested in extracting only certain parts of a PDF document … I followed the instructions given, but they do not solve my particular problem. I would like to analyze the attached document and extract only the sections of text that I highlighted in yellow.
I tried to use the Document Data Extractor and String Replacer nodes, but I think my problem is in configuring the String Replacer node. Could you help me configure this node to extract the information that I highlighted in the attached document?
Extracting specific fields or sections from a PDF is not really possible, or only with workarounds, e.g. identifying the words before and after the section and extracting it based on these marker words. In your case I suggest converting the PDF into an HTML file and trying to extract the section you want from that. This is probably easier than trying to tease it out of the PDF text. There are free console tools available to convert PDF to HTML. To read the HTML (or XML) I recommend the XML Reader node, with XPath for the field extraction.
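To give an idea of what the XPath step looks like, here is a small Python sketch using only the standard library. It assumes the converted file is well-formed XML (XHTML) and that the wanted section can be identified by an attribute; the `class="narrative"` markup and the helper name `extract_by_xpath` are invented for the example, since the actual structure depends on the conversion tool you use.

```python
import xml.etree.ElementTree as ET

# Hypothetical converter output; real output will have a different structure.
html = """<html><body>
<p class="name">Jane Doe</p>
<p class="narrative">The client arrived <b>late</b>.</p>
</body></html>"""

def extract_by_xpath(markup, xpath):
    """Return the text of every element matched by the XPath expression."""
    root = ET.fromstring(markup)  # requires well-formed XML
    # itertext() also collects text inside nested tags such as <b>.
    return ["".join(el.itertext()).strip() for el in root.findall(xpath)]

narrative = extract_by_xpath(html, ".//p[@class='narrative']")
print(narrative)
```

ElementTree supports only a subset of XPath, and HTML produced by converters is often not well-formed, so a more tolerant parser may be needed in practice. The same kind of XPath expression should carry over to the XPath node in KNIME once the markup of your converted file is known.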