I have an example document. I’d like to retrieve the phone call and SMS information from the document and process it later with Spacy.
I have split the multipage (up to 100) PDF into several main blocks.
The information at the beginning of the document (header and heading data within red frame) is not relevant until the first Technical Number.
The most important blocks (within blue frame) run from Technical Number (in Hungarian “Technikai szám”) to the next Technical number. The Technical number is marked with blue oval.
Within the main blocks, there are one, or more sub-blocks. These are in green frames. At the beginning of the green frames there is the Phone time.
The bold text and the rest of the sub-block are no longer required, but I can’t find a way to remove the unnecessary part.
Because the text is not structured, I do not know how many lines the block contains and where the extra line is added to the end.
Unfortunately it cannot be specified to delete or trim the last 10 lines. Or I don’t know the first line of the text that’s written from the beginning of the block.
I have attached the process and the sample PDF included in a Word document.
Thank you for your advice. After your answer I try use your workflow, but it was to difficoutl for me.
Unfortunately I don’t know the R language at all. In R I need a step by step help in my oppinion.
I placed your PDF file in the /data/ folder of the node and searched for it in order to create the paths. If you have more than one PDF they would all be processed. The paths are also fed to R so no need to enter them manually. You also do not need much R knowledge just set it up (Data separate and melting - #9 by mlauber71, How to read multiple lines from PDF File - #12 by mlauber71). If you want to use some advanced tools a little setup efforts will help you a great deal and give you further skills and options - I may add an environment propagation later.
the top loop would simply extract all text from the PDF mark the pages where it is from and save the content in a CSV file and a KNIME table so you could later do some extractions.
The part on the left would take a search word (“A jogosult ügy”) and list all the pages and lines where you could find the word and keep the lines. If you then want to extract further information from this key word (behind etc.) you will also be able to do that - I think there have been other threads handling this.