Unstructured text mining from PDF

Dear KNIME members!

I have an example document. I’d like to retrieve the phone call and SMS information from the document and process it later with spaCy.
I have split the multipage PDF (up to 100 pages) into several main blocks.
The information at the beginning of the document (the header and heading data within the red frame) is not relevant until the first Technical Number.
The most important blocks (within the blue frames) run from one Technical Number (in Hungarian, “Technikai szám”) to the next. The Technical Number is marked with a blue oval.

Within the main blocks there are one or more sub-blocks, shown in green frames. At the beginning of each green frame there is the Phone time.

The bold text and the rest of the sub-block are no longer required, but I can’t find a way to remove this unnecessary part.

Because the text is not structured, I do not know how many lines each block contains or where an extra line is added at the end.
Unfortunately, I cannot simply delete or trim the last 10 lines, and I do not know which line marks the start of the text at the beginning of the block.
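Since the block boundaries are only defined by the “Technikai szám” marker and not by a fixed number of lines, splitting on the marker itself avoids the line-counting problem entirely. Here is a minimal Python sketch of that idea (not part of the KNIME workflow; the sample text and marker format are assumptions):

```python
import re

# Sample text standing in for the extracted PDF content (assumed structure).
text = """Header line 1
Header line 2
Technikai szám: 001
call data A
Technikai szám: 002
call data B
call data C
"""

# Split on a zero-width lookahead so each block keeps its own marker line.
# Everything before the first "Technikai szám" is the irrelevant header.
parts = re.split(r"(?=Technikai szám)", text)
header, blocks = parts[0], parts[1:]

for block in blocks:
    print(block.splitlines()[0])  # first line of each main block
```

Each element of `blocks` then holds one complete main block, however many lines it happens to contain.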

I have attached the process and the sample PDF included in a Word document.

20221027_split_part_of_text.knwf (39.6 KB)
Word_which_insert_PDF.docx (325.4 KB)

@janszky_nav there was a similar question on the forum where a text string within a PDF was identified with the help of R. Maybe you could take a look:

If the information is structured within a table, there might also be a solution using R/tabulizer.


Dear mlauber71,

Thank you for your advice. After your answer I tried to use your workflow, but it was too difficult for me.
Unfortunately, I don’t know the R language at all; in my opinion I would need step-by-step help with R.

But I would appreciate it if you could give me some help.

Thank you.

@janszky_nav I would suggest you toy around with the workflow, read the comments, and follow the links. I was not able to extract the PDF from the Word file, but I used the file you posted earlier (Text mining from PDF documents and results places - #6 by janszky_nav).

I placed your PDF file in the /data/ folder of the workflow and searched for it in order to create the paths. If you have more than one PDF, they would all be processed. The paths are also fed to R, so there is no need to enter them manually. You also do not need much R knowledge; just set it up (Data separate and melting - #9 by mlauber71, How to read multiple lines from PDF File - #12 by mlauber71). If you want to use some advanced tools, a little setup effort will help you a great deal and give you further skills and options :slight_smile: - I may add an environment propagation later.

The top loop would simply extract all the text from the PDF, mark which page each line comes from, and save the content in a CSV file and a KNIME table so you could later do some extractions.
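As a rough illustration of what that loop produces (the real workflow does this in KNIME/R; this Python sketch simulates already-extracted page texts rather than reading a PDF):

```python
import csv
import io

# Simulated per-page texts, as a PDF extractor would return them (assumed data).
pages = [
    "Technikai szám: 001\ncall data A",
    "call data B\nTechnikai szám: 002",
]

# Mark every line with the page it came from.
rows = []
for page_no, page_text in enumerate(pages, start=1):
    for line in page_text.splitlines():
        rows.append({"page": page_no, "line": line})

# Write the marked lines to CSV (a string buffer here; a file in practice).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["page", "line"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

The resulting table (one row per line, with its page number) is what makes later extractions straightforward.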

The part on the left takes a search word (“A jogosult ügy”) and lists all the pages and lines where the word can be found, keeping those lines. If you then want to extract further information around this keyword (the text behind it, etc.), you will also be able to do that - I think there have been other threads handling this.
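Outside KNIME, that page-and-line search could be sketched like this in Python (the search term comes from the post; the page contents are made-up placeholders):

```python
# Simulated per-page texts (assumed data).
pages = ["foo\nA jogosult ügy: X", "bar", "A jogosult ügy: Y\nbaz"]
search = "A jogosult ügy"

# Collect every (page, line_no, line) where the search word occurs.
hits = [
    (page_no, line_no, line)
    for page_no, text in enumerate(pages, start=1)
    for line_no, line in enumerate(text.splitlines(), start=1)
    if search in line
]

for page_no, line_no, line in hits:
    print(f"page {page_no}, line {line_no}: {line}")
```

Keeping the page and line numbers alongside each hit is what lets you later grab the text behind the keyword, or the surrounding lines.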

You can download the workflow here (I might add some things later):

That is it for now. I have to go. Maybe you can check it out and we can continue this later :slight_smile:
