PDF TIKA / PARSER - Extract data

Brain · August 16, 2024, 10:56am

Hi,
Here is a PDF
20240813DEA01_last_times_fr.pdf (177.1 KB)

from which I would like to extract only the highlighted information on PAGE 1

The idea is to end up with a table like this:
TABLEAU FINAL.xlsx (8.5 KB)
My plan is, of course, to use the Tika Parser or PDF Parser node, but the problem starts afterward. If anyone can help…
Thanks
Br

mlauber71 · August 16, 2024, 1:12pm

@Brain you can take a look at this example and try adapting it. It uses three approaches (Tabula, Camelot and PDFPlumber) to extract tables from your PDF file.

With the help of PyMuPDF you can extract individual areas of your PDF file, for example to get the headline and the date. You will have to define the areas so that they capture enough of the text but no unwanted context.
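As a rough illustration of the area idea, here is a minimal sketch that assumes PyMuPDF is installed and importable as `fitz`. The helper names `fractions_to_box` and `extract_area_text` are made up for this example, and the area is given as fractions of the page size, which is only one possible convention and not necessarily what the linked workflow does.

```python
def fractions_to_box(page_width, page_height, area):
    """Turn (left, top, right, bottom) fractions of the page into point coordinates."""
    left, top, right, bottom = area
    return (left * page_width, top * page_height,
            right * page_width, bottom * page_height)


def extract_area_text(pdf_path, area, page_number=0):
    """Return the text found inside one rectangular area of one page."""
    import fitz  # PyMuPDF, imported here so the helper above works without it

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        box = fractions_to_box(page.rect.width, page.rect.height, area)
        # clip restricts the extraction to the given rectangle
        return page.get_text("text", clip=fitz.Rect(*box)).strip()
```

A call such as `extract_area_text("20240813DEA01_last_times_fr.pdf", (0.0, 0.0, 1.0, 0.15))` would then return whatever text sits in the top 15% of page 1. The fractions are placeholders and would have to be tuned to the actual layout.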

You will have to do some cleaning up and combining. You could also compare the results of several approaches and only use the values they agree on, or something similar.
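One way to picture the "do they agree" check is a simple majority vote. The sketch below assumes each extractor's result has already been converted to a list of rows (lists of cells). The function names and the `min_votes` threshold are illustrative choices, not part of any of the packages.

```python
from collections import Counter


def normalize(table):
    """Strip whitespace and turn empty cells into '' so tables can be compared."""
    return tuple(
        tuple("" if cell is None else str(cell).strip() for cell in row)
        for row in table
    )


def consensus_table(candidates, min_votes=2):
    """Return the table that at least min_votes extractors agree on, else None.

    candidates maps an extractor name to its table (a list of rows).
    """
    votes = Counter(normalize(table) for table in candidates.values())
    if not votes:
        return None
    table, count = votes.most_common(1)[0]
    if count < min_votes:
        return None
    return [list(row) for row in table]
```

If no two extractors produce the same table, the function returns `None`, which is a signal to inspect the page by hand or to compare cell by cell instead of table by table.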

The Tika Parser is used to demonstrate how to extract all the images that are contained in the PDF, just to see if it works …

More on the logic behind the packages in this article:

rfeigel · August 17, 2024, 2:02am

What do you mean by "the problem starts afterward"? Do you have a model you can share?

PBJ · August 17, 2024, 10:35am

If the PDF files are not digital, you can use the “Tika Parser node” to extract the embedded pictures into a folder (if Tika recognizes these pictures; otherwise you can use poppler, called through the “External Tool node”, to split the PDF files into pictures) and then use the “Tess4J node” to run OCR on each picture and get digital content.

If the PDF files are digital, you can use Tika to extract the digital content directly.

With this digital content, you can use the “Regex Extractor node” to extract the columns you want.
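The same regex idea can be sketched in Python. The sample text, the "Report date" label and both patterns below are invented for illustration, since the real layout of the PDF text is not shown in this thread. The patterns would need to be adapted to what the parser actually returns.

```python
import re

# Hypothetical text, standing in for what a PDF parser might return.
SAMPLE_TEXT = """Report date: 13/08/2024
Item A 12.5
Item B 7.0
"""

# One pattern for a value that appears once, one for the repeating rows.
DATE_RE = re.compile(r"Report date:\s*(\d{2}/\d{2}/\d{4})")
ROW_RE = re.compile(
    r"^(?P<name>.+?)[ \t]+(?P<value>\d+(?:\.\d+)?)$", re.MULTILINE
)


def extract_rows(text):
    """Return one dict per row, each carrying the single report date."""
    date_match = DATE_RE.search(text)
    report_date = date_match.group(1) if date_match else None
    return [
        {
            "date": report_date,
            "name": match.group("name"),
            "value": float(match.group("value")),
        }
        for match in ROW_RE.finditer(text)
    ]
```

Each dict corresponds to one row of the final table, with the named groups playing the role of the columns.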

Brain · August 17, 2024, 10:53am

Hello,

I downloaded the workflow as well as Miniforge. In the Python Script node, I keep getting "No module named ..." errors.
Do I need to download packages like Camelot, Tabula, etc., somewhere?
Thanks for your help
Br

mlauber71 · August 17, 2024, 12:08pm

@Brain you will have to install the necessary packages following the YAML file in the /data/ subfolder and the instructions in the article. I have not yet been able to adapt the Conda node to Windows as well, as I normally would. It might take a few days.

py3_knime_pdf.yml (3.2 KB)
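To see which packages the Python environment behind the node is still missing, a small check like the following can be run inside the Python Script node. The module list is an assumption based on the packages named in this thread (PyMuPDF is usually imported as `fitz`); the YAML file remains the authoritative list.

```python
import importlib.util

# Import names assumed from the packages mentioned in this thread.
REQUIRED = ["camelot", "tabula", "pdfplumber", "fitz"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

If the list is not empty, the node is most likely pointing at an environment other than the one created from the YAML file, or that environment was not created completely.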

mlauber71 · August 19, 2024, 7:27am

@Brain you might want to try again. A Windows environment is now included in the workflow and the component.

system · November 17, 2024, 7:28am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
