Extract information from a PDF file

Stephane73 · August 29, 2020, 7:41pm

Hi,

I would like to extract information from a PDF file, but I don’t know how to do it.

I tried to use existing workflows, but I don’t understand how they work.

So if you could help me with this, that would be great.

Thank you.

elsamuel · August 29, 2020, 9:55pm

It would be much more helpful if you were more specific about what you’re trying to accomplish, which existing workflows you tried, and what you didn’t understand.

Stephane73 · August 29, 2020, 11:09pm

Thank you for your reply. You’re right, I need to be more specific.

So, I would like to extract information from a PDF file.

For example, it could be a movie brochure (in PDF format) from which I could extract the movie’s name, actors, director, release …

Or an announcement of a real estate project, where it would be interesting to extract the location, the cost, the real estate developers …

The biggest difficulty is identifying where these pieces of information are and what each of them corresponds to.

I am trying to use the Tika Parser node to read a PDF. It seems to be working, but after that I don’t know how to extract the information. How do I specify the position? It depends on the document and can differ from one to another.

I have already done similar work on a website, and that was easier because class names indicate where each piece of information is. With a PDF, it’s not that easy.

I don’t know if this is clear, so do not hesitate to ask me for clarification. I can also provide an example.

Thank you.

andrejz · August 31, 2020, 5:36am

Hi,

Start with the “PDF Parser” node, then the “Document Data Extractor” node, and maybe a Cell Splitter … it depends on the structure of the document.
If your PDF file does not have a text layer (a PDF that has not been OCRed, or an image), you can try the KNIME Image Processing - Tesseract (OCR) Extension: https://www.knime.com/book/knime-image-processing-tesseract-ocr-extension

Once you have the data in a table, you can extract it with a regex, depending on the structure of the data (ideally something like “Movie title: …”). Can you upload one of these documents?
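
To illustrate the regex idea outside of KNIME, here is a minimal Python sketch. It assumes the parsed text has one “Label: value” pair per line; the sample text and the `extract_fields` helper are made up for illustration, and real brochures will likely need a looser pattern.

```python
import re

# Made-up sample of the text a PDF parser might return for a movie brochure.
text = """Movie title: The Example
Director: Jane Doe
Actors: John Smith, Mary Major
Release: 2020-08-29
"""

# One "Label: value" pair per line; the label is everything before the first colon.
FIELD = re.compile(r"^\s*([^:\n]+?)\s*:\s*(.+?)\s*$", re.MULTILINE)

def extract_fields(raw):
    """Return a dict mapping lower-cased labels to their values."""
    return {label.lower(): value for label, value in FIELD.findall(raw)}

fields = extract_fields(text)
print(fields["movie title"])           # The Example
print(fields["actors"].split(", "))    # ['John Smith', 'Mary Major']
```

This only works when the labels are actually printed in the document. If the layout carries the meaning instead (position on the page, font size), a pattern like this will not be enough.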

Stephane73 · August 31, 2020, 3:57pm

Thank you for your answer.

So I started working with an internet service invoice in PDF format.
This invoice has different sections, so I was thinking about creating different buckets like Informations Client, Calls, Internet, TV.

The goal is to summarize the information and analyze it afterwards.
But I have to find the right method to extract the right information.
As I said before, depending on the document, the organization may be different.

For example, here is the PDF which I’m working with.

It’s in Word format because I can’t upload a PDF here.

Invoice.docx (27.5 KB)
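
The bucket idea could look roughly like this in Python. This is only a sketch under assumptions: the text is already extracted, each section starts with its heading on a line of its own, and the heading names are simply the bucket names listed above. The sample text is made up and is not the content of the attached invoice.

```python
# Heading names taken from the buckets mentioned above; the real invoice may
# use different labels, so treat this list as an assumption.
HEADINGS = ["Informations Client", "Calls", "Internet", "TV"]

def split_into_buckets(text, headings=HEADINGS):
    """Group non-empty lines under the most recent heading line seen."""
    lookup = {h.lower(): h for h in headings}
    buckets = {h: [] for h in headings}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        key = lookup.get(stripped.lower())
        if key is not None:
            current = key  # a new section starts here
        elif current is not None:
            buckets[current].append(stripped)
    return buckets

# Made-up sample text, not the content of the attached invoice.
sample = """Informations Client
Name: J. Doe
Calls
Local calls 3.50
Internet
Fiber plan 29.99
TV
Basic package 10.00
"""

buckets = split_into_buckets(sample)
print(buckets["Internet"])  # ['Fiber plan 29.99']
```

Lines that appear before the first heading are dropped here. When the organization differs between documents, the list of headings is the part that would have to change.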

mlauber71 · August 31, 2020, 9:46pm

If it were about tables in a docx file you could have a look at this workflow and accompanying discussion.

There is a good chance there are R (or Python) packages out there that might be able to extract tables from PDF files.
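
As one possible example on the Python side, the third-party `pdfplumber` package can pull tables out of PDFs that have a text layer. The sketch below assumes `pdfplumber` is installed; the helper names `clean_table` and `extract_tables` are made up for illustration, and how well the table detection works depends a lot on the PDF.

```python
def clean_table(rows):
    """Strip whitespace, turn missing cells into "", and drop fully empty rows."""
    cleaned = []
    for row in rows:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned

def extract_tables(pdf_path):
    """Return every table found in the PDF as a list of cleaned rows."""
    import pdfplumber  # third-party: pip install pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() gives one list of rows per detected table;
            # cells are strings or None.
            for table in page.extract_tables():
                tables.append(clean_table(table))
    return tables

# clean_table can be tried without a PDF at hand (made-up rows):
print(clean_table([["Item", " Price "], [None, ""], ["Fiber plan", "29.99"]]))
# [['Item', 'Price'], ['Fiber plan', '29.99']]
```

The resulting rows could then be written to CSV and read back into KNIME, or the function could be called from a Python scripting node.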

system · March 2, 2021, 9:46am

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.