Extract information from a PDF file

Hi,

I would like to extract information from a PDF file, but I don’t know how to do it.

I tried to use existing workflows, but I don’t understand how they work.

So, if you could help me with this, it would be very nice.

Thank you.

It would be much more helpful if you were more specific about what you’re trying to accomplish, what existing workflows you tried, and what you didn’t understand.


Thank you for your reply. You’re right, I need to be more specific.

So, I would like to extract information from a PDF file.

For example, it could be a movie brochure (in PDF format) from which I could extract the name of the movie, the actors, the director, the release date …

Or an announcement of a real estate project. It would be interesting if I could extract the location, the cost, the real estate developers …

The biggest difficulty is identifying where this information is located and what it corresponds to.

I am trying to use the Tika Parser node to read a PDF. It seems to be working, but after that I don’t know how to extract the information. How do I specify the position? It depends on the document; it can differ from one document to another.

I have already done similar work, but it was on a website, and that is easier because there are class names that indicate where the information is. With a PDF, it’s not easy.

I don’t know if this is clear, so do not hesitate to ask me for clarification. I can also provide an example.

Thank you.

Hi,

Start with the “PDF Parser” node, then the “Document Data Extractor” node, and maybe a Cell Splitter, depending on the structure of the document.
If your PDF file does not have a text layer (a non-OCRed PDF or an image), you can try the KNIME Image Processing - Tesseract (OCR) Extension: https://www.knime.com/book/knime-image-processing-tesseract-ocr-extension.
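
Outside of KNIME, a quick way to check whether a file has a text layer at all is to try extracting text and see whether anything comes back. The following is only a rough Python sketch: it assumes the third-party pypdf package is installed, and the `min_chars` threshold and the function names are made up for illustration.

```python
def has_text_layer(page_texts, min_chars=20):
    """Guess whether a PDF has a text layer from its per-page extracted text.

    min_chars is an arbitrary threshold chosen for illustration only.
    """
    total = sum(len((text or "").strip()) for text in page_texts)
    return total >= min_chars


def read_page_texts(pdf_path):
    """Extract the text of each page with pypdf (imported lazily)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty result for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]
```

Calling `has_text_layer(read_page_texts("invoice.pdf"))` (the file name is hypothetical) and getting `False` would suggest the file is a scan and needs OCR first.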

Once you have the data in a table, you can extract the values with regex, depending on the structure of the data (ideally something like “Movie title: …”). Can you upload one of these documents?
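
To illustrate the regex idea, here is a minimal Python sketch. It assumes the parsed text contains one “Label: value” pair per line; the labels and the sample text are invented for illustration and do not come from a real document.

```python
import re

# Hypothetical sample text, standing in for the output of a PDF parser.
SAMPLE_TEXT = """\
Movie title: The Example
Director: Jane Doe
Actors: John Smith, Mary Major
Release: 2019-05-01
"""

# One "Label: value" pair per line; the label is everything before the first colon.
FIELD_PATTERN = re.compile(
    r"^[ \t]*(?P<label>[^:\n]+?)[ \t]*:[ \t]*(?P<value>.+?)[ \t]*$",
    re.MULTILINE,
)


def extract_fields(text):
    """Return a dict mapping each label to its value."""
    return {m.group("label"): m.group("value") for m in FIELD_PATTERN.finditer(text)}


fields = extract_fields(SAMPLE_TEXT)
print(fields["Movie title"])
```

If the documents do not label their fields this consistently, a pattern like this will miss values, so it is worth checking a few real files before relying on it.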

Thank you for your answer.

So I started working with an internet services invoice in PDF format.
This invoice has different sections, so I was thinking about creating different buckets such as Client Information, Calls, Internet, and TV.

The goal is to summarize the information and analyze it afterwards.
But I have to find the right method to extract the right information.
As I said before, depending on the document, the layout may be different.

For example, here is the PDF I’m working with.

It’s in Word format because I can’t upload a PDF here.

Invoice.docx (27.5 KB)

If it were about tables in a docx file, you could have a look at this workflow and the accompanying discussion.

There is a good chance there are R (or Python) packages out there that might be able to extract tables from PDF files.
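
As one possible starting point on the Python side, here is a rough sketch using the third-party pdfplumber package. This is an assumption about what might fit, not something tested on the invoice above; the function names are made up for illustration, and how well table detection works depends heavily on the PDF layout.

```python
def rows_to_records(table):
    """Turn one extracted table (header row first) into a list of dicts."""
    if not table:
        return []
    # Empty cells may come back as None, so normalise them to "".
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def extract_tables(pdf_path):
    """Collect every table pdfplumber finds, page by page."""
    import pdfplumber  # third-party: pip install pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Each table is a list of rows; each row is a list of cell strings.
            tables.extend(page.extract_tables())
    return tables
```

Something like `[rows_to_records(t) for t in extract_tables("invoice.pdf")]` (hypothetical file name) would then give one list of row dictionaries per detected table, which could be passed on for further analysis.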
