PDF reader workflow

davehansen · April 7, 2022, 3:00pm

I have been using KNIME for a while, but I am by no means an expert. I use it a lot for data analysis, and most recently I have identified an opportunity to read PDFs. Nothing I have tried or read about gives me the results I need. I simply want KNIME to read the PDF, and I can probably run with it from there, but all of the nodes I have used either read only the first line or just present the document path. Additionally, all of the workflows in the forum are far more complicated than what I need; I really just want to read the text in the PDF file. See the sample below. I know reading PDFs can be complicated, but the data is semi-structured, and if I could just get the text into a blob I could probably figure out a way to parse it.

ipazin · April 7, 2022, 3:27pm

Hello @davehansen,

and welcome to KNIME Community!

In short, if you have used the PDF Parser node (I think the same applies to the Tika Parser node), you will get a column of type Document with the path as its value. (Keep in mind that this node allows you to read multiple PDFs.) You can then use the Document Data Extractor node to extract data from it, including the text. Give it a try!

And here is a simple workflow you can check out (the first part)

Br,
Ivan

izaychik63 · April 7, 2022, 4:43pm

Take a look here. It could be close to what you are trying to do.

victor_palacios · April 7, 2022, 9:52pm

Hello @davehansen,

PDFs are in fact very tricky. First, what kind of PDF are you dealing with? An image-based PDF or a text-based PDF?

I assume it is text-based since you did get some text, but I need to be sure, because the path you take depends on it. If you can select the text in your PDF, it is very likely text-based. If you can’t select the text, you will need to OCR it first (the Tess4J node in KNIME handles OCR).
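
If you would rather check this programmatically than by selecting text by hand, here is a minimal Python sketch of the same decision. It assumes you already have the extracted text of each page as a list of strings. The function name `needs_ocr` and the 20-character threshold are hypothetical choices for illustration, not part of any KNIME node.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is image-based from the text a parser returned.

    page_texts: list of strings, one per page (assumed input).
    min_chars_per_page: assumed threshold; tune it for your documents.
    """
    if not page_texts:
        return True  # nothing was extracted at all

    # Count non-whitespace characters on each page.
    counts = [len("".join(text.split())) for text in page_texts]
    average = sum(counts) / len(counts)

    # Very little text per page suggests scanned images rather than real text.
    return average < min_chars_per_page


text_based = ["Invoice 1001\nTotal due: 250.00 USD", "Page 2 of 2, thank you"]
scanned = ["", " \n"]
print(needs_ocr(text_based))
print(needs_ocr(scanned))
```

This is only a heuristic: a mostly blank text-based PDF would also be flagged, so treat the result as a hint rather than a verdict.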

You can read text-based PDFs with the PDF Parser node, which returns a Document type (it usually looks like one line, as you mentioned), or with the Tika Parser node, which returns many columns (Content is the one you want). If you used the PDF Parser, you then need to convert the Document type to a String type with the Document Data Extractor node.
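
Once the text is available as a plain string (the Content column, or the output of the Document Data Extractor), semi-structured lines can often be split into fields with a few lines of code. Below is a minimal Python sketch. The sample text is invented for illustration, the helper name `split_columns` is hypothetical, and it assumes columns are separated by two or more spaces, which may not hold for your PDF.

```python
import re


def split_columns(text):
    """Split each non-empty line into fields on runs of two or more whitespace characters."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


# Invented sample; real extracted text will look different.
sample = """
Item        Qty    Price
Widget A    2      9.50
Widget B    10     1.25
"""

for row in split_columns(sample):
    print(row)
```

Single spaces inside a field (such as "Widget A") are kept, because only runs of two or more whitespace characters count as a separator.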

Finally, if the PDF is text-based but you don’t like the output you’re getting from the PDF Parser or the Tika Parser, you can try the Camelot Extractor Component, which is specifically meant for dealing with tables within PDFs like the one in your screenshot.

Now, if you need to OCR these tables, that is problematic for any software. OCRing tables is currently something even top organizations do not do well.
You can see this post for more details, or this post.

The best way to get help on the forums is to attach a sample PDF file and the workflow you used so we can quickly find the bottleneck. Thanks for posting, and I’m happy to explain anything else. PDFs seem fairly straightforward, but extracting data from them is a bit of both science and art.

If you’re interested in learning more, we’re also hosting a PDF outlier detection event in June with our next North America Data Connect for KNIME.

I’ve also gone ahead and made a simple PDF parser workflow on the KNIME Hub based on your comments. Thanks for the idea!

-Victor

mlauber71 · April 7, 2022, 11:11pm

@davehansen could you provide us with a sample? To extract tables from PDFs I have used this R package.

If you just want to read the text, there is also this package.

You should especially check out the information @victor_palacios gave.

davehansen · April 8, 2022, 2:32pm

Wow! What a great community, and what great responses. I used some of the insights provided here, and I now have the PDF(s) parsed and am working on getting the data formatted (WIP). Thank you for all the replies!
