PDF reader workflow

victor_palacios · April 7, 2022, 9:52pm

PDFs are in fact very tricky. First, what kind of PDF are you dealing with? An image-based PDF or a text-based PDF?

I assume text-based, since you did get some text, but I need to be sure because the path you take depends on it. If you can select the text in your PDF, it is very likely a text-based PDF. If you can’t select the text, you will need to OCR it first (the Tess4J node in KNIME handles OCR).
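
If clicking through pages by hand isn't practical, the same "can I select text?" check can be roughed out in code. This is only a sketch in Python, and it assumes you have already pulled per-page text out with some extractor. The function name `looks_text_based` and both thresholds are made up for illustration. They are not something KNIME provides.

```python
def looks_text_based(page_texts, min_chars=20, min_fraction=0.5):
    """Guess whether a PDF has a real text layer.

    page_texts: list of strings, one per page, from whatever extractor
    you used (hypothetical input, not a KNIME API).
    min_chars / min_fraction: illustrative thresholds, tune for your files.
    """
    if not page_texts:
        return False
    # Count pages that yielded a meaningful amount of non-whitespace text.
    pages_with_text = sum(
        1 for text in page_texts
        if len("".join((text or "").split())) >= min_chars
    )
    return pages_with_text / len(page_texts) >= min_fraction
```

If this returns False for a file, it is probably image-based and needs OCR before any parsing.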

You can read text-based PDFs with the PDF Parser node, which returns a Document type (it usually looks like one line, as you mentioned), or with the Tika Parser node, which returns many columns (Content is the one you want). If you used PDF Parser, you then need to convert the Document type to a String type with the Document Data Extractor node.

If the PDF is text-based but you don’t like the output you’re getting from the PDF Parser or the Tika Parser, you can also try the Camelot Extractor component, which is specifically meant for dealing with tables within PDFs like the one in your screenshot.
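
To summarize the options so far, here is the decision written out as a tiny Python sketch. The function `suggest_paths` is hypothetical and only returns the node names mentioned above as plain strings. It doesn't call KNIME in any way.

```python
def suggest_paths(text_selectable, has_tables=False):
    """Return candidate node chains, each one a list of KNIME node names.

    Hypothetical helper that mirrors the advice in this post.
    """
    if not text_selectable:
        # Image-based PDF: OCR has to happen first.
        return [["Tess4J"]]
    paths = [
        ["PDF Parser", "Document Data Extractor"],  # Document -> String
        ["Tika Parser"],                            # use the Content column
    ]
    if has_tables:
        # Extra option when the parser output mangles tables.
        paths.append(["Camelot Extractor"])
    return paths
```

For a text-based PDF with tables, `suggest_paths(True, has_tables=True)` lists all three text-based routes, so you can try them in order and keep whichever output looks best.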

Now, if you need to OCR these tables, that is problematic for any software. OCRing tables is currently something even top organizations do not do well.
You can see this post for more details or this post

The best way to get help on the forums is to attach a sample PDF file and the workflow you used so we can quickly find the bottleneck. Thanks for posting, and feel free to share any other details. PDFs seem fairly straightforward, but extracting data from them is part science and part art.

If you’re interested in learning more, we’re also hosting a PDF outlier detection event in June with our next North America Data Connect for KNIME.

I’ve also gone ahead and made a simple PDF parser workflow on the KNIME Hub based on your comments. Thanks for the idea!

-Victor