ScottF
February 15, 2023, 9:18pm
2
Hi @Sbhandary -
Data extraction from PDFs can definitely be a tricky task. Let me point you to some other threads where this type of analysis is addressed in more detail.
Hello @davehansen ,
PDFs are in fact very tricky. First, what kind of PDF are you dealing with? An image-based PDF or a text-based PDF?
I assume it is text-based, since you did get some text, but I need to be sure, because the path you take depends on whether you can select the text in your PDF. If you can, it is very likely a text-based PDF. If you can’t, you will need to OCR it first (the Tess4J node in KNIME handles OCR).
As well, you can read text-based PDFs with the PDF Pa…
Hi, I’m the PDF guy on the forum. I’ve never heard of a non-OCR PDF. What is that exactly?
We recently had a PDF extraction event via Data Connect. The slides can be found here.
For PDFs, you may also find that the Tika Parser is better for extraction (but it depends on what you want to extract and how).
As well, we did PDF extraction in a Just KNIME It challenge:
KNIME Hub
Extracting a Table from a PDF – alinebessa
Given a text-based PDF document with a table, can you par…
The Data Connect recording that Victor links in the second post might be of particular interest.
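
If you want a quick check outside KNIME before choosing between the text-based route and the OCR route described in the first quoted thread, here is a minimal Python sketch. It is only a rough heuristic and not what any KNIME node does internally: it assumes the PDF's content streams are either uncompressed or Flate-compressed, and it simply looks for text-showing operators (`Tj` / `TJ`) inside a `BT` text block. The function name `looks_text_based` is made up for illustration. Encrypted files, other stream filters, or scanned PDFs that carry a hidden OCR text layer can give misleading answers.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of each PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)

# A text block (BT ...) containing a text-showing operator: a string "(...)",
# hex string "<...>", or array "[...]" operand followed by Tj or TJ.
TEXT_OP_RE = re.compile(rb"\bBT\b.*?[)>\]]\s*T[jJ]\b", re.DOTALL)


def looks_text_based(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any stream appears to draw text, False otherwise."""
    for match in STREAM_RE.finditer(pdf_bytes):
        data = match.group(1)
        try:
            # Most content streams use FlateDecode (zlib). decompressobj
            # tolerates trailing bytes after the compressed data.
            data = zlib.decompressobj().decompress(data)
        except zlib.error:
            # Not zlib data (uncompressed or another filter): scan raw bytes.
            pass
        if TEXT_OP_RE.search(data):
            return True
    return False
```

For a file on disk you would call it as `looks_text_based(open("my_file.pdf", "rb").read())`, where `my_file.pdf` is a placeholder name. A `False` result suggests an image-based PDF that needs OCR first. A `True` result suggests the text can be read directly.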