Webinar: PDF Text Extraction using KNIME, Regex, and Python - August 17, 2022

Join @victor_palacios for the webinar "PDF Text Extraction using KNIME, Regex, and Python" on Wednesday, August 17, from 5 PM to 6 PM UTC+2 (Berlin), which is 10 AM to 11 AM UTC-5 (Chicago).

HERE IS THE RECORDING

PDF Extraction Webinar Slides

In this webinar, we will parse PDF documents using the no-code, free tool KNIME and integrate it with code-based tools - Regex and Python.

PDFs bring a number of unique challenges. For instance, how do we know if a PDF is text-based or image-based? If it is text-based, extracting the text can be done with one node and a few clicks in KNIME. But if the PDF is image-based, we need to perform Optical Character Recognition (OCR) first to extract the text. And what if we have thousands of PDFs of mixed types? Similarly, tables found in PDFs are almost always tough to extract, so what techniques does KNIME offer in this case? And can KNIME handle non-English or non-ASCII languages? Come join us for this one-hour presentation with @victor_palacios (KNIME Team Member), who will tackle each of these interesting problems.

In this webinar, we will:

  1. Learn different ways to read text- or image-based PDFs in KNIME.
  2. Examine the quality of our input PDFs to understand our output.
  3. Extract text from PDFs using KNIME, Regex, and Python integrations.
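
For anyone who wants to try the idea before the session, here is a minimal Python sketch of two of the steps above: guessing whether a PDF is image-based from the text a plain PDF reader returned, and pulling values out of that text with a regex. It is only an illustration, not the workflow shown in the webinar. The function names (`needs_ocr`, `find_dates`), the 20-character threshold, and the date format are all assumptions made for this example.

```python
import re

# Hypothetical threshold: fewer readable characters than this suggests that
# the PDF holds scanned images and needs OCR. Tune it for your own documents.
MIN_READABLE_CHARS = 20

# Illustrative pattern for ISO-style dates such as 2022-08-17.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def needs_ocr(extracted_text: str, min_chars: int = MIN_READABLE_CHARS) -> bool:
    """Guess whether a PDF is image-based from the text a reader extracted."""
    # Count letters and digits only, so whitespace and page-break characters
    # do not make an empty scan look like a text document.
    # str.isalnum() is Unicode-aware, so non-ASCII text is counted as well.
    readable = sum(1 for ch in extracted_text if ch.isalnum())
    return readable < min_chars


def find_dates(text: str) -> list[str]:
    """Return every ISO-style date found in the text, in order of appearance."""
    return DATE_PATTERN.findall(text)


if __name__ == "__main__":
    scanned = " \n\x0c "  # what a reader might return for a scanned page
    digital = "Invoice issued 2022-08-17, payment due 2022-09-01"
    print(needs_ocr(scanned))
    print(needs_ocr(digital))
    print(find_dates(digital))
```

With thousands of mixed PDFs, a check like this could be used to route each file either to direct text extraction or to an OCR step first.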

To actually get in :wink:


Hi,

Will the webinar be recorded?

Kind regards

Hi everyone, please tag me in the event you have a follow-up question. Thank you!


Hi Victor,

Thank you very much for the training!

I am a little confused because I am not able to find the Tesseract integration in KNIME.
It is not possible even automatically.
OS: Windows, KNIME version 4.6.1.
I am using archive files instead of direct links for installation and updates (due to my company's restrictions).

@victor_palacios

Yes, the webinar was recorded and will be uploaded within the week to this thread. Thank you!


Hello, I’ve seen this issue before, but it was caused by “Internal Network Restrictions”. Could this be the cause of your error?

Please disregard the previous message. I found a solution, and it is very simple :)
Tess4J is located in the experimental extensions, which I did not know.


When will the slide set be posted?

Just saw it posted above.
