Webinar: PDF Text Extraction using KNIME, Regex, and Python - August 17, 2022

Shantanuty · August 17, 2022, 6:13am

Join @victor_palacios for the webinar "PDF Text Extraction using KNIME, Regex, and Python" on Wednesday, August 17, from 5 PM to 6 PM UTC+2 (Berlin), which is 10 AM to 11 AM UTC-5 (Chicago).

HERE IS THE RECORDING

PDF Extraction Webinar Slides

In this webinar, we will parse PDF documents using the no-code, free tool KNIME and integrate it with code-based tools - Regex and Python.

PDFs bring a number of unique challenges. For instance, how do we know whether a PDF is text-based or image-based? If it is text-based, the text can be extracted with one node and a few clicks in KNIME. If it is image-based, we first need to perform Optical Character Recognition (OCR) to extract the text. But what if we have thousands of PDFs of mixed types? Similarly, tables in PDFs are almost always tough to extract, so what techniques does KNIME offer for them? And can KNIME handle non-English or non-ASCII text? Join us for this one-hour presentation with @victor_palacios (KNIME Team Member), who will tackle each of these problems.
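As a rough illustration of the mixed-types question, here is a minimal Python sketch that flags which PDFs probably need OCR. It assumes the text of each page has already been extracted by some parser and is passed in as plain strings. The function names, the 20-character threshold, and the "more than half the pages" rule are all hypothetical choices for this example, not something taken from the webinar.

```python
from typing import Dict, List


def needs_ocr(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF is image-based from its already-extracted page texts.

    Assumption: a page with fewer than `min_chars_per_page` non-whitespace-trimmed
    characters is treated as "no usable text layer". Both the threshold and the
    majority rule below are illustrative and should be tuned on real documents.
    """
    if not page_texts:
        # Nothing could be extracted at all, so fall back to OCR.
        return True
    sparse_pages = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    # Route to OCR when more than half of the pages are nearly empty.
    return sparse_pages / len(page_texts) > 0.5


def triage(docs: Dict[str, List[str]]) -> Dict[str, str]:
    """Label each document as 'ocr' or 'text' so it can be sent down the right branch."""
    return {
        name: ("ocr" if needs_ocr(pages) else "text")
        for name, pages in docs.items()
    }
```

With thousands of files, a label like this could drive a simple split: documents marked `text` go to the regular parser, and those marked `ocr` go to the OCR branch.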

In this webinar, we will:

Learn different ways to read text- or image-based PDFs in KNIME.
Examine the quality of our input PDFs to understand our output.
Extract text from PDFs using KNIME, Regex, and Python integrations.
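To give a feel for the Regex and Python part, here is a small sketch of the kind of post-processing that is often applied to text after it has been pulled out of a PDF. It only uses Python's standard `re` module. The helper names, the sample string, and the ISO date pattern are assumptions made for this example, not the workflow shown in the webinar.

```python
import re

# Matches ISO-style dates such as 2022-08-17 (illustrative pattern only).
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Matches a word split by a hyphen at a line break, e.g. "extrac-\ntion".
HYPHEN_BREAK_RE = re.compile(r"(\w)-\n(\w)")


def clean_text(raw: str) -> str:
    """Rejoin words hyphenated across line breaks and collapse runs of spaces/tabs."""
    text = HYPHEN_BREAK_RE.sub(r"\1\2", raw)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


def find_dates(text: str) -> list:
    """Return every ISO-style date found in the text, in order of appearance."""
    return DATE_RE.findall(text)


sample = "Invoice da-\nted 2022-08-17   for  services"
cleaned = clean_text(sample)
print(cleaned)
print(find_dates(cleaned))
```

In Python 3, `\w` matches Unicode word characters by default, so the hyphenation fix also applies to non-ASCII text.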

ArjenEX · August 17, 2022, 8:22am

To get actually in

helpmeplease · August 17, 2022, 12:36pm

Hi,

will the Webinar be recorded?

Kind regards

victor_palacios · August 17, 2022, 4:19pm

Hi everyone, please tag me in the event you have a follow-up question. Thank you!

Asap · August 17, 2022, 4:56pm

Hi Victor,

Thank you very much for the training!

I am a little confused because I am not able to find the Tesseract integration in KNIME.
Installing it automatically is not possible either.
OS: Windows, KNIME version 4.6.1.
I am using archive files instead of direct links for installation and updates (due to my company's restrictions).

@victor_palacios

victor_palacios · August 17, 2022, 5:18pm

Yes, the webinar was recorded and will be uploaded within the week to this thread. Thank you!

victor_palacios · August 17, 2022, 5:19pm

Hello, I’ve seen this issue before, but it was caused by “Internal Network Restrictions”. Could this be the cause of your error?

Asap · August 17, 2022, 8:19pm

Please disregard the previous message. I found a solution and it is very simple :)
Tess4J is located in the experimental extensions, which I had not known.

rwalker · August 18, 2022, 5:07am

When will the slide set be posted?

rwalker · August 18, 2022, 5:17am

Just saw it posted above.
