Extracting a Table from a PDF – KNIME Community Hub

Hub · December 2, 2022, 6:51pm

Given a text-based PDF document with a table, can you partially extract the table into a KNIME data table for further analysis? For this challenge we will extract the table from https://www.mountwashington.org/uploads/forms/2021/10.pdf and attempt to partially reconstruct it within KNIME. The corresponding KNIME table should contain the following columns: Day, Max, Min, Norm, Depart, Heat, and Cool.

Note 1: Your final output should be a table, not a single row with all the relevant data.

Note 2: The Tika Parser node is better suited for this task than the PDF Parser node.

We completed this task without components, regular expressions, or code-snippet nodes. In fact, our solution has a total of 10 nodes, but labeling the columns required a bit of manual effort.
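For readers who want to see the row-splitting idea outside KNIME, here is a minimal Python sketch. It is not the workflow solution described above (that one uses no code). It assumes the PDF text has already been extracted to a string, that each table row starts with a day number, and that the first seven whitespace-separated tokens map to the seven requested columns. The function name `rows_from_text` and the sample text are hypothetical, and the sample values are made up, not taken from the PDF.

```python
COLUMNS = ["Day", "Max", "Min", "Norm", "Depart", "Heat", "Cool"]


def rows_from_text(text):
    """Turn whitespace-separated text lines into a list of row dicts."""
    rows = []
    for line in text.splitlines():
        # str.split() with no argument splits on any run of whitespace,
        # so no single recurring delimiter character is needed.
        fields = line.split()
        # Skip headers, footers, and blank lines: a data row must have
        # enough fields and start with a day number.
        if len(fields) < len(COLUMNS) or not fields[0].isdigit():
            continue
        if not 1 <= int(fields[0]) <= 31:
            continue
        # Assumption: the first seven tokens are the columns we want.
        rows.append(dict(zip(COLUMNS, fields[:len(COLUMNS)])))
    return rows


# Made-up sample text for illustration only, NOT values from the real PDF.
SAMPLE = """\
DAY MAX MIN NORM DEPART HEAT COOL
1   40  30  35   0      30   0
2   42  28  35   -1     31   0
Some footer text
"""

if __name__ == "__main__":
    for row in rows_from_text(SAMPLE):
        print(row)
```

The values stay as strings here; converting them to numbers would be a separate step, much like a type-conversion node after splitting in a KNIME workflow.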

This is a companion discussion topic for the original entry at https://hub.knime.com/-/spaces/-/latest/~1uIrwPPiVwP-r-v5/

tite_za · June 1, 2023, 11:41am

Thank you for sharing. I am trying to extract data from a text-based PDF document (it contains a table that I would like to turn into a data table I can use in my workflow). The PDF Parser node works, but I cannot figure out how to configure the Cell Splitter node, because the text has no recurring delimiter that I could split on. Does anyone have an idea? Thank you

ScottF · June 1, 2023, 3:19pm

Hi @tite_za and welcome to the forum.

I would suggest posting your question in a new topic in the #knime-analytics-platform category, along with your workflow in progress so far, and some sample input data (if it’s not confidential). Then some of our community experts might be able to dive in and take a look.