Convert PDF files to database

luise295 · March 7, 2024, 4:17pm

Hi,

I have a PDF document with 21,708 pages, and each page looks like this:

Is there any way to transform all these pages into a database?
The database must have the columns ACCT, NMBR, TYPE, TIME, USER-ID, DESCRIPTION, PREVIOUS VALUE, NEW VALUE.

Thanks!

Juergen · March 7, 2024, 4:55pm

Hi,

I have used the R package tabulizer for a comparable task.

Here is the link: Introduction to tabulizer

You can combine the R package with KNIME by using the R Snippet node and looping over the document pages. To ease further processing in KNIME, I would select ‘data.frame’ as the output format for the extract_tables function in the R script.

Once you have the data (one page or the complete document) in KNIME, you can write it to a database.
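If you would rather do the database step outside KNIME, here is a minimal sketch in Python using the standard-library sqlite3 module. It assumes the extracted rows arrive as 8-item sequences in the column order from the question; the table name `audit_log` and the helper names are made up for illustration, and SQLite is only one possible target database.

```python
import sqlite3

# Column names taken from the original question. They are double-quoted in
# the SQL below because some contain a hyphen or a space.
COLUMNS = ["ACCT", "NMBR", "TYPE", "TIME", "USER-ID",
           "DESCRIPTION", "PREVIOUS VALUE", "NEW VALUE"]


def create_table(conn, table="audit_log"):
    """Create the target table (hypothetical name) with all columns as TEXT."""
    cols = ", ".join(f'"{c}" TEXT' for c in COLUMNS)
    # The table name is interpolated directly, so it must be a trusted value.
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')


def insert_rows(conn, rows, table="audit_log"):
    """Insert rows that have exactly one value per column; skip the rest.

    Returns the number of rows that were inserted.
    """
    placeholders = ", ".join("?" for _ in COLUMNS)
    good = [tuple(r) for r in rows if len(r) == len(COLUMNS)]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', good
        )
    return len(good)
```

Rows with the wrong number of fields are skipped here rather than guessed at, since page headers or wrapped lines often end up as malformed rows after PDF extraction. You would probably want to log those and check them by hand.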

Best regards,

Jürgen

PS: Presumably you will find comparable functionality in Python.

mlauber71 · March 8, 2024, 4:17am

Here is an example of how to use it:

There was also a challenge on how to extract a table from a PDF file, with this solution:

https://hub.knime.com/search?type=Workflow&tag=justknimeit-15&sort=best

A Python option could be Camelot.
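A rough sketch of how that could look for a document this large, processing the pages in batches so that not everything is parsed at once. The batch size of 500, the helper names, and the choice of `flavor="stream"` (meant for tables without ruling lines) are assumptions; the right flavor depends on how the pages are actually laid out. Camelot is a third-party package and has to be installed separately.

```python
def page_ranges(total_pages, batch_size=500):
    """Yield page-range strings such as '1-500', '501-1000', ..."""
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        yield f"{start}-{end}"


def extract_batch(pdf_path, pages, flavor="stream"):
    """Return one pandas DataFrame per table Camelot finds in the given pages."""
    import camelot  # imported here so page_ranges works without Camelot installed

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    return [table.df for table in tables]
```

You would then loop over `page_ranges(21708)`, call `extract_batch` for each range, and append the resulting data frames to your database.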
