Convert PDF files to database

Hi,

I have a PDF document with 21,708 pages, in which each page looks like this:

Is there any way to transform all these pages into a database?
The database must have the columns ACCT, NMBR, TYPE, TIME, USER-ID, DESCRIPTION, PREVIOUS VALUE, NEW VALUE.

Thanks!

Hi,

I have used the R package tabulizer for a comparable task.

Here is the link: Introduction to tabulizer

You can combine the R package with KNIME by using the R Snippet node and looping over the document pages. To ease further processing in KNIME, I would select ‘data.frame’ as the output format of the extract_tables function in the R script.

Once you have the data (one page or the complete document) in KNIME, you can feed it into a database...
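
Outside of KNIME, the "feed a database" step could look roughly like the Python sketch below. It assumes the rows have already been extracted as sequences of eight strings, and it uses SQLite only because it ships with Python. The table name `changes`, the helper names, and the sample row are made up for illustration; the column names come from the question, with hyphens and spaces replaced by underscores.

```python
import sqlite3

# Column names taken from the question; hyphens and spaces are replaced
# with underscores so they are plain SQL identifiers.
COLUMNS = ["ACCT", "NMBR", "TYPE", "TIME", "USER_ID",
           "DESCRIPTION", "PREVIOUS_VALUE", "NEW_VALUE"]


def create_table(conn, table="changes"):
    """Create the target table if it does not exist yet (all columns as TEXT)."""
    cols = ", ".join(f'"{c}" TEXT' for c in COLUMNS)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')


def insert_rows(conn, rows, table="changes"):
    """Insert rows that have exactly len(COLUMNS) fields; return how many were stored."""
    good = [tuple(r) for r in rows if len(r) == len(COLUMNS)]
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', good)
    conn.commit()
    return len(good)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
    create_table(conn)
    # Made-up sample row, only to show the call.
    sample = [("1001", "1", "CHG", "10:15", "jdoe", "Limit", "500", "750")]
    print(insert_rows(conn, sample))
```

Rows with the wrong number of fields are skipped here rather than stored, so with this many pages it is worth comparing the returned count against the number of extracted rows.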

Best regards,

Jürgen

PS: Presumably you will find comparable functionality in Python...

Here is an example of how to use it:

There was also a challenge on how to extract a table from a PDF file, with these solutions:

https://hub.knime.com/search?type=Workflow&tag=justknimeit-15&sort=best

A Python option could be Camelot.
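
A minimal sketch of how that might look, assuming Camelot is installed and the PDF is text-based (not scanned). With a document of this size it is probably safer to read the pages in chunks than all at once. The helper names, the chunk size, and the choice of `flavor="stream"` are assumptions: "stream" is meant for tables separated by whitespace, "lattice" for tables with ruling lines, and which one fits depends on the page layout.

```python
def page_chunks(total_pages, chunk_size=50):
    """Yield page-range strings such as '1-50', '51-100' covering 1..total_pages."""
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        yield f"{start}-{end}"


def extract_rows(pdf_path, pages, flavor="stream"):
    """Return all table rows found on the given pages as lists of cell values."""
    import camelot  # third-party; imported here so page_chunks works without it

    rows = []
    for table in camelot.read_pdf(pdf_path, pages=pages, flavor=flavor):
        # Each Camelot table exposes its content as a pandas DataFrame in .df
        rows.extend(table.df.values.tolist())
    return rows


if __name__ == "__main__":
    # Only shows the chunking; the extraction itself needs Camelot and a real file.
    print(list(page_chunks(120, 50)))
```

For the document in the question, one would loop over `page_chunks(21708)` and call `extract_rows("report.pdf", pages)` for each chunk, where "report.pdf" is a placeholder file name. Header lines repeated on every page will most likely need to be filtered out before the rows go into the database.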
