Automate a PDF reader and convert the data to an Excel table with correct column mappings

Hi all, I need help building a PDF reader that accurately reads a PDF file and converts it into an Excel document with the data mapped to the correct columns.

@Miguel_n11 welcome to the KNIME forum. There was a discussion about this with some example workflows.

Maybe you can provide us with a sample so we can see what the tables look like, if you have one that is OK to publish.

Hi, I am having trouble uploading the PDF file because it is not an accepted file format. Could you assist?

I think if you change the extension to something else (like txt) you should be able to upload it. Whoever downloads it can then change it back to PDF afterward.

Hi, I have embedded the PDF file here: https://www.beamium.com/FGGQVRWP

Please see the PDF, which contains the tables I need KNIME to extract data from: https://www.beamium.com/FGGQVRWP

Hello @Miguel_n11,

Please make sure that you are not sharing any kind of private or sensitive information publicly.

Br,
Ivan

Hi,

This is dummy data that mirrors the layout of the data we receive.

Regards,

Miguel


I had problems with the KNIME-only approaches mentioned above, so I tried a combination of KNIME and R. The workflow has these steps:

  • configure and run R’s “tabulizer” package
  • the settings ‘stream’ and GUESS seem to work best in your case
  • it extracts one table from each page, tries to find the headers, and returns each page as a table
  • not all information ends up in the same columns (we come to that later)
  • the tables are saved as individual CSV files (with their varying structure)
  • the CSV files are then imported into KNIME, with all columns forced to strings, and combined into a single table
  • the text fields whose information is spread over three columns are merged
  • the summary lines with the Credit balance are split off
  • a single ID is created for each transaction block and assigned to all of its rows
  • the “our reference” field is extracted and stored in a separate column (you could do the same for other information)
  • the remaining “communication” text is brought into one cell
  • all the information is put together and can be stored
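
To make the post-processing steps above more concrete, here is a minimal sketch in plain Python rather than KNIME nodes or R. It is not the actual workflow. The column names (`Date`, `Text1` to `Text3`), the rule that a non-empty `Date` starts a new transaction block, the wording of the “our reference” label, and the sample rows are all assumptions made up for illustration, so they would need to be adapted to the real CSV files.

```python
import csv
import io
import re

def read_all_strings(csv_texts):
    """Combine CSV texts with differing headers into one table.

    Every value stays a string; columns missing from a file become ''.
    """
    columns, rows = [], []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)
        rows.extend(reader)
    return columns, [{c: (r.get(c) or "") for c in columns} for r in rows]

# Assumed wording of the reference label; adjust to the real documents.
REF_PATTERN = re.compile(r"our reference[:\s]+(\S+)", re.IGNORECASE)

def build_transactions(rows, text_columns, start_column):
    """Group rows into transaction blocks and split off summary lines."""
    transactions, summaries = [], []
    current = None
    for row in rows:
        # Merge the text that is spread over several columns.
        text = " ".join(row[c].strip() for c in text_columns if row[c].strip())
        # Summary lines are kept apart from the transactions.
        if "credit balance" in text.lower():
            summaries.append(row)
            continue
        # Assumption: a filled start column marks the start of a new block.
        if row[start_column].strip() or current is None:
            current = {
                "block_id": len(transactions) + 1,
                "our_reference": "",
                "communication": "",
            }
            transactions.append(current)
        # Pull the reference into its own field and drop it from the text.
        match = REF_PATTERN.search(text)
        if match:
            current["our_reference"] = match.group(1)
            text = REF_PATTERN.sub("", text).strip()
        # Whatever text remains is appended to the block's communication.
        if text:
            current["communication"] = (current["communication"] + " " + text).strip()
    return transactions, summaries

# Made-up sample data standing in for two per-page CSV exports.
page1 = "Date,Text1,Text2\n01/02,Payment,invoice 17\n,Our reference: AB123,\n"
page2 = "Date,Text1,Text2,Text3\n,Credit balance,,100.00\n03/02,Transfer,rent,March\n"

columns, rows = read_all_strings([page1, page2])
transactions, summaries = build_transactions(rows, ["Text1", "Text2", "Text3"], "Date")
for t in transactions:
    print(t)
```

Reading everything as strings first sidesteps type conflicts between pages whose columns do not line up, which is the same reason for forcing string columns on import in KNIME.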

Of course, you might do further manipulations, such as converting the sums into numbers or introducing checks against the separate balances. If your columns vary a lot between documents, you might have to alter the workflows and change the definitions in R.
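
As a hedged illustration of those two follow-up steps, the sketch below converts amount strings to exact decimals and checks a block of amounts against its balances. The number formats and the idea of an opening and closing balance per block are assumptions, since the real layout depends on the PDF; `parse_amount` and `balances_match` are hypothetical helper names.

```python
from decimal import Decimal

def parse_amount(text, decimal_sep=".", thousands_sep=","):
    """Convert an amount string such as '1,234.56' into a Decimal.

    The separators are assumptions; pass decimal_sep=',' and
    thousands_sep='.' for amounts written like '1.234,56'.
    """
    cleaned = text.strip().replace(thousands_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)

def balances_match(opening, amounts, closing):
    """Return True if opening balance plus all amounts equals the closing balance."""
    return opening + sum(amounts, Decimal("0")) == closing

# Made-up values for illustration only.
amounts = [parse_amount("-20.50"), parse_amount("5.50")]
print(balances_match(parse_amount("100.00"), amounts, parse_amount("85.00")))
```

Using `Decimal` instead of floating point keeps the balance comparison exact for currency values.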

The results would then look something like this:

Unfortunately, on my Mac I run into problems with the R package when running it from within KNIME.


Hi @Miguel_n11,

thought so but wanted to check :wink:

Br,
Ivan
