Extract data from PDF invoice

Hello all, I have multiple invoices converted from Word to PDF. I am trying to extract the information in tabular form using KNIME. I tried the PDF Parser and Tika Parser nodes, but neither works. I have attached a sample document; suggestions on how to go about this would be welcome.

Company Name.docx (40.0 KB)

@lavvenkatesh welcome to the KNIME forum. This is possible with the help of R and the package “docxtractr”, as in this workflow built on your example (two others are 1 | 2):

The content from your example is split into separate tables and then exported to KNIME tables or Excel sheets:

You might have to work out the further handling of the data yourself. Separate columns are added to indicate where each table came from. If your invoices always have the same structure, you could always use the 3rd table, or you could identify the tables by their column names. All .DOCX files that you place in the /data/_docx/ folder will be scanned and imported.

Admittedly this approach might not be totally intuitive, but once you have familiarised yourself with some R and KNIME you gain a very powerful tool.
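For readers who want to see what the import step described above boils down to, here is a minimal sketch in plain Python. It is not the docxtractr workflow itself: it swaps in the Python standard library and reads the tables straight out of the XML inside each .docx file. The function names (`extract_tables`, `scan_folder`, `find_by_header`) and the `source_file` / `table_index` fields are made up for illustration, and it assumes simple tables whose rows and cells are direct children of the table element.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used by word/document.xml inside a .docx archive
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_tables(docx_path):
    """Return every table in a .docx file as a list of rows (lists of cell text)."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    tables = []
    for tbl in root.iter(W + "tbl"):
        rows = []
        for tr in tbl.findall(W + "tr"):
            cells = []
            for tc in tr.findall(W + "tc"):
                # A cell can hold several paragraphs; join the text runs of each.
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(W + "t"))
                    for p in tc.findall(W + "p")
                ]
                cells.append("\n".join(paragraphs).strip())
            rows.append(cells)
        tables.append(rows)
    return tables


def scan_folder(folder):
    """Collect the tables of all .docx files in a folder, tagged with their origin."""
    records = []
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() != ".docx":
            continue
        for index, rows in enumerate(extract_tables(path), start=1):
            records.append(
                {"source_file": path.name, "table_index": index, "rows": rows}
            )
    return records


def find_by_header(records, required_columns):
    """Keep only tables whose first row contains all the required column names."""
    wanted = set(required_columns)
    return [r for r in records if r["rows"] and wanted <= set(r["rows"][0])]
```

Something like `scan_folder("data/_docx")` would then give one record per table, and `find_by_header(records, ["Item", "Price"])` would pick tables by their header row instead of by position (the column names here are placeholders, not taken from the sample invoice). Picking by header is usually the safer choice if some invoices contain an extra table.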

Look here

@izaychik63 / @lavvenkatesh - If we are talking about extracting tables from PDF, there is also an R package, “tabulizer”, for that :slight_smile:

Thank you @izaychik63 and @mlauber71 for the timely replies. Let me try the suggestions; by the end of the day I will know which one worked for me.
