I managed to get my first Amazon Textract workflow running for Automated Invoice Processing! Man, installing the python integration took me back to installing and running ubuntu packages.
I have some issues on the OCR side of things as well as general questions on the endpoints of the workflow.
Here’s the start point
(Automatic Invoice Processing using Amazon Textract through Boto3 – KNIME Hub)
Here’s a snapshot of source invoice
Date reads as Feb10/2: Textract appears to read the header-level fields correct for the most part (i.e., customer, terms, etc), but for the date field, it adds a colon instead of “22”
Doesn’t read line items Isn’t this the most important part of an invoice? We have a PO system and buy/receive at the item level. So pulling line items from invoice would then allow me to create a workflow where I can send an update back to the PO bill what has been received.
Thanks in advance.
Just wondering if you tried using the workflow I made previously with Tess4J? Since this is another table, I worry that OCR software which doesn’t specialize specifically in tables will return poor results (even Textract which can deal with tables gives errors as you mentioned). I tried investigating open source OCR software but most had subpar results for anything in a table format.
For the 2 becoming a colon, if that problem is localized to 2 and for that specific date format, you can run regex to find only that type of error and correct it.
Hey Victor, tried the workflow, but what’s the best way to parse the contents cell in the Tika Parser URL Input node?
This is a preview of that cell:
If it was XML or JSON then I could indeed parse it, but this one is just plain text.
Side note - invoice is in PDF, so I moved the original Image Reader and Image Normalizer node to the side.
It will take a little leg work, but this seems parseable. I’m attaching a sample with a similar pdf and table extraction.
Table Extraction.knar.knwf (359.4 KB)
The point was to extract some of the information from this table within the text-based PDF:
Start node output:
End node output: