I managed to get my first Amazon Textract workflow running for Automated Invoice Processing! Man, installing the python integration took me back to installing and running ubuntu packages.
I have some issues on the OCR side of things as well as general questions on the endpoints of the workflow.
Date reads as Feb10/2: Textract appears to read the header-level fields correct for the most part (i.e., customer, terms, etc), but for the date field, it adds a colon instead of “22”
Doesn’t read line items Isn’t this the most important part of an invoice? We have a PO system and buy/receive at the item level. So pulling line items from invoice would then allow me to create a workflow where I can send an update back to the PO bill what has been received.
Just wondering if you tried using the workflow I made previously with Tess4J? Since this is another table, I worry that OCR software which doesn’t specialize specifically in tables will return poor results (even Textract which can deal with tables gives errors as you mentioned). I tried investigating open source OCR software but most had subpar results for anything in a table format.
For the 2 becoming a colon, if that problem is localized to 2 and for that specific date format, you can run regex to find only that type of error and correct it.