Formatted Text/Numeric Extraction from Invoices

Hello,

I have a bunch of invoices (from several vendors) and would like to extract certain information. I’m able to extract and match some of it, but some doesn’t match. For example, I’m able to extract one IBAN, but not the second. The first IBAN is without spaces (DE123456…), the second is with spaces (DE12 3456 7890). From the second I only receive DE12. Is it possible to clean the PDF and get rid of these “errors”? I ask because the extraction of the customer ID, which also contains spaces (formatted 123 456 789), does work.

Also, is it possible to extract information not with a Rule-based Row Filter expression like “$Term as String$ LIKE “ummer”” but by an expected format, like “###0,##” for € amounts and “dd.mm.yyyy” for dates?

br,
Sven

hi @sven-abx,

I would suggest the usage of the Regex Extractor node.

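For your IBAN use case, the following pattern should get all possible values, with or without spaces between the digit groups (it expects two letters followed by 20 digits, which is the length of a German IBAN):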

[A-Za-z]{2}\d{2}\s*\d{4}\s*\d{4}\s*\d{4}\s*\d{4}\s*\d{2}
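To see how the pattern behaves outside the node, here is a minimal Python sketch. It only illustrates the regex itself; the helper name `extract_ibans` and the sample IBANs are made up, and the thread does not show how the Regex Extractor node is actually configured. The `\s*` between the digit groups is what lets the spaced variant match in full instead of stopping at DE12.

```python
import re

# Pattern from the reply above: 2 letters, 2 digits, then 18 digits in groups,
# with optional whitespace (\s*) allowed between the groups.
IBAN_PATTERN = re.compile(
    r"[A-Za-z]{2}\d{2}\s*\d{4}\s*\d{4}\s*\d{4}\s*\d{4}\s*\d{2}"
)

def extract_ibans(text):
    """Hypothetical helper: return every match with inner whitespace removed."""
    return [re.sub(r"\s+", "", m.group(0)) for m in IBAN_PATTERN.finditer(text)]

# Made-up sample text with one compact and one spaced IBAN.
sample = "Pay to DE12345678901234567890 or DE98 7654 3210 9876 5432 10 please."
print(extract_ibans(sample))
```

Stripping the whitespace after matching gives both IBANs the same compact form, which makes later matching or joining easier.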


Regex should also be used for the € amounts and the date substrings.
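As a rough sketch of what such patterns could look like, assuming amounts use a decimal comma with exactly two decimals (optionally with "." as thousands separator) and dates are written dd.mm.yyyy. The pattern names, helper functions, and sample line below are illustrative only and not taken from the thread.

```python
import re
from datetime import datetime

# Assumed amount format: digits, optional ".ddd" thousands groups, comma, 2 decimals.
AMOUNT_PATTERN = re.compile(r"\b\d+(?:\.\d{3})*,\d{2}\b")
# Assumed date format: dd.mm.yyyy
DATE_PATTERN = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")

def to_float(amount):
    """Convert '1.234,56' to 1234.56 (drop thousands dots, comma -> point)."""
    return float(amount.replace(".", "").replace(",", "."))

def to_date(text):
    """Parse 'dd.mm.yyyy'; raises ValueError for impossible dates like 31.02.2024."""
    return datetime.strptime(text, "%d.%m.%Y").date()

# Made-up invoice line.
line = "Invoice date 15.03.2024, net 99,00 €, total 1.234,56 €"
amounts = AMOUNT_PATTERN.findall(line)
dates = DATE_PATTERN.findall(line)
print(amounts, [to_float(a) for a in amounts])
print(dates, [to_date(d) for d in dates])
```

The regex only checks the shape of the text; parsing the match afterwards is what catches values that look right but are not valid dates.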

Hope that helps, Greetz, Tommy


hi @tommy,

Thanks for your answer. I had a closer look at the Regex Extractor node and was able to get all the necessary info.

br,
sven

