Maybe someone can help me out. I read a PDF line by line (each line as a string) and search for keywords. Sometimes two or more keywords appear in the same line.
Example (CustomerID and Contact are keywords): Row 12: CustomerID: 123456 Contact: John Doe
As a first step I check, in a loop, whether each keyword is found, and the result looks like this:
Row0 CustomerID: 123456 Contact: John Doe Row12 CustomerID
Row1 CustomerID: 123456 Contact: John Doe Row12 Contact
But the final result should be (or something similar):
CustomerID Contact
Row0 123456 John Doe
I hope someone can give me a hint, because I have been searching for a solution for two days now and have tried several things.
Thanks for your help. I knew there must be a node for this.
But a few more data-prep steps need to be done, because I noticed that some CustomerIDs contain spaces, like 123 456.
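The space clean-up mentioned above could be sketched like this in Python, outside of KNIME. This is only an illustration: the function name `normalize_customer_id` is made up, and it assumes that any whitespace inside an ID is noise and can simply be dropped.

```python
import re


def normalize_customer_id(raw: str) -> str:
    """Remove every whitespace character from an ID, e.g. '123 456' -> '123456'."""
    # Assumption: whitespace never carries meaning inside a CustomerID.
    return re.sub(r"\s+", "", raw)


print(normalize_customer_id("123 456"))
```

The same idea should carry over to a regex-replace step inside the workflow, applied to the extracted CustomerID column before any further processing.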
You can first use the Tika Parser node to read the PDF file(s), followed by the “Regex Extractor” node from the Palladian extension with a regular expression like:
CustomerID\s*:\s+(?<CustId>.*?)\s+Contact\s*:\s*(?<Cont>.*?)\n
with flags set to i,s,m
The first group, named “CustId”, takes all the characters between "CustomerID: " and "Contact: ", and the second group, named “Cont”, takes all the characters after "Contact: " up to the end of each line.
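To try the two named groups outside of KNIME, here is a rough Python sketch. It is not what the Regex Extractor node does internally. Python writes named groups as `(?P<name>...)`, and the function name `extract_rows` and the sample line are only there for illustration.

```python
import re

# Same idea as the expression above, in Python's named-group syntax.
# The flags correspond to i, s, m.
PATTERN = re.compile(
    r"CustomerID\s*:\s+(?P<CustId>.*?)\s+Contact\s*:\s*(?P<Cont>.*?)\n",
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)


def extract_rows(text):
    """Return one dict per matching line, with one entry per keyword."""
    return [
        {"CustomerID": m.group("CustId"), "Contact": m.group("Cont")}
        for m in PATTERN.finditer(text)
    ]


sample = "Row 12: CustomerID: 123456 Contact: John Doe\n"
print(extract_rows(sample))
# [{'CustomerID': '123456', 'Contact': 'John Doe'}]
```

Each match yields one row with both values side by side, which is the shape asked for in the question. An ID written as 123 456 is captured with its space and would still need cleaning afterwards.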
Thanks for your answer. Is it possible to use the Regex node with a variable, like $ValueOfRow$\s*:\s+(?<CustId>.*?)\s+ until a space appears followed by a-zA-Z\s*:\s*(?<Cont>.*?)\n ?
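Whether the node accepts a flow variable inside the expression is not answered here, but the idea itself (build the pattern from a keyword and stop at the next "word followed by a colon") can be sketched in Python. The names `build_pattern` and `extract` are hypothetical, and the sketch assumes every keyword in the line is a run of letters followed by a colon.

```python
import re


def build_pattern(keyword):
    """Build a pattern that captures the value following 'keyword:'."""
    # re.escape protects against regex metacharacters in the keyword.
    # The lookahead stops the value at the next "word:" or at the line end.
    return re.compile(
        re.escape(keyword) + r"\s*:\s*(?P<value>.*?)(?=\s+[A-Za-z]+\s*:|\s*$)",
        re.IGNORECASE | re.MULTILINE,
    )


def extract(line, keywords):
    """Return {keyword: value or None} for a single line of text."""
    result = {}
    for kw in keywords:
        m = build_pattern(kw).search(line)
        result[kw] = m.group("value") if m else None
    return result


line = "Row 12: CustomerID: 123 456 Contact: John Doe"
print(extract(line, ["CustomerID", "Contact"]))
# {'CustomerID': '123 456', 'Contact': 'John Doe'}
```

One limitation of this approach: a value that itself contains a word followed by a colon would be cut short at that point.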