Maybe someone can help me out. I read a PDF line by line (as strings) and search for keywords. Sometimes two or more keywords appear in the same line.
Example (CustomerID and Contact are keywords): Row 12: CustomerID: 123456 Contact: John Doe
As a first step I checked, via a loop, whether the keywords are found, and the result looks like this:
Row0 CustomerID: 123456 Contact: John Doe Row12 CustomerID
Row1 CustomerID: 123456 Contact: John Doe Row12 Contact
But the final result should be (or something similar):
Row0 123456 John Doe
I hope someone can give me a hint, because I have been searching for a solution for two days now and have tried several things.
Here is a little example workflow using your example:
I hope this solves your problem.
Please let me know in case you have any questions regarding the example workflow.
Thanks for your help. I knew there must be a node for this.
But a few more steps (data prep) need to be done, because I noticed that some CustomerIDs contain spaces, like 123 456.
Thanks for your quick help.
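The data prep step mentioned above (CustomerIDs that contain spaces, like 123 456) could be sketched as follows. This is only an illustration in Python, not the node configuration used in the workflow; the function name normalize_customer_id is hypothetical, and it assumes that whitespace inside an ID is never meaningful.

```python
import re


def normalize_customer_id(raw: str) -> str:
    """Remove every whitespace character from an ID string.

    Assumption: spaces inside a CustomerID carry no meaning,
    so '123 456' and '123456' refer to the same customer.
    """
    return re.sub(r"\s+", "", raw)


if __name__ == "__main__":
    print(normalize_customer_id("123 456"))
```

Inside a workflow the same idea would apply to the extracted CustomerID column before any further matching or joining.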
You can first use the Tika node to read the PDF file(s), followed by the “Regex Extractor” node from the Palladian extension, with a regular expression like:
with flags set to i,s,m
The first group, named “CustId”, takes all the characters between "CustomerId : " and "Contact : ", and the second group, named “Cont”, takes all the characters after "Contact : " to the end of each line.
Send a sample to check the regex.
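The exact pattern is not shown in the post, so the following is only a sketch of what the description suggests: two named groups, one for the text between the CustomerID and Contact keywords and one for the rest of the line. It is written in Python, where named groups use the (?P<name>...) syntax instead of the Java-style (?<name>...). The pattern, the name PATTERN and the function extract are assumptions made for illustration. Only the case-insensitive flag is used here, so that "." never runs across a line break.

```python
import re

# Assumed pattern (not the one from the post):
#   CustId = text between "CustomerID:" and "Contact:"
#   Cont   = text after "Contact:" up to the end of the line
PATTERN = re.compile(
    r"CustomerID\s*:\s*(?P<CustId>.*?)\s*Contact\s*:\s*(?P<Cont>[^\n]*)",
    re.IGNORECASE,
)


def extract(line: str):
    """Return (customer_id, contact) for one line, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group("CustId").strip(), match.group("Cont").strip()


if __name__ == "__main__":
    print(extract("Row 12: CustomerID: 123456 Contact: John Doe"))
```

Because the first group is lazy and stops at the Contact keyword, an ID written with a space (123 456) is captured whole and can be cleaned up afterwards.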
Thanks for your answer. Is it possible to use the RegEx node with a variable, like $ValueOfRow$\s*:\s+(?>CustId>.?)\s+ until a space appears followed by a-zA-Z\s:\s(?.*?)\n?
Thanks and regards,
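The idea in the question above, building the pattern from a keyword held in a variable and stopping at the next keyword, could look roughly like this in Python. Whether the Regex Extractor node accepts a variable inside its pattern is not confirmed here. The function value_after and its parameters are hypothetical, and the sketch assumes the full list of keywords is known in advance.

```python
import re


def value_after(keyword: str, line: str, all_keywords: list):
    """Return the text after 'keyword:' up to the next known keyword
    (or the end of the line). Returns None if the keyword is absent."""
    # Escape each keyword so it is matched literally, not as regex syntax.
    others = "|".join(re.escape(k) for k in all_keywords if k != keyword)
    # Stop before "<space><other keyword>:" or at the end of the line.
    stop = rf"(?=\s+(?:{others})\s*:|$)" if others else r"$"
    pattern = rf"{re.escape(keyword)}\s*:\s*(?P<value>.*?){stop}"
    match = re.search(pattern, line, re.IGNORECASE)
    return match.group("value").strip() if match else None


if __name__ == "__main__":
    keywords = ["CustomerID", "Contact"]
    row = "Row 12: CustomerID: 123456 Contact: John Doe"
    for kw in keywords:
        print(kw, "->", value_after(kw, row, keywords))
```

Looping over the keywords this way yields one value per keyword for each line, which can then be pivoted into a single row such as "123456 John Doe".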