Keyword search in document and extraction

Hello Community,

Maybe someone can help me out. I read a PDF line by line (each line as a string) and search for keywords. Sometimes two or more keywords appear in the same line (string).
Example (CustomerID and Contact are keywords): Row 12: CustomerID: 123456 Contact: John Doe

As a first step I checked, via a loop, whether keywords are found, and the result looks like this:
Row0 CustomerID: 123456 Contact: John Doe Row12 CustomerID
Row1 CustomerID: 123456 Contact: John Doe Row12 Contact

But the final result should be (or something similar):
CustomerID Contact
Row0 123456 John Doe

I hope someone can give me a hint, because I have been searching for a solution for two days now and have tried several things.

BR,
Sven
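
For readers who want to try the idea outside KNIME, here is a minimal Python sketch of the transformation asked for above: one line in, one value per keyword out. The function name extract_fields and the KEYWORDS list are assumptions made for illustration, and the sketch assumes every keyword is followed by a colon and that a value ends where the next known keyword starts.

```python
import re

# Assumed keyword list; extend it with whatever labels occur in the PDF.
KEYWORDS = ["CustomerID", "Contact"]


def extract_fields(line, keywords=KEYWORDS):
    """Return {keyword: value} for every known keyword found in the line."""
    alternation = "|".join(re.escape(k) for k in keywords)
    # A value is matched lazily and stops right before the next
    # "<keyword>:" or at the end of the line.
    pattern = re.compile(
        rf"(?P<key>{alternation})\s*:\s*(?P<value>.*?)"
        rf"(?=\s+(?:{alternation})\s*:|\s*$)"
    )
    return {m.group("key"): m.group("value") for m in pattern.finditer(line)}


row = extract_fields("Row 12: CustomerID: 123456 Contact: John Doe")
print(row)
```

Each returned dictionary corresponds to one output row, with the keywords as column names.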

Hi @sven-abx,

here is a little example workflow using your example:

I hope this solves your problem.

Please let me know in case you have any questions regarding the example workflow.

Best
Kathrin


Hi @Kathrin,

thanks for your help. I knew there must be a Node for this. :smiley:
But a few more steps (data prep) need to be done, because I noticed that some CustomerIDs contain spaces, like 123 456. :sweat_smile:

Thanks for your quick help.

BR,
Sven
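
The data prep step mentioned above (IDs such as 123 456) could look roughly like this in Python. It is only a sketch: the helper name normalize_customer_id is hypothetical, and it assumes that whitespace inside an ID never carries meaning.

```python
import re


def normalize_customer_id(raw):
    """Remove every whitespace character from an extracted ID."""
    return re.sub(r"\s+", "", raw)


raw_ids = ["123456", "123 456", " 12 34 56 "]
clean_ids = [normalize_customer_id(value) for value in raw_ids]
print(clean_ids)
```

Running the cleanup after extraction, rather than before, keeps the space between the ID and the next keyword intact while the line is still being split.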

You can first use the Tika Parser node to read the PDF file(s), followed by the “Regex Extractor” node from the Palladian extension, with a regular expression like:
CustomerID\s*:\s+(?<CustId>.*?)\s+Contact\s*:\s*(?<Cont>.*?)\n
with flags set to i,s,m

The first group, named “CustId”, takes all the characters between "CustomerID : " and "Contact : ", and the second group, named “Cont”, takes all the characters after "Contact : " up to the end of each line.
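
To try the two-group idea outside KNIME, a rough Python equivalent is sketched below. Note that Python writes named groups as (?P<name>...), which differs from the Java-style syntax a KNIME node would expect. The sample text and the function name extract_pairs are made up for illustration, and the flags i, s, m are mapped to IGNORECASE, DOTALL and MULTILINE.

```python
import re

# Two named groups: "CustId" between the two labels, "Cont" up to the newline.
PATTERN = re.compile(
    r"CustomerID\s*:\s+(?P<CustId>.*?)\s+Contact\s*:\s*(?P<Cont>.*?)\n",
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)


def extract_pairs(text):
    """Return a list of (CustId, Cont) tuples, one per matching line."""
    return [(m.group("CustId"), m.group("Cont")) for m in PATTERN.finditer(text)]


# Hypothetical sample text standing in for the parsed PDF content.
sample = (
    "CustomerID: 123456 Contact: John Doe\n"
    "CustomerID: 123 456 Contact: Jane Roe\n"
)
pairs = extract_pairs(sample)
print(pairs)
```

Two caveats with this pattern: it needs a trailing newline after the last line, and with the s flag a line that has CustomerID but no Contact can make the first group run on into the following line.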

Send a sample to check the regex.

Best regards.

Hi @PBJ,

thanks for your answer. Is it possible to use the Regex Extractor node with a variable, like $ValueOfRow$\s*:\s+(?<CustId>.*?)\s+ (until a space appears followed by a-zA-Z) \s*:\s*(?<Cont>.*?)\n ?

Thanks and regards,
Sven
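
The question above, a keyword supplied as a variable with the value running until a space followed by letters and a colon, can be sketched in Python as follows. This only illustrates the regex logic; whether and how the Regex Extractor node accepts a flow variable is not covered here. The function name value_after is hypothetical, and the sketch assumes that the next label consists of the letters a-z/A-Z only.

```python
import re


def value_after(keyword, line):
    """Return the text following '<keyword>:' up to the next 'Label:' or line end."""
    pattern = re.compile(
        # re.escape guards against regex metacharacters in the keyword.
        re.escape(keyword)
        + r"\s*:\s*(?P<value>.*?)"
        # Stop before whitespace + letters + colon (the next label), or at the end.
        + r"(?=\s+[A-Za-z]+\s*:|\s*$)",
        re.IGNORECASE,
    )
    match = pattern.search(line)
    return match.group("value") if match else None


line = "CustomerID: 123 456 Contact: John Doe"
customer_id = value_after("CustomerID", line)
contact = value_after("Contact", line)
print(customer_id, contact)
```

Because the stop condition is a lookahead for the next label rather than for any space, values that contain spaces themselves (such as 123 456 or John Doe) are kept whole.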
