Extract text that matches a regular expression

  • Is there a node that can extract text matching a regular expression? To begin with, I would like to extract email addresses from HTML pages with the regular expression:

\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}\b

  • What type of document is expected, if I get this error message?

---------------------------
Dialog cannot be opened
---------------------------
The dialog cannot be opened for the following reason:
No column in spec compatible to "DocumentValue".
---------------------------
OK   
---------------------------

Alex


 

 

Hi

The node for your problem is RegExSplit. There is already a solution in this forum that can be adapted to your use case.

You may also need some nodes from the textprocessing extension, which would explain the error message you reported, "No column in spec compatible to "DocumentValue"". That node needs at least one input column of type Document, not String. The textprocessing extension includes nodes that convert from String to Document and vice versa.

 

Frank

 

I tried a very simple example:

File Reader -> RegExSplit

File:

...

serhatli@itu.edu.tr>after
esin@itu.edu.tr>after
esezer@itu.edu.tr>after

...

RegExSplit:

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

Option: Ignore case turned on:

The result is zilch: nothing is extracted.

What am I doing wrong?

Hi

 

I have attached a very simple example that extracts email addresses. Please check whether it works for your use case, or feel free to adapt it.
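
For anyone reading without the attachment, here is a minimal sketch of the same extraction outside KNIME, written in Python. It is not the attached workflow. It reuses the pattern and the sample lines from the post above. The names EMAIL_RE, WHOLE_LINE_RE and extract_emails are made up for illustration. The second pattern rests on an assumption that is not confirmed in this thread: if a node requires the pattern to match the whole cell and only returns capture groups, the trailing ">after" would prevent a match unless the pattern also covers the surrounding text and wraps the address in parentheses.

```python
import re

# Pattern from the thread; re.IGNORECASE mirrors the "Ignore case" option.
EMAIL_RE = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)

# Assumption: a variant for tools that need a whole-cell match with a
# capture group. The address is group 1; ".*?" and ".*" absorb the rest.
WHOLE_LINE_RE = re.compile(
    r".*?\b([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b.*", re.IGNORECASE
)


def extract_emails(text):
    """Return every substring of text that matches EMAIL_RE, in order."""
    return EMAIL_RE.findall(text)


# Sample lines copied from the post above.
sample = "serhatli@itu.edu.tr>after\nesin@itu.edu.tr>after\nesezer@itu.edu.tr>after"

emails = extract_emails(sample)
print(emails)

first_line = sample.splitlines()[0]
# Searching finds the address, but the bare pattern does not match the whole line.
print(EMAIL_RE.fullmatch(first_line))
# The whole-line variant matches and exposes the address as group 1.
print(WHOLE_LINE_RE.fullmatch(first_line).group(1))
```

Searching for all matches is the simplest approach when the text contains several addresses per line; the whole-line variant only returns the first address in each line.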

 

 

Frank