Extract text that matches a regular expression

akosgmbh · June 17, 2013, 1:14pm

Is there a node that can extract text matching a regular expression? To start with, I would like to extract email addresses from HTML pages with the regular expression:

\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}\b
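Outside of KNIME, the intended behaviour of this pattern can be sketched in a few lines of Python. This is only an illustration of what "extract every match" means, not a KNIME node configuration; the function name `extract_emails` and the HTML snippet are made up for the example.

```python
import re

# The pattern from the question. re.IGNORECASE lets [a-z] match uppercase too.
EMAIL_RE = re.compile(
    r"\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}\b",
    re.IGNORECASE,
)


def extract_emails(text):
    """Return every substring of text that matches EMAIL_RE, in order."""
    # The pattern has no capture groups, so findall returns whole matches.
    return EMAIL_RE.findall(text)


# Hypothetical HTML snippet, invented for illustration only.
html = '<a href="mailto:Jane.Doe@example.org">Jane</a>, <b>bob@example.com</b>'
print(extract_emails(html))
```

Note that `{2,4}` limits the last domain label to four letters, so an address ending in a longer top-level domain is not matched by this pattern.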

What type of document is expected, if I get this error message?

---------------------------
Dialog cannot be opened
---------------------------
The dialog cannot be opened for the following reason:
No column in spec compatible to "DocumentValue".
---------------------------
OK
---------------------------

Alex

frank · June 18, 2013, 9:21am

Hi

The node for your problem is RegExSplit. In this forum there is already a solution that can be adapted to your algorithm.

You may have to use some nodes from the textprocessing extension. In that case the error message you reported, No column in spec compatible to "DocumentValue", makes sense: the node needs at least one column of type Document, not String. The textprocessing extension includes nodes that convert from String to Document and vice versa.

Frank

akosgmbh · June 21, 2013, 4:58pm

I tried a very simple example:

File Reader -> RegExSplit

File:

...

serhatli@itu.edu.tr>after
esin@itu.edu.tr>after
esezer@itu.edu.tr>after

...

RegExSplit:

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

The option "Ignore case" is turned on.

The result is zilch: nothing is extracted.

What am I doing wrong?
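One possible explanation, stated here as an assumption rather than something confirmed in this thread: if the node requires the pattern to match the whole cell and only outputs capture groups, then a pattern without parentheses that stops before ">after" yields nothing. The Python sketch below reproduces that difference on the three sample lines; `fullmatch` stands in for the assumed whole-cell matching.

```python
import re

# Pattern exactly as entered in the node, with "Ignore case" emulated by a flag.
pattern = r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
lines = [
    "serhatli@itu.edu.tr>after",
    "esin@itu.edu.tr>after",
    "esezer@itu.edu.tr>after",
]

# Whole-line matching fails: the pattern does not cover the trailing ">after".
whole = [re.fullmatch(pattern, line, re.IGNORECASE) for line in lines]
print(whole)

# Wrapping the pattern in a capture group and allowing arbitrary text
# around it lets the whole line match and exposes the address as group 1.
grouped = re.compile(r".*?(" + pattern + r").*", re.IGNORECASE)
emails = [grouped.fullmatch(line).group(1) for line in lines]
print(emails)
```

If the assumption holds, the equivalent change in the node would be to surround the address pattern with parentheses and add `.*` on both sides.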

frank · June 28, 2013, 10:12am

Hi

I have attached a very simple example that extracts email addresses. Please check whether it works for your use case, or feel free to adapt it.

Frank

email-addresses.zip
