Help with a simple regex problem...

bioart · September 20, 2014, 4:55pm

Hi,

I'm trying to extract an email from a string and I'm having a case of brain freeze. The approach I would use in Perl/Java isn't working, and I'm not sure whether I'm missing something.

Here are sample strings (each block is one line, disregard the wrapping):

[This post has been edited by KNIME since it contained confidential customer data.]

My regex that kinda works (a published email validation regex doesn't work, so I simplified for now):

.*\s(.*@.*)

The one that I would love to put in place is closer to:

.*\s([_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)@[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,}))\.? (but again, it doesn't work)
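One possible reason, offered as a guess since the post does not say where the pattern was typed: the doubled backslashes look like Java string-literal escaping. If the pattern is pasted as-is into a field that takes raw regex text (an assumption about the KNIME dialog), then `\\.` means a literal backslash followed by any character, not a literal dot. A minimal Java sketch of the difference, with illustrative class and field names and a hypothetical input:

```java
import java.util.regex.Pattern;

final class EscapeDemo {
    // Regex text \. : a literal dot. In Java source every backslash is doubled.
    static final Pattern LITERAL_DOT = Pattern.compile("\\.");

    // Regex text \\. : a literal backslash followed by any character.
    // This is what the engine sees if the Java-escaped form is pasted in raw.
    static final Pattern BACKSLASH_THEN_ANY = Pattern.compile("\\\\.");

    public static void main(String[] args) {
        String host = "example.org"; // hypothetical input, not from the thread

        System.out.println(LITERAL_DOT.matcher(host).find());        // true
        System.out.println(BACKSLASH_THEN_ANY.matcher(host).find()); // false
    }
}
```

Separately, the group `(\\.[_A-Za-z0-9-]+)` before the `@` has no `*` after it, so as written the pattern requires at least one dot in the local part of the address.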

The main problem is that nothing I do to remove the trailing period works. I have tried:

.*\s(.*@.*)\.?

.*\s(.*@.*)\b

etc...

Any thoughts?

cheers

art
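A note on why the trailing-period attempts above behave the way they do. The sketch below assumes the whole input has to match the pattern (as `Matcher.matches()` requires in Java); whether the KNIME node works that way is an assumption here. The sample line is made up, since the original data was removed from the post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TrailingPeriodDemo {
    // Returns group 1 when the WHOLE input matches the regex, otherwise null.
    static String capture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "Contact: jane.doe@example.org."; // hypothetical sample

        // Greedy tail: the group swallows the final period and \.? matches nothing.
        System.out.println(capture(".*\\s(.*@.*)\\.?", line));  // jane.doe@example.org.

        // Reluctant tail: the group stops as early as the rest of the pattern allows,
        // so \.? gets to consume the final period.
        System.out.println(capture(".*\\s(.*@.*?)\\.?", line)); // jane.doe@example.org

        // \b variant: the boundary sits before the final period, which is then
        // left over, so a whole-input match fails.
        System.out.println(capture(".*\\s(.*@.*)\\b", line));   // null
    }
}
```

The reluctant `.*?` only helps under whole-input matching. With a search-style match (`find()`), it would stop right after the `@`.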

aborg · September 20, 2014, 8:11pm

If you could specify how they do not work, that might help diagnose the problem.

Here, you can see my attempt with the following regex: [\s<](\w+@\w+(?:\.\w+)+)

Cheers, gabor
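For readers without the linked example, here is a rough Java sketch of how the regex above behaves when used as a search (`find()`) rather than a whole-string match. The class name, helper method, and input strings are all made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindEmailDemo {
    // The pattern from the post above, written as a Java string literal.
    static final Pattern EMAIL = Pattern.compile("[\\s<](\\w+@\\w+(?:\\.\\w+)+)");

    // Returns the first captured address, or null if nothing is found.
    static String firstEmail(String input) {
        Matcher m = EMAIL.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical inputs, not the data from the thread.
        System.out.println(firstEmail("J. Doe <jdoe@example.org>."));      // jdoe@example.org
        System.out.println(firstEmail("E-mail: 12345@mail.example.org.")); // 12345@mail.example.org
        System.out.println(firstEmail("E-mail: jane.doe@example.org"));    // null
    }
}
```

The trailing period is dropped because each `\.\w+` needs a word character after the dot. The last case returns nothing because `\w+` does not allow a dot (or hyphen or plus) before the `@`.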

bioart · September 21, 2014, 6:10am

Thanks, this helps a bit, but I'm still missing some matches. Here's my modified version (based on yours):

.*[\s<]([\.\-\+\w\d]+(?:\s?)@(?:\s?)[\.\+\-\w\d]+(?:\.\w+)+)

This matches the right things on the list below at regex101: http://regex101.com/r/mY5kO8/1 (I'm pasting the list at the end).

But when I run it within KNIME, none of the numeric emails (except for "94007@imas.imim.es") match...

[This post has been edited by KNIME since it contained confidential customer data.]

bioart · September 22, 2014, 4:12pm

Hi again,

To clarify, the data I posted was public data from GenBank (I would never post confidential info on a forum like this). I think there's either a bug in how KNIME is parsing the string, or some slight difference in the interpretation of the regex that I'm missing. I can use the regex in Java/Perl without issues, but KNIME misses many of the hits. The message I get is not that useful:

"545 input string(s) did not match the pattern or contained more groups than expected". Is there a way to increase verbosity?

I think it has to do with the end of string handling, but nothing I try works.

Cheers.
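One difference that could explain the end-of-string suspicion, offered as an assumption rather than a confirmed diagnosis: regex101 searches for the pattern anywhere in the line, while the "did not match the pattern" message suggests the node may require the whole string to match, as `Matcher.matches()` does in Java. Under that assumption, anything left after the address (such as a trailing period) makes the whole row fail. A minimal sketch with a hypothetical input and illustrative names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WholeMatchDemo {
    // The modified pattern from the thread, written as a Java string literal.
    static final String EMAIL =
        ".*[\\s<]([\\.\\-\\+\\w\\d]+(?:\\s?)@(?:\\s?)[\\.\\+\\-\\w\\d]+(?:\\.\\w+)+)";

    // Search semantics: the pattern may match any part of the input.
    static String viaFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Whole-input semantics: the pattern must consume the entire input.
    static String viaMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "E-mail: 12345@example.org."; // hypothetical sample

        System.out.println(viaFind(EMAIL, line));           // 12345@example.org
        System.out.println(viaMatches(EMAIL, line));        // null
        // Letting the pattern absorb whatever follows the address.
        System.out.println(viaMatches(EMAIL + ".*", line)); // 12345@example.org
    }
}
```

If the node does match whole strings, appending `.*` after the capturing group is one way to let it ignore the rest of the line. This sketch does not explain why the numeric addresses in particular were affected.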