RegEx Filter - Need to filter specific phrase + word

gustavo.velho · June 9, 2016, 3:27pm

Hello,

I've been experimenting with RegEx and some online testing tools but couldn't figure this out, so I'm posting here in case someone can help. I have several documents, which are social media posts, and there's a column that contains a phrase:

TWEET FROM: TwitterHandle

My goal is to remove all Twitter Handles (or else they impact text processing), but the only reference I have is the "TWEET FROM" phrase. How can I filter the whole phrase using RegEx Filter?

Thanks!

Gustavo

Geo · June 9, 2016, 10:44pm

Check your expression with www.regexr.com.

gustavo.velho · June 10, 2016, 3:39am

Hi Geo, I did, and although the expression works on regexr.com, it doesn't work in the RegEx Filter node. It seems to be an issue with the whitespace between words. Even if I simply enter "TWEET FROM", nothing is filtered. If I use "TWEET" or "FROM" alone, it does work. Any ideas?

Thanks

Gustavo

marco_ghislanzoni · June 10, 2016, 12:38pm

The RegEx Filter node, part of Text Processing, acts on a tokenized version of the input document, meaning that it looks at each single term (or word) in the document and tries to match it with the regular expression. If it matches, the term (or word) is removed from the document.

This node will not match and remove multiple terms (words) at once because of the tokenization. This is why it seems to work on TWEET or FROM alone, but not on both at the same time.
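To make the tokenization point concrete, here is a rough Python sketch of term-by-term filtering. It is not the node's actual implementation: the `tokenize` helper is a hypothetical stand-in for the real tokenizer, and the assumption that each term must match the expression in full is mine.

```python
import re

def tokenize(text):
    # Hypothetical stand-in tokenizer: words and punctuation become separate terms.
    return re.findall(r"\w+|[^\w\s]", text)

def filter_terms(text, pattern):
    # Test each term on its own and drop those that match the pattern in full.
    regex = re.compile(pattern)
    return [term for term in tokenize(text) if not regex.fullmatch(term)]

post = "TWEET FROM: TwitterHandle Great game tonight"

# A single-word pattern matches one term, so that term is removed.
print(filter_terms(post, "TWEET"))

# A pattern containing a space can never equal a single term, so nothing is removed.
print(filter_terms(post, "TWEET FROM"))
```

Because no individual term ever contains a space, the second call returns every term unchanged, which is the behaviour described in the question.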

For what you are trying to achieve, the String Replacer node with a regular expression as the pattern will do the job, because it works on the whole string rather than on individual terms.
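As an illustration of the whole-string approach, here is a minimal Python sketch of the kind of expression that could be used. The pattern is an assumption on my part: it presumes the handle is a single run of non-whitespace characters directly after "TWEET FROM:", so adjust it to your data. The function name is made up for the example.

```python
import re

# Assumed pattern: the literal phrase, optional whitespace, then the handle
# (one run of non-whitespace characters) and any whitespace that follows it.
TWEET_FROM = re.compile(r"TWEET FROM:\s*\S+\s*")

def strip_tweet_from(text):
    # Applied to the whole string, so the space inside the phrase is not a problem.
    return TWEET_FROM.sub("", text)

print(strip_tweet_from("TWEET FROM: TwitterHandle Great game tonight"))
# prints: Great game tonight
```

The pattern uses only basic constructs (`\s`, `\S`, `*`, `+`), so it should carry over to other regex flavours, but test it in the node itself before relying on it.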

Cheers,
Marco.

gustavo.velho · June 10, 2016, 3:29pm

Thank you Marco! It worked perfectly now. Appreciate your help!

Gustavo
