I've been playing with RegEx and some online testing tools but couldn't figure this out, so I'm posting here in case someone can help, please! I have several documents, which are social media posts, and one column contains a phrase of this form:
TWEET FROM: TwitterHandle
My goal is to remove all Twitter Handles (or else they impact text processing), but the only reference I have is the "TWEET FROM" phrase. How can I filter the whole phrase using RegEx Filter?
Hi Geo, I did, and although the expression works on regexr.com, it doesn't work in the RegEx Filter node. It seems to be an issue with the whitespace between words. Even if I simply enter "TWEET FROM", nothing is filtered. If I use "TWEET" or "FROM" alone, it does work. Any ideas?
The RegEx Filter node, part of Text Processing, acts on a tokenized version of the input document, meaning that it looks at each single term (or word) in the document and tries to match it with the regular expression. If it matches, the term (or word) is removed from the document.
This node will not match and remove multiple terms (words) at once because of the tokenization. This is why it seems to work on TWEET or FROM alone, but not on both at the same time.
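To illustrate the point outside of KNIME, here is a rough Python sketch of per-term filtering. It is not the node's actual implementation: the `tokenize` and `filter_terms` helpers are hypothetical, and the simple word/punctuation split only stands in for whatever tokenizer the node really uses.

```python
import re

def tokenize(text):
    # Assumption: a crude tokenizer that splits words and punctuation apart.
    # The real node's tokenizer may behave differently.
    return re.findall(r"\w+|[^\w\s]", text)

def filter_terms(text, pattern):
    # Test the pattern against each term on its own and drop the terms
    # that match completely, keeping everything else.
    regex = re.compile(pattern)
    return [term for term in tokenize(text) if not regex.fullmatch(term)]

text = "TWEET FROM: TwitterHandle Great weather today"

# A single-word pattern matches one term, so that term is dropped.
print(filter_terms(text, r"TWEET"))

# A pattern containing a space can never match a single term,
# so nothing is removed.
print(filter_terms(text, r"TWEET FROM"))
```

Because every term is checked in isolation, a pattern that spans a space has nothing it could ever match.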
For what you are trying to achieve, the String Replacer node, used with a regular expression as the pattern, will do the job.
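As a sketch of the kind of expression that could work there, here is the same idea in Python, applied to the whole string instead of single terms. The pattern is an assumption on my side: it expects the handle to be one run of non-whitespace characters right after "TWEET FROM:", and `strip_handle` is just a hypothetical helper name. The pattern uses only basic syntax, so it should carry over to the node, but please test it on your data.

```python
import re

# Assumption: the handle is a single run of non-whitespace characters
# that follows "TWEET FROM:" (with optional whitespace in between).
HANDLE_PATTERN = re.compile(r"TWEET FROM:\s*\S+\s*")

def strip_handle(text):
    # Replace the whole "TWEET FROM: handle" phrase with an empty string.
    return HANDLE_PATTERN.sub("", text)

print(strip_handle("TWEET FROM: TwitterHandle Great weather today"))
# Great weather today
```

Rows that do not contain the phrase are left unchanged, since the replacement only fires where the pattern matches.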
Thank you Marco! It worked perfectly now. Appreciate your help!