Textprocessing: URL Filter

julien.divernois · May 14, 2014, 10:36am

Hello,

I have to filter URLs in a table of documents (plain-text descriptions). After filtering punctuation, the URLs look like this:

http//wwwyoutubecom/ex/example

My idea was to filter these with a RegEx Filter node. My regex for those URLs is:

https?\/\/(\w*\/)*\w*

It works fine when I test it on various regex tester websites, but it still doesn't filter the URLs in my workflow!

Am I missing something?
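
For reference, here is a minimal sketch of what a regex tester does with this pattern on the raw, untokenized string. Python's `re` module stands in for the tester here (an assumption; KNIME itself is Java-based), and the sample sentence is made up for illustration.

```python
import re

# Pattern from the post; the backslash-escaped slashes are harmless in Python.
URL_PATTERN = re.compile(r"https?\/\/(\w*\/)*\w*")

# Made-up sample sentence containing the punctuation-stripped URL from the post.
text = "see http//wwwyoutubecom/ex/example for details"

# search() scans the whole string, the way an online regex tester would.
match = URL_PATTERN.search(text)
matched = match.group(0) if match else None
print(matched)
```

On the whole string the pattern finds the complete stripped URL, which is consistent with the tester results described above.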

Yours,

Julien

richards99 · May 15, 2014, 7:45am

First, I assume the documents are read in using the Flat File Document Parser or one of the other readers in the IO section of the Text Processing nodes.

If you want it filtered from the documents themselves, you need to ensure deep processing is enabled in the RegEx Filter node. Otherwise the filtering is only applied to the BagOfWords Term column.

Hope that delivers what you need.

Simon.

julien.divernois · May 15, 2014, 8:27am

I get the descriptions from a DB (Database Reader), convert them into documents (Strings to Document), process the text (--> filter URLs) and then create a bag of words. Deep processing is enabled!

julien.divernois · May 19, 2014, 9:54am

I still don't get what the problem is.

If my pattern is "http", it correctly replaces http. If it's "http//", it doesn't work. But if I use two replacers, one for "http" and one for "//", it works correctly.

richards99 · May 19, 2014, 10:43pm

Hi, I am unsure where you are going wrong.

I just replicated what you were mentioning in your messages.

I took a text file, and deliberately inserted http//wwwyoutubecom/ex/example into the middle of a sentence.

I then converted this into a Document type with the Strings to Document node, choosing this text column as the full text in the node dialog. I connected a Document Viewer to this; when you double-click on the document title you get the full document text, and looking through it I can see the http example I inserted earlier.

I then used a RegEx Filter node with the regex https?\/\/(\w*\/)*\w*, chose Deep Processing and selected the document column. I connected another Document Viewer node, and the http text from earlier has now been fully removed.

If you then also connect a BoWCreator node, you see the http examples are not in the bag of words.

You will need to clarify what is not working as I cannot find a problem, or see what happens if you replicate what I just did.

I am also wondering whether you have inadvertently applied unmodifiable flags from the enrichment tagger nodes, which would prevent the filtering. Have you tried ticking the "Ignore Unmodifiable Flag" box in the RegEx Filter node?

Thanks,

Simon.

julien.divernois · May 20, 2014, 10:18am

Hi,

Thank you for your answer. I finally resolved it using a String Replacer with "regular expression" and "all occurrences" ticked, and then converting the result back with the Strings to Document node. It's working fine this way.

I still don't know why it was not working with the Regex filter node.
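
The workaround above (replace on the raw string first, convert to a document afterwards) can be sketched roughly as follows. This is plain Python rather than KNIME; `strip_urls` and the sample descriptions are hypothetical names and data, and the whitespace clean-up is an extra step that is not part of the String Replacer settings described above.

```python
import re

# Same pattern as in the thread, applied to the untokenized string.
URL_PATTERN = re.compile(r"https?\/\/(\w*\/)*\w*")


def strip_urls(text: str) -> str:
    """Remove every occurrence of the URL pattern from a raw string."""
    without_urls = URL_PATTERN.sub("", text)
    # Extra tidy-up (an addition for readability): collapse leftover double spaces.
    return re.sub(r"\s{2,}", " ", without_urls).strip()


# Made-up descriptions standing in for the rows read from the database.
descriptions = [
    "watch http//wwwyoutubecom/ex/example today",
    "no link here",
]

cleaned = [strip_urls(d) for d in descriptions]
print(cleaned)
```

Because the replacement runs before any tokenization, the pattern sees each URL as one contiguous string.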

josequirozramos · November 20, 2014, 8:19pm

Hi Richards99, I have the same problem as Julien:

If my pattern is "http", it correctly replaces http. If it's "http//", it doesn't work. But if I use two replacers, one for "http" and one for "//", it works correctly.

I tried your suggestions to no avail. I am downloading data from Twitter and storing it in a table. Then I use the Table Reader to load the data into the workflow and convert it to documents using the Strings to Document node. Then I want to use the RegEx Filter, but I can't remove more than http. I have attached the data.

twitterdata.table

kilian.thiel · November 21, 2014, 10:02am

This is due to tokenization. The word tokenizer splits "http" and "://...." into two tokens, and the RegEx Filter node matches the regex against each word (token) separately! Applying a regex like "https?\/\/(\w*\/)*\w*" to these words will match the first word but not the second one.

You can see how the text was tokenized by creating a bag of words. An example workflow is attached.

tokenization.zip
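
To illustrate the per-token behaviour with the simple patterns reported earlier in the thread, here is a rough Python sketch. It is not the KNIME implementation: the token split is assumed from the description above, `filter_tokens` is a hypothetical helper, and partial matching (`search`) is assumed because the thread does not say whether the node requires a full match.

```python
import re

# Assumed token split, following the description above ("http" plus the rest).
# The real tokenizer output may differ; a bag of words shows the actual tokens.
tokens = ["http", "//wwwyoutubecom/ex/example"]


def filter_tokens(tokens, patterns):
    """Drop every token in which at least one pattern is found.

    Each pattern only ever sees one token at a time, never the joined text.
    """
    compiled = [re.compile(p) for p in patterns]
    return [t for t in tokens if not any(c.search(t) for c in compiled)]


only_http = filter_tokens(tokens, ["http"])           # removes the "http" token
spanning = filter_tokens(tokens, ["http//"])          # spans two tokens, removes nothing
two_patterns = filter_tokens(tokens, ["http", "//"])  # one pattern per token, removes both

print(only_http)
print(spanning)
print(two_patterns)
```

Under these assumptions the sketch reproduces the reported behaviour: "http" alone removes one token, "http//" removes nothing because no single token contains it, and the two separate patterns together remove the whole URL.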
