How to remove URLs from documents?

I'm using some of the Preprocessing nodes in the Text Processing folder to clean up some text for analysis. Removing stopwords, converting case, etc., all seems to be working fine.

I'd now like to remove URLs from the documents. I didn't see a Preprocessing node specifically for this, so I created a Java Snippet (simple) node with the following code:

return $Document$.replaceAll("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]", "");

I set the "Replace or append result" option to "Replace Column: Document". This seems to work well, in that the new Document column has the URLs removed. However, the data type of the Document column is now String, so it is not compatible with my next preprocessing node, which is Dict Replacer in my case. I tried inserting a Strings To Document node to convert the Document column back to a document type, but when I do that I get the following console error:

ERROR     Strings To Document                Configure failed (IllegalArgumentException): Table specs to join contain the duplicate column name "Document" at position 0 and 0.

Does anyone have any ideas?




Hi, I have the same issue. 

The Java Snippet node seems to keep the previous Document column in its output, even though I don't want it.



I solved this.

Before the Java Snippet, I added a Document Data Extractor node to get the text as a String.

After that, I added a Column Filter node and excluded the Document column.

The Java Snippet then works only on the String column.

After the Java Snippet, I added a Strings To Document node, and it ran without errors.
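
In case it helps anyone, here is a minimal sketch of the string-only replacement step as plain Java, so the pattern can be tried outside KNIME before pasting it into the snippet. The regex is the one from the original post; the class name UrlStripper and the method stripUrls are made up for illustration and are not part of KNIME.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only; not a KNIME API.
public class UrlStripper {

    // Same pattern as in the original post, compiled once for reuse.
    private static final Pattern URL_PATTERN = Pattern.compile(
        "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    // Removes every match of URL_PATTERN from the text.
    // A null input (assumed here to stand in for a missing cell) is passed through.
    public static String stripUrls(String text) {
        if (text == null) {
            return null;
        }
        return URL_PATTERN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "Visit http://example.com/page?x=1 for details.";
        // The whitespace around the removed URL is left untouched.
        System.out.println(stripUrls(sample));
    }
}
```

Note that the spaces on either side of a removed URL stay in place, so the cleaned text can contain double spaces. Whether that matters probably depends on how the text is tokenized when it is converted back to a document.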