I'm using some of the Preprocessing nodes in the Text Processing folder to clean up some text for analysis. Removing stopwords, converting case, etc., all seems to be working fine.
I'd now like to remove URLs from the documents. I didn't see a Preprocessing node specifically for this, so I created a Java Snippet (simple) node with the following code:
return $Document$.replaceAll("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]", "");
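To show what that expression does in isolation, here is a minimal standalone sketch in plain Java, outside KNIME. The class name UrlStripper, the method name stripUrls, and the sample sentence are only illustrative assumptions; the regular expression is the same one as in my snippet.

```java
import java.util.regex.Pattern;

public class UrlStripper {
    // Same expression as in the Java Snippet node, compiled once for reuse.
    // The last character class omits trailing punctuation such as ',' and '.',
    // so punctuation right after a URL is left in the text.
    private static final Pattern URL = Pattern.compile(
        "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    // Hypothetical helper: returns the input with every matched URL removed.
    public static String stripUrls(String text) {
        return URL.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // prints "See , then reply."
        System.out.println(stripUrls("See http://example.com/page?x=1, then reply."));
    }
}
```

Note that the result is an ordinary String, which matches what I see in the node output: the text is cleaned, but it is no longer a Document cell.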
I set the "Replace or append result" option to "Replace Column: Document". This works, in that the Document column now has URLs removed. However, the column's data type is now String rather than Document, so it is not compatible with my next preprocessing node (Dict Replacer in my case). I tried inserting a Strings To Document node to convert the column back to a document type, but when I do that, I get the following console error:
ERROR Strings To Document Configure failed (IllegalArgumentException): Table specs to join contain the duplicate column name "Document" at position 0 and 0.
Does anyone have any ideas?
Thanks,
Steve