Hello!
I would like to know which node or nodes I can use to eliminate the dashes that join two words. In the BoW I get "consideration-published", and what I want is to remove the dash and get two independent terms (term1: consideration, term2: published).
As far as I know, the preprocessing nodes can filter specific characters (e.g. punctuation), but the term stays a single term, and I guess you only want to filter the dashes and no other characters.
Two possible approaches come to mind.
1)
You could use the String Manipulation node. However, this node only works on String columns. So either you extract the Strings from your documents with the Document Data Extractor, process them with the String Manipulation node and create documents again with the Strings To Document node, or you put the String Manipulation node in front of the Textprocessing pipeline, before the first document creation. Within the String Manipulation node, you could use regexReplace($COL_NAME$, "\\b-\\b", " ") to replace a dash that sits between two words with a whitespace character (the backslashes are doubled because the expression is parsed like a Java string literal). Afterwards the Strings To Document node will tokenize the String and return the split word as two terms.
This detour is needed because tokenization happens in the Strings To Document node; the given textprocessing nodes cannot split one existing term into multiple terms afterwards.
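To make the effect of the replacement concrete, here is a minimal sketch in plain Python (outside KNIME). The function name split_hyphenated is made up for illustration, and the whitespace split at the end is only a stand-in for the real tokenizer; the pattern is one way to express "a dash directly between two word characters" and is not necessarily identical to what the node does internally.

```python
import re

# A hyphen with a word character directly on both sides (assumed pattern,
# chosen for illustration; lookarounds keep the neighbouring letters intact).
INTRA_WORD_DASH = re.compile(r"(?<=\w)-(?=\w)")


def split_hyphenated(text: str) -> str:
    """Replace dashes that join two words with a single space."""
    return INTRA_WORD_DASH.sub(" ", text)


if __name__ == "__main__":
    cleaned = split_hyphenated("under consideration-published work")
    # Whitespace split as a crude stand-in for tokenization.
    print(cleaned.split())  # ['under', 'consideration', 'published', 'work']
```

A dash surrounded by spaces (as in "a - b") is left alone, since only dashes glued to words on both sides match.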
2)
Another possible solution (maybe more convenient):
You could use the OpenNLP Simple Tokenizer for the Strings To Document node at the beginning of your textprocessing pipeline. This tokenizer creates terms based on sequences of characters belonging to the same character class. So “consideration”, “-” and “published” would be three separate terms, because “-” belongs to a different character class than the two words.
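The idea of character-class tokenization can be sketched in a few lines of Python. This is a simplified illustration, not OpenNLP's actual implementation: the four classes (space, alpha, digit, other) and the names char_class and simple_tokenize are assumptions made for the example.

```python
from itertools import groupby


def char_class(ch: str) -> str:
    """Assign a character to one of four coarse classes (assumed for this sketch)."""
    if ch.isspace():
        return "space"
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return "other"


def simple_tokenize(text: str) -> list[str]:
    """Group consecutive characters of the same class; drop whitespace runs."""
    return [
        "".join(group)
        for cls, group in groupby(text, key=char_class)
        if cls != "space"
    ]


if __name__ == "__main__":
    print(simple_tokenize("consideration-published"))
    # ['consideration', '-', 'published']
```

Note that the dash still shows up as its own term, so it has to be filtered out afterwards, for example with a punctuation filter.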
Hi!
Thanks for your answer. I didn’t know the String Manipulation node, I appreciate your contribution.
I have used the OpenNLP Simple Tokenizer and I got what I wanted.