We use a workflow to do some data preparation:
... StringsToDocument --> PunctuationErasure (DeepProcessing) ... --> DocumentDataExtractor ... Postprocessing.
For a text snippet like "... Monday, January 14, and ..." the result is "... MondayJanuary 14and ...", so not only the punctuation but also the blank space following it is removed.
I need the terms per document later on, and with the words fused together like this the terms cannot be detected correctly.
Is one of my settings wrong?
Thank you for the post! This is a bug in the Punctuation Erasure node. I can reproduce it and will fix it as soon as possible. As a workaround, you can use the "Replacer" node. As the regular expression, specify something like:
in the dialog. As the replacement, use a single whitespace (not an empty string).
Hope this helps.
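For anyone who wants to see the effect of this workaround outside of KNIME, here is a minimal Python sketch. The character class is an assumption chosen for illustration only; it is neither the expression suggested above nor the one the node uses. The helper name `replace_punctuation` is hypothetical, and collapsing the doubled spaces afterwards is an extra step that is not part of the Replacer workaround itself.

```python
import re

# Assumed punctuation set for illustration; not the pattern used by the KNIME node.
PUNCTUATION = re.compile(r"[.,;:!?]")

def replace_punctuation(text: str) -> str:
    """Replace each punctuation mark with one space, then collapse repeated spaces."""
    spaced = PUNCTUATION.sub(" ", text)
    # "Monday, January" becomes "Monday  January" (two spaces), so collapse runs.
    return re.sub(r" {2,}", " ", spaced).strip()

print(replace_punctuation("Monday, January 14, and"))  # Monday January 14 and
```

Replacing with a space rather than an empty string keeps the word boundaries intact, which is what matters for detecting the terms later.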
Thanks for the fast reply and the workaround.
I did this and, as a second alternative, used the BoW and the nodes without deep preprocessing, then concatenated the individual words back into a string afterwards.
Btw: I am not sure, but I think that the NChar Filter also cuts the blanks.
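The second alternative described above (work on individual terms, drop the punctuation, then join the words again) can be sketched roughly like this in Python. This only mirrors the idea; the tokenization rule and the function name `terms_without_punctuation` are assumptions and do not correspond to what the KNIME nodes do internally.

```python
import re

def terms_without_punctuation(text: str) -> list:
    """Split text into word and punctuation tokens and keep only the words."""
    # Each token is either a run of word characters or one non-space symbol.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [t for t in tokens if re.fullmatch(r"\w+", t)]

terms = terms_without_punctuation("Monday, January 14, and")
print(" ".join(terms))  # Monday January 14 and
```

Because the words are joined with an explicit separator at the end, no blanks can get lost along the way.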
I would also like to know which punctuation marks are erased by the Punctuation Erasure node. I ask because I am using the Wildcard Tagger to tag multiple terms, and many of them are hyphenated (for example, "post-secondary" or "pre-apprenticeship"). But I use the Punctuation Erasure node before tagging, like so:
... --> Punctuation Erasure --> Wildcard Tagger --> ...
so I'm not sure whether to still use hyphens in the regular expressions I set up for the Wildcard Tagger.
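One way to sidestep the question might be to write the expressions so that they match whether or not the hyphen survives the erasure step. The Python sketch below only illustrates that idea; the pattern is hypothetical, and whether the Wildcard Tagger accepts this exact syntax and applies it across term boundaries is an assumption that would need to be checked in the node itself.

```python
import re

# Hypothetical pattern: accepts a hyphen, a space, or nothing between the two parts.
PATTERN = re.compile(r"\bpost[- ]?secondary\b", re.IGNORECASE)

for variant in ("post-secondary", "post secondary", "postsecondary"):
    # Each variant is reported together with whether the pattern matched it.
    print(variant, bool(PATTERN.search(variant)))
```

With a pattern like this, the tagging would not depend on how the upstream node treats hyphens.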
The Punctuation Erasure node uses the following regular expression to find punctuation marks: