Punctuation Erasure

Hello,

we use a workflow to do some data preparation.

... StringsToDocument --> PunctuationErasure (DeepProcessing) ... --> DocumentDataExtractor ... Postprocessing.

When I have a text snippet like "... Monday, January 14, and ..." the result is "... MondayJanuary 14and ..." where not only the punctuation, but also the following blank spaces are removed.

Since I need the terms per document later on, this is a problem: the terms cannot be detected correctly.

Is one of my settings wrong?

Thanks

Hi Bernd,

thank you for the post! This is a bug in the Punctuation Erasure node. I can reproduce it and will fix it asap. As a workaround, you can use the "Replacer" node. As the regular expression, specify something like:

[!#$%&'\"*+,.\?:;]+

in the dialog. As the replacement, use a single whitespace (not an empty string).
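
To see what this replacement does to a plain string, here is a minimal sketch in Python. It is only an illustration outside of KNIME: the function name replace_punctuation is made up for this example, Python's re module stands in for the Java regular expressions KNIME evaluates, and it does not reproduce the node's bug. It shows that replacing with a whitespace keeps the words separated, and that the doubled blanks can be collapsed afterwards.

```python
import re

# Pattern copied from the workaround above. Python's re module is used
# here as a stand-in; KNIME itself evaluates Java regular expressions.
PUNCTUATION = re.compile(r"""[!#$%&'\"*+,.\?:;]+""")


def replace_punctuation(text, replacement=" "):
    """Replace each run of punctuation with the given replacement string.

    Hypothetical helper for illustration; not part of KNIME.
    """
    return PUNCTUATION.sub(replacement, text)


snippet = "Monday, January 14, and"

# Each comma becomes a blank, so the words stay separated
# (with a doubled blank where a comma was followed by a space).
spaced = replace_punctuation(snippet)

# Splitting on whitespace recovers the individual terms.
terms = spaced.split()

# Joining the terms again collapses the doubled blanks.
collapsed = " ".join(terms)

print(repr(spaced))
print(terms)
print(collapsed)
```

If the doubled blanks are a problem later in the workflow, they can be collapsed in a separate step, as in the last lines of the sketch.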

Hope this helps.

Cheers, Kilian

Hello Kilian,

thanks for the fast reply and the workaround.

I did this and, as a second alternative, used the BOW with the nodes without deep preprocessing, then concatenated the individual words back into a string afterwards.

Btw: I am not sure, but I think that the NChar Filter also cuts the blanks.

Thanks

Bernd

Hello,

I would also like to know which punctuation marks are erased by the Punctuation Erasure node. I ask because I am using the Wildcard Tagger to tag multiple terms, and many of them are hyphenated (for example, "post-secondary" or "pre-apprenticeship"). But I use the Punctuation Erasure node before tagging, like so:

... --> Punctuation Erasure --> Wildcard Tagger --> ...

So I'm not sure whether I should still use hyphens in the regular expressions I set up for the Wildcard Tagger.

Thanks,

Vigile

The Punctuation Erasure node uses the following regular expression to find punctuation marks:

"[!#$%&'\"()*+,./\\:;<=>?@^_`{|}~\\[\\]]+"
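
Note that the hyphen is not part of this character class. A quick way to check which characters the expression matches is the following minimal sketch in Python. It assumes the pattern is applied exactly as quoted (unescaped from the Java string literal) and uses Python's re module in place of Java's regex engine; the function name erase_punctuation is made up for this example and is not KNIME's API.

```python
import re

# The expression above with the Java string escaping removed
# (\" becomes ", \\ becomes \). Python's re module is used as a stand-in
# for Java's regex engine; for this character class they should agree.
PUNCTUATION = re.compile(r"""[!#$%&'"()*+,./\:;<=>?@^_`{|}~\[\]]+""")


def erase_punctuation(term):
    """Remove every character matched by the pattern (hypothetical helper)."""
    return PUNCTUATION.sub("", term)


examples = ["post-secondary", "pre-apprenticeship", "e.g.", "(draft)"]
results = {term: erase_punctuation(term) for term in examples}

for term, cleaned in results.items():
    print(term, "->", cleaned)

# The hyphen is not in the character class, so it is never matched.
hyphen_matched = PUNCTUATION.search("-") is not None
print("hyphen matched:", hyphen_matched)
```

Under these assumptions, hyphenated terms such as "post-secondary" pass through unchanged, so hyphens would still be needed in the Wildcard Tagger expressions.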

Cheers, Kilian