N Chars Filter captures whitespace

After trying to diagnose some strange tokenising behaviour, I've determined that the cause was the N Chars Filter removing punctuation marks paired with spaces. So the string "one, two" became "onetwo". At least, this is how it looks in the Document Viewer. When you create a bag of words, the terms are still separate. BUT, if you then use the Document Data Extractor to turn the documents into strings, the words are combined in the output strings -- and this can be a real nuisance.
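To make the effect concrete, here is a rough sketch in plain Python (not KNIME's actual implementation, just my guess at what is going on): a purely length-based term filter drops the short punctuation-plus-space term, and a naive re-join of the surviving terms then glues the words together.

```python
# Rough illustration only -- plain Python, not KNIME internals.
# Assumes the N Chars Filter simply drops every term shorter than N characters.

def n_chars_filter(terms, n):
    """Drop terms with fewer than n characters (my reading of the node)."""
    return [t for t in terms if len(t) >= n]

# Tokenisation that keeps the punctuation mark paired with its space as a term.
terms = ["one", ", ", "two"]              # original text: "one, two"

filtered = n_chars_filter(terms, n=3)     # the 2-character term ", " is removed
print(filtered)                           # ['one', 'two'] -- bag of words still looks fine

# Re-joining the terms without any separator reproduces what the
# Document Data Extractor gives me:
print("".join(filtered))                  # 'onetwo'
```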

Presumably this filter is intended to be used after removing punctuation, but in this case I used it beforehand because I wanted to retain punctuation prior to using the NGram creator (which seems to take punctuation into account, though I could be mistaken).

Anyway, this is easy enough to work around. But still, this does not seem like the most logical behaviour to expect from the N Chars Filter. Wouldn't it make more sense for this node to filter out only strings of non-whitespace characters?
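For what it's worth, the behaviour I have in mind would look roughly like this (again only a plain-Python sketch of the suggestion, not a patch): drop a short term only if it consists entirely of non-whitespace characters, so separator-like terms survive the filter.

```python
import re

def n_chars_filter_suggested(terms, n):
    """Drop a term only if it is shorter than n characters AND contains
    no whitespace, so whitespace-bearing terms are left alone."""
    return [t for t in terms if len(t) >= n or re.search(r"\s", t)]

terms = ["one", ", ", "two"]
kept = n_chars_filter_suggested(terms, n=3)
print("".join(kept))                      # 'one, two' -- the spacing survives
```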

The attached workflow replicates the behaviour I have described, both in relation to the N Chars Filter and the Document Data Extractor.

How exactly have you transformed the text into documents? With the Strings To Documents node?

Yes, they came from the Strings to Documents node, but before that they came from the PDF Parser and then went through a lot of modifications with the String Manipulation node.

I am a new user of KNIME. I have been using KNIME to make sense of plain text. However, I see that a couple of nodes, such as N Chars Filter, Number Filter, and Punctuation Erasure, do not clean the document (i.e. the preprocessed document at each stage still contains words that were supposed to have been filtered out by these nodes).

Any help will be highly appreciated.

Thanks