I know this is an old issue, but it still causes me problems, and I’ve reached a point where I can’t find a way to work around it.
Sometimes I really want to work with string versions of documents that I have preprocessed using the text processing nodes. (In the present case, I want to put each word in its own cell in a single column so I can divide the documents into chunks of a fixed number of words, or work with a rolling window.) But when you convert documents to strings using the Document Data Extractor, some terms often get joined together, even though they appear as separate terms in the bag of words. The resulting string is essentially corrupted and is not suitable for analysis.
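To make the goal concrete, here is a minimal sketch in plain Python of the chunking and rolling-window step I have in mind. It is not KNIME code, and the function names `fixed_chunks` and `rolling_windows` are made up for illustration. It assumes the extracted string is cleanly separated by whitespace, which is exactly the assumption the joined terms break.

```python
from typing import List


def fixed_chunks(words: List[str], size: int) -> List[List[str]]:
    """Split a word list into consecutive chunks of `size` words.

    The last chunk may be shorter if the word count is not a multiple of `size`.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [words[i:i + size] for i in range(0, len(words), size)]


def rolling_windows(words: List[str], size: int, step: int = 1) -> List[List[str]]:
    """Return overlapping windows of exactly `size` words, advancing by `step`.

    Returns an empty list if there are fewer than `size` words.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    last_start = len(words) - size
    return [words[i:i + size] for i in range(0, last_start + 1, step)]


# Assumption: one word per whitespace-separated token in the extracted string.
text = "the quick brown fox jumps over the lazy dog"
words = text.split()

chunks = fixed_chunks(words, 4)
windows = rolling_windows(words, 3)
```

If two terms come out of the extractor fused into one token, both the chunk boundaries and the window contents shift, which is why the string needs to match the bag of words.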
Is there any chance that this issue will be addressed in coming releases? As good as the dedicated text processing nodes are, it would really be wonderful to be able to move between documents and conventional strings more reliably.