Terms joined when converting documents to strings

I know this is an old issue, but it still causes me problems, and I’ve reached a point where I can’t find a way to work around it.

Sometimes, I really want to work with string versions of documents that I have preprocessed using the text processing nodes. (In the present case, I want to convert the words to cells in a single column so I can divide the documents into chunks of a certain number of words, or work with a rolling window.) But when you convert documents to strings using the Document Data Extractor, some terms often get joined together, even though they show as separate in the bag of words. The string version of the text is essentially corrupted and is not suitable for analysis.

Is there any chance that this issue will be addressed in coming releases? As good as the dedicated text processing nodes are, it would really be wonderful to be able to move between documents and conventional strings more reliably.

Hi @sugna

Thanks for reporting this. We already have a ticket open to fix it, and I will add your request to it as well. Just to make sure: if you only want the text, you should use “Document body text” rather than “Text”, since “Text” also includes the title, without a separating space.

Cheers,
Roland

Thanks - that’s good to hear.

In the meantime, I’ve worked around the issue by using the Replacer node to add a marker (e.g. the “|” character) to the end of every term prior to converting to strings, and then using that marker to split any wrongly concatenated terms in the string output. Works like a charm.
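
For anyone who wants to see the idea outside of KNIME, here is a minimal Python sketch of the same marker trick. It is only an illustration, not what the nodes do internally: the function names are made up, `simulate_extraction` is a hypothetical stand-in that drops the space after chosen terms to mimic the joining, and it assumes the marker character never occurs in the text itself.

```python
MARKER = "|"  # assumed not to appear anywhere in the original text


def add_marker(terms, marker=MARKER):
    """Append the marker to every term (the Replacer step)."""
    return [term + marker for term in terms]


def simulate_extraction(marked_terms, joined_after=()):
    """Hypothetical stand-in for the document-to-string conversion.

    Joins terms with spaces, except after the indices in joined_after,
    where the space is dropped to mimic wrongly concatenated terms.
    """
    parts = []
    last = len(marked_terms) - 1
    for i, term in enumerate(marked_terms):
        parts.append(term)
        if i < last and i not in joined_after:
            parts.append(" ")
    return "".join(parts)


def split_on_marker(text, marker=MARKER):
    """Recover the individual terms by splitting on the marker."""
    return [piece.strip() for piece in text.split(marker) if piece.strip()]


terms = ["quick", "brown", "fox", "jumps"]
# "brown" and "fox" get joined in the string output (no space after index 1)
extracted = simulate_extraction(add_marker(terms), joined_after={1})
recovered = split_on_marker(extracted)
```

Once the terms are recovered as a list like this, cutting them into fixed-size chunks or a rolling window is straightforward.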

That’s a great workaround, thanks for sharing this! Hopefully, you won’t need it for too long 🙂