Remove diacritics and other strange characters

I want to remove diacritics at the document level (before tagging). Unfortunately, removing diacritics is part of string manipulation, so it can't be applied to documents. Perhaps the replacer node would help, but then I would need regular expressions.

Besides, the punctuation erasure node removes only the well-known punctuation marks, not the strange characters that are left over after I convert documents before loading them into KNIME. Is there a solution for this?

Additional question: After preprocessing the documents, I want to read the results. Unfortunately, when I write the table containing the documents to disk, only an excerpt of the first part is visible. The same happens when I convert the document to a string. How can I read the processed documents?

Hi Taita,

there is no dedicated node to remove diacritics from documents / terms so far. The only way (without using the RegEx Filter) is to do the string manipulation before you create documents from the strings.

Cheers, Kilian
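
For anyone who wants to try the "clean the strings first" route, here is a minimal sketch of what that step could look like in Python, for example in a scripting node or in a separate script run before the text is loaded. This is not a built-in KNIME function: the names strip_diacritics and drop_unusual_characters are hypothetical, and the rule for what counts as a "strange" character (anything that is not a letter, digit, whitespace, or ASCII punctuation) is only an assumption that you would probably need to adapt to your data.

```python
import string
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining accents, e.g. turn "é" into "e" (hypothetical helper)."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have Unicode category "Mn"; drop them, keep the rest.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


def drop_unusual_characters(text: str) -> str:
    """Keep letters, digits, whitespace and ASCII punctuation (assumed rule)."""
    return "".join(
        ch
        for ch in text
        if ch.isalnum() or ch.isspace() or ch in string.punctuation
    )


def clean(text: str) -> str:
    # Strip accents first so accented letters survive as plain base letters.
    return drop_unusual_characters(strip_diacritics(text))


if __name__ == "__main__":
    print(clean("Crème brûlée \ufffd naïve café"))
```

Note that letters without a canonical decomposition, such as "ø" or "ß", are left unchanged by this approach and would need an explicit replacement table if you want them mapped to ASCII.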
