Remove diacritics and other strange characters

Taita · September 16, 2016, 3:21pm

I want to remove diacritics at the document level (before tagging). Unfortunately, removing diacritics is part of string manipulation, so it can't be used on documents. Perhaps the replacer node would help, but then I need regular expressions.

Besides, the punctuation erasure node removes only the well-known punctuation marks, not the strange characters that are left over from converting the documents before I load them into KNIME. Is there a solution for this?

Additional question: after preprocessing the documents I want to read the results. Unfortunately, when I write the table containing the documents to disk, only an excerpt of the first part is visible. The same happens when I convert the document to a string. How can I read the processed documents?

kilian.thiel · September 20, 2016, 9:11am

Hi Taita,

There is no dedicated node to remove diacritics from documents / terms so far. The only way (without using the RegEx Filter) is to do the string manipulation before you create the documents from the strings.
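
To illustrate the kind of string cleanup meant here, below is a minimal sketch in plain Python using only the standard library. It is not KNIME-specific: the function names `strip_diacritics` and `drop_strange_characters` are made up for this example, and how you apply it to the string column before the documents are created is left open. Which characters count as "strange" is also an assumption; here it means control, format, private-use and unassigned code points.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove accents by dropping Unicode combining marks (hypothetical helper)."""
    # NFD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left into the usual NFC form.
    return unicodedata.normalize("NFC", without_marks)


# Assumption: these general categories are what "strange characters" means.
STRANGE_CATEGORIES = {"Cc", "Cf", "Co", "Cn"}


def drop_strange_characters(text: str) -> str:
    """Remove control/format/private-use/unassigned characters, keep newline and tab."""
    return "".join(
        ch
        for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in STRANGE_CATEGORIES
    )


if __name__ == "__main__":
    raw = "Cr\u00e8me br\u00fbl\u00e9e\u200b in \u00c5ngstr\u00f6m\x00"
    print(drop_strange_characters(strip_diacritics(raw)))
```

Note that letters without a Unicode decomposition, such as "ø" or "ß", are left unchanged by this approach and would need an explicit replacement table.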

Cheers, Kilian
