Is there a tokenizer specialized for the Arabic language? If not, how can I create one?
Hi @ahmed_gomaa -
Arabic tokenization is not currently supported in the KNIME Textprocessing extension. However, StanfordNLP has some tools for processing Arabic that might be interesting.
Having said that, you might also want to check out this blog post and associated workflow, where we deal with blending multiple languages. It makes use of the simple and whitespace tokenizers available in KNIME:
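To illustrate the whitespace approach: Arabic separates words with spaces, so a plain whitespace split already gives a usable baseline tokenization, much like KNIME's whitespace tokenizer. This is only a minimal sketch (the class and method names below are made up for the demo):

```java
import java.util.Arrays;
import java.util.List;

class WhitespaceDemo {
    // Split on runs of whitespace. This does not handle clitics or
    // punctuation attached to words -- that is what a dedicated Arabic
    // tokenizer (e.g. StanfordNLP's segmenter) would improve on.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Two word tokens for this Arabic sentence ("Hello world").
        System.out.println(tokenize("مرحبا بالعالم"));
    }
}
```

Note that this treats any punctuation glued to a word as part of the token, which is the main limitation of pure whitespace tokenization for Arabic (or any language).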
So, how can I use StanfordNLP for Arabic inside KNIME?
To implement your own tokenizer, you have to write your own extension based on KNIME's Textprocessing extension.
First steps are described in this blog post.
An example for a tokenizer implementation can be found here.
You have to implement the Tokenizer and TokenizerFactory interfaces.
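A rough sketch of what such an implementation could look like. The interface definitions below are simplified stand-ins so the example compiles on its own; the real interfaces ship with the Textprocessing extension and have additional methods, and all class names here are hypothetical:

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for the extension's interfaces (assumption:
// the real ones take a sentence and return its tokens, plus factory
// metadata; check the Textprocessing sources for the exact signatures).
interface Tokenizer {
    List<String> tokenize(String sentence);
}

interface TokenizerFactory {
    Tokenizer getTokenizer();
    String getTokenizerName();
}

// Placeholder implementation using a whitespace split; a real Arabic
// tokenizer would delegate to StanfordNLP's Arabic segmenter here.
class ArabicWhitespaceTokenizer implements Tokenizer {
    @Override
    public List<String> tokenize(String sentence) {
        return Arrays.asList(sentence.trim().split("\\s+"));
    }
}

class ArabicTokenizerFactory implements TokenizerFactory {
    @Override
    public Tokenizer getTokenizer() {
        return new ArabicWhitespaceTokenizer();
    }

    @Override
    public String getTokenizerName() {
        return "Arabic Whitespace Tokenizer";
    }
}
```

The factory is what you register with the extension point, so KNIME can list your tokenizer by name and instantiate it on demand.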
However, an Arabic Language Pack (based on StanfordNLP) for our Textprocessing extension is planned for next year.