POS Tagger - Trying to get rid of foreign languages

Me again ;)

I finally got my dataset of tweets to analyse, and it contains a fair number of foreign-language tweets which I need to get rid of. For my sample dataset it worked fine to run the POS tagger, then the POS filter, and filter out foreign words and other useless words.

But for my current dataset the POS tagger often tags foreign words as NN(POS) or similar, which I cannot filter out because those are also the tags for normal words. Does anybody know why this is happening and whether there is a tagger that might tag the tweets more appropriately?

Thanks in advance,


Hi Pepita,

One problem with tagging tweets using the regular POS tagger models behind the POS Tagger node or the Stanford Tagger node is that tweets use a "short" language compared to "proper" natural language. The models are usually trained on "proper" natural language, not on tweets (which also mix in words from different languages). As an alternative to the POS Tagger node, which uses an OpenNLP model, you can try the Stanford Tagger node and check whether its English models fit your texts better.

The Textprocessing extension is still using the Stanford NLP lib v3.1.4. There are plans to update to the latest release, but not for KNIME 2.10. There are models out there that have been trained on Twitter data, e.g. http://www.ark.cs.cmu.edu/TweetNLP/ or http://gate.ac.uk/wiki/twitter-postagger.html, but so far these models have not been integrated into KNIME. If you can't wait, you can of course integrate these models yourself by implementing your own tagger node. Here is a detailed description of how to do this: http://tech.knime.org/for-developers-integration-of-custom-tagger. If you have questions about that, feel free to ask them in the forum.
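If the goal is only to drop non-English tweets, a rough pre-filter that does not rely on POS tags at all might already help. The following is a minimal Python sketch of a stopword-ratio heuristic, not a KNIME node and not a proper language detector. The stopword list, the 0.2 threshold and all function names are assumptions made up for illustration.

```python
import re

# Tiny illustrative set of English function words (an assumption, not a
# curated resource); a real filter would use a fuller stopword list.
ENGLISH_STOPWORDS = {
    "the", "a", "an", "and", "or", "but", "is", "are", "was", "were",
    "i", "you", "he", "she", "it", "we", "they", "my", "your", "this",
    "that", "to", "of", "in", "on", "for", "with", "at", "not", "have",
    "has", "do", "be", "so", "just", "me",
}

# URLs, @mentions and #hashtags say little about the language of a tweet.
NOISE_RE = re.compile(r"https?://\S+|[@#]\w+")
# Runs of letters only (no digits, no underscores).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def english_ratio(tweet):
    """Return the fraction of word tokens that are English stopwords."""
    text = NOISE_RE.sub(" ", tweet.lower())
    tokens = TOKEN_RE.findall(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in ENGLISH_STOPWORDS)
    return hits / len(tokens)


def looks_english(tweet, threshold=0.2):
    """Heuristic: treat a tweet as English if enough tokens are stopwords."""
    return english_ratio(tweet) >= threshold


def filter_english(tweets, threshold=0.2):
    """Keep only the tweets that pass the stopword-ratio heuristic."""
    return [tweet for tweet in tweets if looks_english(tweet, threshold)]
```

Very short tweets without any function words (e.g. "Great game!") would be dropped by this heuristic, so the threshold and the word list would need tuning on your own data.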

Cheers, Kilian