Detecting POS in Spanish

peleitor · February 6, 2017, 4:34pm

I've been trying to generate POS tags for Spanish texts, and I've run into these issues:

1) When using Flat File Document Parser on UTF-8 encoded files in Spanish, setting Word Tokenizer = "Stanford NLP Spanish Tokenizer" produces a lot of "Could not parse file" errors. I couldn't find out the reason for this (note: this problem does not occur if you read a single file with File Reader).

Then I tried the closest available option to Spanish tokenization, which seems to be OpenNLP SimpleTokenizer.

2) To determine POS tags, I used POS Tagger, again with Stanford NLP Spanish Tokenizer. The results were very poor, though: even basic parts of speech such as verbs and nouns are not properly identified.

Is there any better way of doing this?

Regards

kilian.thiel · February 14, 2017, 5:29pm

Hi Fernando,

1) Does this mean that with the Spanish tokenizer model you get these errors with multiple files, but not with the simple tokenizer?

2) This could be caused by wrong tokenization beforehand. Have you tried tokenizing a single file with the Spanish tokenizer (Flat File Reader) and using the Spanish POS model on it? Are the results poor as well? Internally, the Stanford NLP POS models are used for POS tagging.

Cheers, Kilian

peleitor · February 15, 2017, 2:28pm

I've repeated the tests, and there actually seems to be a general issue with the Spanish Tokenizer. Please check the attached sample.

Regards,
Fernando

issues_with_spanish_tokenizer.doc

peleitor · February 15, 2017, 2:29pm

(attachment)

issues_with_spanish_tokenizer.doc

julian.bunzel · February 20, 2017, 2:01pm

Hey Fernando,

the SpanishTokenizer normalizes some words like "al" and returns "a" and "el". The problems mentioned above are related to this normalization, so we will turn this normalization off! Thanks for your feedback.
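To make the effect of this normalization concrete, here is a minimal, self-contained sketch. It does not use the Stanford or KNIME code; `ContractionSketch`, `tokenize`, and the `splitContractions` flag are hypothetical names made up for illustration, and the whitespace splitting is a deliberate simplification. It only shows how expanding contractions changes the token sequence so that it no longer matches the original text one-to-one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: not the Stanford SpanishTokenizer.
class ContractionSketch {
    // The two standard Spanish contractions and their expanded forms.
    private static final Map<String, String[]> CONTRACTIONS = Map.of(
        "al", new String[] {"a", "el"},
        "del", new String[] {"de", "el"});

    // Splits on whitespace; optionally expands contractions into two tokens.
    // Note: expanded tokens are emitted in lower case, so "Al" becomes "a", "el".
    static List<String> tokenize(String text, boolean splitContractions) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            String[] parts = splitContractions
                ? CONTRACTIONS.get(word.toLowerCase(Locale.ROOT))
                : null;
            if (parts != null) {
                tokens.addAll(Arrays.asList(parts));
            } else {
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Voy al mercado del pueblo";
        System.out.println(tokenize(text, true));   // [Voy, a, el, mercado, de, el, pueblo]
        System.out.println(tokenize(text, false));  // [Voy, al, mercado, del, pueblo]
    }
}
```

With the flag on, the five-word sentence yields seven tokens, two of which ("a", "el") never appear as words in the input. With the flag off, every token is a literal substring of the source text.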

Meanwhile, you could try to use the PTBTokenizer, since the SpanishTokenizer derives from the PTBTokenizer.

Unfortunately, there is currently no model for Spanish POS tagging. The "POS Tagger" node is for English texts only. The "Stanford" tagger contains some more models, for German, French and English.

Best regards,

Julian

peleitor · February 24, 2017, 11:35pm

Thanks Julian. I was wondering if you had any plans, or if it would make sense, to integrate well-known open source Spanish taggers. A very nice example can be found here. I believe this team was actually involved in building Spanish corpus annotations for CoNLL 2002.

Regards

kilian.thiel · February 27, 2017, 12:35pm

Thank you for the link. The number of languages supported by FreeLing is impressive. Unfortunately, it is a C++ library. To integrate it easily into KNIME, a Java library is required.

Cheers, Kilian

peleitor · March 4, 2017, 1:23am

I understand, but maybe it is worth giving it a try with wrapper methods and JNI.

Just an idea.
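As a rough sketch of what the Java side of such a wrapper could look like: everything named here is an assumption for illustration. `NativeTaggerSketch`, `tagNative`, and the library name `spanishtagger_jni` are hypothetical, and the C++ side that would implement the native function against the actual tagger is not shown.

```java
// Hypothetical JNI wrapper sketch; the native library named below does not exist.
class NativeTaggerSketch {
    // Try to load the (hypothetical) shared library once, at class initialization.
    private static final boolean AVAILABLE = tryLoad("spanishtagger_jni");

    private static boolean tryLoad(String libraryName) {
        try {
            System.loadLibrary(libraryName);
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            // Library not on java.library.path, or loading is not permitted.
            return false;
        }
    }

    // Declared in Java, implemented in C++ behind the JNI boundary (not shown).
    private static native String[] tagNative(String[] tokens);

    static boolean isAvailable() {
        return AVAILABLE;
    }

    // Returns one POS tag per token, or fails clearly if the library is missing.
    static String[] tag(String[] tokens) {
        if (!AVAILABLE) {
            throw new IllegalStateException("native tagger library is not loaded");
        }
        return tagNative(tokens);
    }

    public static void main(String[] args) {
        if (isAvailable()) {
            System.out.println(String.join(" ", tag(new String[] {"Voy", "al", "mercado"})));
        } else {
            System.out.println("native tagger library not available");
        }
    }
}
```

The guard around `System.loadLibrary` means a missing native library produces a clear error at call time instead of failing when the class is first loaded.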

Regards, Fernando