Stanford NL tagger is not working well with French?

Hi there,

I'm new to KNIME and I've run into a strange issue with the Stanford tagger.

I want to analyse a set of texts about the Chernobyl disaster.

I've fed the text corpus into the Stanford tagger node (with the French tagger selected), then used the Bag of Words Creator and Tag Filter nodes with FTB selected (French Treebank, I guess).

When I look at the output of the Stanford tagger node, a lot of very common nouns are marked as UNKNOWN. Sometimes the exact same verb is tagged differently.

Does anyone have an idea why this happens? What did I miss?

Any help with French text tagging is warmly welcome!

Thanks a lot

 

Hi kultissim,

The tags assigned when using the French model in the Stanford tagger are French Treebank (FTB) tags. Is your text proper French natural language? The model was trained on proper natural language and therefore works best on it.

Can you share some data included in a workflow?

Cheers, Kilian

Hello,

I have the same problem with Stanford tagger and French Tree Bank.
I am extracting words from a job offer (from Pôle emploi).

Many words are unknown:

What could be the problem?

Hey romain,

It seems that our FTB implementation does not match the FTB tag set used by the POS model.
It should definitely not look like that. I will have a closer look.

Cheers,

Julian

Hey again,

I had a closer look. The standard FTB tag set that we’ve implemented is not used by Stanford CoreNLP. They use the modified FTB set (Crabbé et al., 2008). I created a ticket to add this particular tag set.
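To illustrate the kind of mismatch described above, here is a minimal Python sketch. It is not the KNIME or CoreNLP code. The tag names in both sets are illustrative assumptions (a coarse set on the consumer side, finer-grained tags on the model side), and `resolve_tag` is a hypothetical helper. It only shows how a tag emitted by a model falls back to UNKNOWN when the consuming tag set does not contain it, and why the same verb can end up tagged differently.

```python
# Assumption: an illustrative coarse tag set known to the consuming side.
KNOWN_TAGS = {"A", "ADV", "C", "CL", "D", "ET", "I", "N", "P", "PRO", "V", "PONCT"}


def resolve_tag(model_tag, known_tags=KNOWN_TAGS):
    """Return the tag if the consumer knows it, otherwise the UNKNOWN fallback."""
    return model_tag if model_tag in known_tags else "UNKNOWN"


# Assumption: hypothetical model output that uses a finer-grained tag set.
tagged = [
    ("la", "DET"),
    ("catastrophe", "NC"),
    ("a", "V"),
    ("touché", "VPP"),
    ("Tchernobyl", "NPP"),
]

resolved = [(word, resolve_tag(tag)) for word, tag in tagged]

for word, tag in resolved:
    print(f"{word}\t{tag}")
```

In this sketch only the tag that happens to exist in both sets survives, so a verb tagged `V` keeps its tag while a participle tagged `VPP` becomes UNKNOWN, which matches the symptoms reported earlier in the thread.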

Thanks for the hint.
Cheers,

Julian