Stanford NL tagger is not working well with French?

kultissim · May 2, 2016, 3:52pm

Hi there,

I'm new to KNIME and I've run into a strange issue with the Stanford tagger.

I want to analyse a set of texts about the Chernobyl disaster.

I've fed the text corpus into the Stanford tagger node (with the French tagger selected), and I then use the Bag of Words Creator + Tag Filter with FTB selected (French Treebank, I guess).

When I look at the results of the Stanford tagger node, a lot of very common nouns are marked as UNKNOWN. Sometimes the exact same verb is tagged differently.

Does anyone have an idea why this happens? What did I miss?

Any help with French text tagging is warmly welcome!

Thanks a lot

kilian.thiel · May 9, 2016, 6:22pm

Hi kultissim,

The tags assigned when using the French model in the Stanford tagger are French Treebank (FTB) tags. Is your text proper French natural language? The model was trained on proper natural language and thus works best on it.

Can you share some data included in a workflow?

Cheers, Kilian

romain · May 29, 2018, 1:11pm

Hello,

I have the same problem with the Stanford tagger and the French Treebank tag set.
I am extracting words from a job offer (from Pôle emploi): job offer.

Many words are unknown:

What could be the problem?

julian.bunzel · May 29, 2018, 2:58pm

Hey romain,

It seems that our FTB implementation does not match the FTB tag set used by the POS model.
It should definitely not look like that. I will have a closer look.

Cheers,

Julian

julian.bunzel · May 30, 2018, 4:11pm

Hey again,

I had a closer look. The standard FTB tag set that we’ve implemented is not the one used by Stanford CoreNLP. They use the modified FTB tag set (Crabbé et al., 2008). I created a ticket to add this particular tag set.
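To illustrate how such a mismatch can produce UNKNOWN tags, here is a minimal Python sketch. It is not KNIME's or CoreNLP's actual code: `KNOWN_TAGS` is a hypothetical stand-in for the coarse tag set a downstream tool implements, `resolve_tag` is a made-up helper, and the tags in `tagged` are only illustrative examples in the style of the finer-grained modified set.

```python
# Hypothetical stand-in for the tag set the downstream tool knows about
# (coarse, FTB-style categories; NOT KNIME's actual list).
KNOWN_TAGS = {"A", "ADV", "C", "CL", "D", "ET", "I", "N", "P", "PRO", "V", "PONCT"}


def resolve_tag(model_tag, known_tags=KNOWN_TAGS):
    """Return the tag if the downstream set contains it, else 'UNKNOWN'."""
    return model_tag if model_tag in known_tags else "UNKNOWN"


# Illustrative tagger output using finer-grained tags (assumed for this sketch),
# e.g. NC for a common noun and VPP for a past participle.
tagged = [
    ("la", "DET"),
    ("centrale", "NC"),
    ("de", "P"),
    ("Tchernobyl", "NPP"),
    ("a", "V"),
    ("explosé", "VPP"),
]

# Only tags that happen to exist in both sets survive the lookup.
resolved = [(word, resolve_tag(tag)) for word, tag in tagged]

# Words whose tag was lost because the two tag sets do not line up.
unknown_words = [word for word, tag in resolved if tag == "UNKNOWN"]
```

In this sketch the common noun "centrale" ends up as UNKNOWN because `NC` is not in the coarse set, while "de" keeps its `P` tag because that label exists in both sets. Two forms of the same verb can also end up looking different: one keeps `V`, the other (`VPP`) falls back to UNKNOWN.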

Thanks for the hint.
Cheers,

Julian
