StanfordNLP NE Learner Input Format

Hi there,
I am lost in the input format of the document and dictionary of StanfordNLP NE Learner.

What am trying to do is to train my own NE learner and then use the tagger to performance prediction.

I have read some related posts from here:

and try the example:
08_Other_Analytics_Types/01_Text_Processing/14_NER_Tagger_Model_Training

In the example, the input dictionary contains bunch of person’s names and the Learner regards all the names appear in the document as Person. However, this might work for people’s name but if every word in the dictionary are regarded as the same tag, there might be a problem since Just because there might be ambiguous names.

For example, “White House” can be referred as “The US government” or “a exact location” depending on the context.

Appreciate any help with this problem. Thanks in advance.

Hao
Capture

1 Like

Hi @howellyu,

the StanfordNLP NE Learner can only learn models that tag entities with one tag, e.g. firstname, localtion, etc. For multiple tags, multiple models need to be trained.

The input of the learner node is 1. the corpus of documents containing also the NEs to learn 2. a dictionary of all NEs. These NEs will be tagged in the corpus and then the tagged corpus will be used for training the model.

To use the model the StanfordNLP NE Tagger node can be used.

Cheers, Kilian

Hi Kilian,

Thanks you very much! This clears a lot.

One thing to add, what is the correct input format for the dictionary without getting ambiguity? Could you give an example? Thank you!

Best,
Hao

Hi Hao,

The dictionary input is simply one string column. All strings in that input table and the selected column are used as NEs to be tagged in the corpus.

Cheers, Kilian