Stanford NLP - NE Tagger input data format

I tried the example project for the Stanford NLP NE tagger, but it seems very different from the process used with the Stanford API, either from the command line or in a Java application.

For example, with the actual API the training data is created by first tokenizing the text and then labeling each token, as follows:

My      O
name    O
is      O
John    PERS
G.      PERS
Smith   PERS
and     O
then    O
...
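
For reference, here is a minimal Python sketch of how such a two-column file could be written and read back. It assumes one token and one label per line, separated by a tab; the helper names `to_stanford_tsv` and `from_stanford_tsv` are made up for illustration, and the column layout a given Stanford setup expects depends on its training properties.

```python
def to_stanford_tsv(tagged_tokens):
    """Render (token, label) pairs as one 'token<TAB>label' line each."""
    return "\n".join(f"{token}\t{label}" for token, label in tagged_tokens)


def from_stanford_tsv(text):
    """Parse 'token<TAB>label' lines back into (token, label) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines (assumed to separate sentences/documents)
        token, label = line.split("\t")
        pairs.append((token, label))
    return pairs


# Sample data taken from the example above.
sample = [("My", "O"), ("name", "O"), ("is", "O"),
          ("John", "PERS"), ("G.", "PERS"), ("Smith", "PERS")]
tsv = to_stanford_tsv(sample)
round_trip = from_stanford_tsv(tsv)
```
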

But in the KNIME extension the input is completely different: it relies on the text and a separate table of names found in that text.

I’m not sure I understand. Why the difference?

Is it possible to make use of the training data in the above format?

Hey erotavlas,

The KNIME document structure is different from the Stanford one, so we built a wrapper around the Stanford API to work with our document structure.
To use the StanfordNLP NE Tagger node, you first have to convert your data to documents with the Strings To Documents node. This node also takes care of the tokenization; the tokenizer can be selected in the node dialog. Afterwards you can connect its output to the StanfordNLP NE Tagger, which tags all the documents based on the model selected in its node dialog. To get a Stanford-like view of the data, you can use the Bag of Words Creator node. It creates a new column containing all terms, with their tags, for each document:

DocumentColumn    TermColumn 
Document1         My[]
Document1         name[]
Document1         is[]
Document1         John[PERSON]
Document1         G.[PERSON]
Document1         Smith[PERSON]
Document1         .[]
Document2         Hello[]
....              ....
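
If the goal is to compare this output with Stanford-style training data, the term strings can be split back into token/label pairs. The sketch below assumes the terms are available as plain strings exactly as printed above (`John[PERSON]`, `My[]`), which may not match how a Term cell is actually exported; `term_to_pair` is a hypothetical helper, and mapping an empty tag to `O` is also an assumption.

```python
import re

# Matches 'token[TAG]' or 'token[]'; the tag part may be empty.
TERM_PATTERN = re.compile(r"^(?P<token>.*)\[(?P<tag>[^\[\]]*)\]$")


def term_to_pair(term, untagged_label="O"):
    """Split a 'token[TAG]' string into (token, label)."""
    match = TERM_PATTERN.match(term)
    if match is None:
        # No bracket suffix at all: treat the whole string as an untagged token.
        return term, untagged_label
    tag = match.group("tag")
    return match.group("token"), (tag or untagged_label)


# Terms copied from the Document1 rows above.
terms = ["My[]", "name[]", "is[]", "John[PERSON]",
         "G.[PERSON]", "Smith[PERSON]", ".[]"]
pairs = [term_to_pair(t) for t in terms]
```
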

I hope this helps.

Cheers,

Julian

OK, thanks, that helps a bit. So each row represents a token together with the string it came from (i.e. the document)?

What I really want to know is how to train and evaluate a custom model. So basically I need to create my own set of tagged tokens and train the model with it. Does this extension support creating your own model for evaluation? Or do I do that using the Stanford API directly outside of KNIME, and then reference that model inside the node dialog?

Hey,

There are three different nodes for training, evaluating and tagging: the StanfordNLP NE Learner, the StanfordNLP NE Scorer and the StanfordNLP NE Tagger.
The tagger can be used with a model built by the learner or with any built-in model that can be selected in the node dialog.
The scorer is used for evaluating a model that has been created by the learner node.

There is a basic model training and evaluation workflow on the example server.
To log in on the example server, right-click on EXAMPLES in the KNIME Explorer and click login.
You can find the workflow there:
08_Other_Analytics_Types/01_Text_Processing/14_NER_Tagger_Model_Training

Feel free to ask if anything is unclear.

Cheers,

Julian

Hi there,

I am still a little lost after reading the example, since I am not a very technical person.

I already have the training data in the correct format, but what should I feed it into after String To Documents? Should I feed it directly to the StanfordNLP NE Tagger or to the Learner? Or both? I am trying to picture the structure.

Thanks in advance.
Hao

Hey Hao,

It depends on what you want to do. Do you want to do a simple tagging task? Then you can just connect your document data table to the upper input port of the StanfordNLP Tagger node and select the model in the node dialog that fits your needs best.
Or do you want to train your own model? In that case, you have to feed the data to the StanfordNLP Learner, train the model, then feed the model to the StanfordNLP Tagger and tag whatever dataset you want.
There is also an example on the example server (double-click on EXAMPLES in the KNIME Explorer):
08_Other_Analytics_Types/01_Text_Processing/14_NER_Tagger_Model_Training

Cheers,

Julian

Hi Julian,

Thanks for the quick reply.

I am still a little lost, if you don’t mind. I am trying to train my own model, but I am not sure what format the lower input (the input dictionary) of the StanfordNLP NE Learner expects.

I read through example 14 that you provided, but I am not sure what the input dictionary should be if my data looks like this:

DocumentColumn TermColumn
Document1 My
Document1 name
Document1 is
Document1 John[PERSON]
Document1 G.[PERSON]
Document1 Smith[PERSON]
Document1 .
Document2 Hello

Thanks again,
Hao

Hey,

The dictionary table contains just one word (as a String) per row.
So, you feed your documents to the upper input port and the terms to the lower input port. The dictionary should only contain words that you want the model to treat as named entities. Based on the words in the dictionary and how they occur in the documents, the node will try to build a generalized model that also detects other words resembling the dictionary entries.

Upper port: Only documents -> one document per row.
Lower port: Only words (as Strings) -> one word per row.

Example:
You want to build a model that detects location names and your training data is:
Document1: Berlin is a beautiful place.
Document2: I would like to go to Washington and Tokio some day.

Then you would feed the table containing these documents to the upper port.
The table you feed to the dictionary port would look like this:

Berlin
Washington
Tokio
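
If the training data already exists as token/label pairs (as in the Stanford format earlier in the thread), the dictionary could be derived from it outside of KNIME and then read in as a one-column table. This is only a sketch: the `LOC` label, the `build_dictionary` helper and the hand-labeled token list are assumptions for illustration, not something the learner node prescribes.

```python
def build_dictionary(tagged_tokens, label):
    """Collect the unique tokens carrying `label`, in first-seen order."""
    seen = set()
    dictionary = []
    for token, token_label in tagged_tokens:
        if token_label == label and token not in seen:
            seen.add(token)
            dictionary.append(token)
    return dictionary


# Hand-labeled tokens for the two example documents above
# (the "LOC" label is a hypothetical name for the location tag).
tagged = [
    ("Berlin", "LOC"), ("is", "O"), ("a", "O"), ("beautiful", "O"),
    ("place", "O"), (".", "O"),
    ("I", "O"), ("would", "O"), ("like", "O"), ("to", "O"), ("go", "O"),
    ("to", "O"), ("Washington", "LOC"), ("and", "O"), ("Tokio", "LOC"),
    ("some", "O"), ("day", "O"), (".", "O"),
]

# One word per row, ready to be written out as the dictionary table.
dictionary_rows = build_dictionary(tagged, label="LOC")
```

Each entry of `dictionary_rows` would become one row of the table fed to the lower port, while the untokenized documents go to the upper port.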

I hope this helps.

Best,
Julian

Hi Julian,

Thank you so much!

Something to add: in your example, I believe every word that appears in the dictionary will be tagged as a location. But what should I do if I want to tag only specific occurrences of a word?

For example, “American” in “American Airline” and in “He is an American” should get different tags: the former is part of an organization name, while the latter probably gets no tag.

Is there any way to modify the dictionary, or something else, that would let me do that? I hope the example helps.

Thank you again!
Hao