This workflow shows how to train a model for named-entity recognition. The workflow starts with reading the file; in this case, each row represents a chapter of Julius Caesar's 'De Bello Gallico'. The first step is to create a document column with the 'Strings To Document' node. For clarity of the table, we filter out all columns except the document column. To create (and later validate) a model, we need two data sets, so the 'Partitioning' node splits the table into a training set and a test set.

The training set is then used to generate the NER model. The 'StanfordNLP NE Learner' node requires a dictionary; for this workflow we used a dictionary containing all the names occurring in our training set, so the model is built around the training set and the related names. Once the model has been generated, it can be used by the 'StanfordNLP NE Scorer' and the 'StanfordNLP NE Tagger' nodes.

The Scorer takes the test set and the model and validates the model. Internally, the test set is tagged twice: once by a dictionary tagger (using the same dictionary) and once by the Stanford tagger (our generated model). The Scorer then compares the results of both tagging processes and returns measurements such as precision, recall, and the number of true positives.

The 'StanfordNLP NE Tagger' node is used to tag the documents of the test set. After tagging, all terms without a PERSON tag are filtered out and a bag of words is created. To see which new names have been found and recognized by the model, the 'Reference Row Filter' node excludes all names that are already contained in the training dictionary.
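To make the partitioning and dictionary step more concrete, here is a minimal Python sketch (not part of the KNIME workflow) of the underlying idea: split the documents into a training and a test set, then keep only those names that actually occur in the training portion. The chapter placeholders, split ratio, and name list are invented for illustration.

```python
# Sketch of the partitioning + dictionary idea (outside KNIME).
# Chapters, names, and the 70/30 split are hypothetical examples.
import random

chapters = [f"chapter {i} text ..." for i in range(1, 21)]  # placeholder documents
random.seed(42)
random.shuffle(chapters)

split = int(0.7 * len(chapters))                 # e.g. 70% training, 30% test
train_set, test_set = chapters[:split], chapters[split:]

known_names = {"Caesar", "Ariovistus", "Labienus"}   # pretend master name list
train_dictionary = {name for name in known_names
                    if any(name in chapter for chapter in train_set)}
```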
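The comparison the Scorer performs can be illustrated with a small Python sketch as well: align the dictionary-based tags (treated as the reference) with the model's predictions token by token and count true positives, false positives, and false negatives. The token tags below are made up; this is only meant to show how precision and recall follow from those counts.

```python
# Sketch of how precision/recall can be derived from two tag sequences.
# "gold" stands in for the dictionary tagger, "predicted" for the trained model.
gold      = ["PERSON", "O", "O",      "PERSON", "O", "PERSON"]
predicted = ["PERSON", "O", "PERSON", "PERSON", "O", "O"]

tp = sum(1 for g, p in zip(gold, predicted) if g == "PERSON" and p == "PERSON")
fp = sum(1 for g, p in zip(gold, predicted) if g != "PERSON" and p == "PERSON")
fn = sum(1 for g, p in zip(gold, predicted) if g == "PERSON" and p != "PERSON")

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall    = tp / (tp + fn) if (tp + fn) else 0.0

print(f"TP={tp}, FP={fp}, FN={fn}")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```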
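Finally, the 'Reference Row Filter' step amounts to a set difference: names the model recognized minus names already present in the training dictionary. A tiny hypothetical sketch of that idea:

```python
# Keep only names the model found that are NOT in the training dictionary.
# Both name sets are invented for illustration.
training_dictionary = {"Caesar", "Ariovistus", "Labienus"}
tagged_persons      = {"Caesar", "Dumnorix", "Labienus", "Vercingetorix"}

new_names = tagged_persons - training_dictionary
print(sorted(new_names))   # names recognized by the model but absent from the dictionary
```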
This is a companion discussion topic for the original entry at https://kni.me/w/JCouWB2f8k6vld6-