I’ve had some success in training the StanfordNLP NE Learner to recognise person names in my dataset, but have also encountered an annoying limitation.
If the training dictionary that feeds into the Learner node contains names that ‘overlap’ or contain one another, like “Smith” and “J. Smith”, or “J. Smith” and “R. J. Smith”, the shorter term tends to override the longer one when the trained model is used to tag documents.
For example, if I include both “J. Smith” and “R. J. Smith” in the dictionary, the tagging results generally will NOT include “R. J. Smith” (or will include only a few instances). But if I omit “J. Smith” from the dictionary, then “R. J. Smith” will be tagged.
Similarly, if my dictionary contains several names ending in Smith (e.g. J. Smith, Robert Smith, R. B. Smith, etc.), but not the name ‘Smith’ on its own, the tagging results will include these names and various others that end in Smith. But instances of ‘Smith’ on their own are not tagged. If I add ‘Smith’ to the dictionary, it does get tagged on its own, but at the expense of the longer names that contain it: most of the names ending in Smith are no longer tagged, and those that are tend to be ones that were in the input dictionary. (In contrast, if I use the inbuilt model in the Stanford NER node, both the single- and multi-part variants of Smith are tagged.)
This behaviour is not affected by the order of terms in the input dictionary: it happens whether or not the shorter terms are listed first. Listing the short terms first is how I would usually deal with overlapping terms when using the Dictionary Tagger, but the tagging process within the NE Learner node does not appear to work the same way. Or perhaps the problem lies in some other aspect of the StanfordNLP model (I’ll admit I understand very little of its mechanics).
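Just to be clear about the outcome I’m after, here is a standalone sketch (plain Python with made-up names and text, nothing to do with KNIME or the Learner’s internals) of how I’d normally resolve overlapping dictionary terms: when two entries share a span, the longer one should win.

```python
import re

# Hypothetical overlapping dictionary entries and an invented sentence.
names = ["Smith", "J. Smith", "R. J. Smith"]
text = "The draft was written by R. J. Smith and later revised by Smith."

# Give longer entries priority: in a regex alternation the first matching
# alternative wins, so sorting longest-first lets "R. J. Smith" claim the
# span before "J. Smith" or "Smith" can.
pattern = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))

for match in re.finditer(pattern, text):
    print(match.group())  # -> "R. J. Smith", then "Smith"
```

That longest-match priority is effectively what I get from the Dictionary Tagger by ordering the terms, whereas the model produced by the NE Learner seems to favour the shorter term instead.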
Anyway, I’m sharing this in the hope that either A) this issue is specific to KNIME’s implementation of the NE Learner and can therefore be fixed; or B) the issue is inherent in StanfordNLP and someone can suggest a strategy for resolving it (perhaps by tweaking the Learner settings, most of which I don’t understand).
Thanks in advance.