The StanfordNLP NE Learner seems to have issues with overlapping names

I’ve had some success in training the StanfordNLP NE Learner to recognise person names in my dataset, but have also encountered an annoying limitation.

If the training dictionary that feeds into the Learner node contains names that ‘overlap’ or contain one another, like “Smith” and “J. Smith”, or “J. Smith” and “R. J. Smith”, the shorter term tends to override the longer one in the tagging results obtained by using the trained model.

For example, if I include both “J. Smith” and “R. J. Smith” in the dictionary, the tagging results generally will NOT include “R. J. Smith” (or will include only a few instances). But if I omit “J. Smith” from the dictionary, then “R. J. Smith” will be tagged.

Similarly, if my dictionary contains several names ending in Smith (e.g. J. Smith, Robert Smith, R. B. Smith, etc.), but not the name ‘Smith’ on its own, the tagging results will include these names and various others that end in Smith. But instances of ‘Smith’ on their own are not tagged. If I add ‘Smith’ to the dictionary, this name on its own gets tagged, but only at the expense of most of the longer names that contain it, like J. Smith. Most of the names ending in Smith are no longer tagged, and those that are tend to be ones that were in the input dictionary. (In contrast, if I use the inbuilt model in the Stanford NER node, both the single- and multi-part variants of Smith will be tagged.)

This behaviour is not affected by the order of terms in the input dictionary. It happens whether or not the shorter terms are listed first. Listing the short terms first is how I would usually deal with overlapping terms when using the Dictionary Tagger. But it appears that the tagging process within the NE Learner node does not work the same way. Or perhaps the problem lies in some other aspect of the StanfordNLP model that I do not understand (and I’ll admit that I understand very little of the mechanics of the model).

Anyway, I’m sharing this in the hope that either A) this issue is specific to Knime’s implementation of the NE Learner and can therefore be fixed; or B) this issue is inherent in StanfordNLP and someone can suggest a strategy for resolving it (perhaps by tweaking the Learner settings, most of which I do not understand).

Thanks in advance.

Hi @AngusVeitch1, I have little experience with the StanfordNLP NE Learner in Knime, but several years ago I used other packages (like Gate), as well as my own Python (NLTK) and R code, to recognise groups of words in text.
They all had trouble with overlapping names like these, and often chose only the shortest version from the dictionary for the tagging.
So I would be surprised if this were an issue specific to the Knime implementation of StanfordNLP.
The only way around it that I found was to write my own code and search the dictionary from the longest entities to the shortest. This can become quite time-consuming.
As always, it’s a trade-off between time and effort on one side and the quality of the end product on the other.
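
To illustrate what I mean by searching from long to short, here is a minimal Python sketch. It is not what the NE Learner or the Dictionary Tagger does internally, just a plain regex dictionary matcher; the function name `tag_longest_first` and the sample names are made up for the example.

```python
import re


def tag_longest_first(text, names):
    """Return non-overlapping (start, end, name) spans found in text.

    Hypothetical helper: when several dictionary entries could match at the
    same position, the longest entry wins.
    """
    # Sort longest first, so the regex alternation tries "R. J. Smith"
    # before "J. Smith" and "J. Smith" before "Smith".
    ordered = sorted(set(names), key=len, reverse=True)
    if not ordered:
        return []
    # The lookarounds stop a name from matching inside a longer word.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(n) for n in ordered) + r")(?!\w)"
    )
    # finditer never returns overlapping matches, so a longer name "uses up"
    # the shorter names it contains.
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


sample = "R. J. Smith met J. Smith and Smith."
dictionary = ["Smith", "J. Smith", "R. J. Smith"]
for start, end, name in tag_longest_first(sample, dictionary):
    print(start, end, name)
```

The order of the input dictionary does not matter here because the entries are re-sorted by length. Note that the regex still scans left to right, so a shorter entry that starts earlier in the text can block a longer one that overlaps it further along.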

Ah, dammit. That’s really helpful to know, @JanDuo , but what a shame that the StanfordNLP Learner has that particular quirk. It really undermines its value, and feels to me like something that could be fixed. I might have to get a bit creative to get the outcome I want.

Thanks again for sharing your experience.

The solution I’ve applied so far is to do certain processing steps on a string-type column, using String Manipulation with regex, then convert the processed string to document type and apply the remaining steps to the document column. It’s more work, but it also gives better control.

Hi @Geo , I don’t quite follow … but oddly enough, your reply does sound like an answer to a different question I might have asked somewhere in the forum!