I need to keep a mapping from each term in a document to the lemma produced for it by the Stanford Lemmatizer node. So far, I haven’t found a straightforward way to do this. The lemmatizer node creates a new document with the lemmas in place of the original terms, and the Bag Of Words Creator node groups equal terms into a single entry. As a result, I can’t match original terms to lemmas by their order in that node’s output table: several original terms might share the same lemma, which will then appear only once.
I intend to create a custom node that combines the two nodes above, applying lemmatization to a document’s POS-tagged terms and outputting two columns: the original terms (with their tags) and their corresponding lemmas (with the same tags). Before doing so, though, I would like to know whether anyone has an idea for achieving this with existing nodes.
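To make the problem concrete, here is a minimal Python sketch. It is only an illustration: `TOY_LEMMAS`, `toy_lemma`, and `bag_of_words` are hypothetical stand-ins, not the Stanford lemmatizer or the actual Bag Of Words Creator node, and the example terms are made up.

```python
# Toy stand-in for the lemmatizer: a plain lookup table, NOT the Stanford model.
TOY_LEMMAS = {"ran": "run", "running": "run", "runs": "run", "dogs": "dog"}


def toy_lemma(term):
    """Return the toy lemma for a term, or the term itself if unknown."""
    return TOY_LEMMAS.get(term, term)


def bag_of_words(terms):
    """Simplified model of a bag of words: equal terms collapse into one entry."""
    return list(dict.fromkeys(terms))


original = ["dogs", "ran", "running", "runs"]
lemmatized = [toy_lemma(t) for t in original]

original_bow = bag_of_words(original)   # four distinct original terms
lemma_bow = bag_of_words(lemmatized)    # only two distinct lemmas remain

# The two bags have different lengths, so rows cannot be paired by position.
print(original_bow)
print(lemma_bow)
```

Because "ran", "running", and "runs" all collapse to a single "run" entry, the position of a lemma in the second table says nothing about which original term it came from.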
This is an interesting question. I can’t think of a good way to do this with existing nodes (although maybe I am missing something, @julian.bunzel?), so creating a custom node seems like a good solution.
It certainly would be nice to have an easy way to see which terms have been lemmatized!
In the end, it was not necessary to create a custom node. The simple attached workflow calls the Stanford library directly to obtain the lemma associated with each input term.
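The attached workflow is not reproduced here, so the following is only a rough Python sketch of the idea: lemmatize each (term, tag) pair individually and keep the result next to the original term. The `lemma_table` helper and the `toy_lemma` lookup are hypothetical. In a real workflow the `lemmatize` argument would wrap the actual Stanford library call (for instance, something like the `lemma(word, tag)` method of CoreNLP's Java `Morphology` class, which is an assumption about what the workflow uses).

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical stand-in for the real lemmatizer call; NOT Stanford's output.
TOY_LEMMAS = {
    ("dogs", "NNS"): "dog",
    ("ran", "VBD"): "run",
    ("running", "VBG"): "run",
}


def toy_lemma(term: str, tag: str) -> str:
    """Look up a toy lemma for a POS-tagged term; fall back to the term itself."""
    return TOY_LEMMAS.get((term, tag), term)


def lemma_table(
    tagged_terms: Iterable[Tuple[str, str]],
    lemmatize: Callable[[str, str], str],
) -> List[Tuple[str, str, str]]:
    """Return one (term, tag, lemma) row per input term, keeping order and duplicates."""
    return [(term, tag, lemmatize(term, tag)) for term, tag in tagged_terms]


rows = lemma_table(
    [("dogs", "NNS"), ("ran", "VBD"), ("running", "VBG"), ("ran", "VBD")],
    toy_lemma,
)
for row in rows:
    print(row)
```

Since every input row produces exactly one output row, the term-to-lemma correspondence is preserved even when several terms share the same lemma.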
Thanks for the suggestion, @ScottF, but the solution seems too simple for a full-fledged component… The core of it is a single call to the Stanford NLP library.