Stanford Lemmatizer - Relating original terms with their lemmas

I need to keep a mapping from each term in a document to the lemma produced by the Stanford Lemmatizer node. So far, I haven’t found a straightforward way to do this. The lemmatizer node creates a new document with the lemmas in place of the original terms, and the Bag Of Words Creator node groups equal terms into a single entry. As a result, I can’t match original terms to lemmas by their order in that node’s output table: several original terms might share the same lemma, which appears only once.

I intend to create a custom node that will blend the two nodes above, applying lemmatization to a document’s POS-tagged terms and outputting two columns: the original terms (with their tags) and their corresponding lemmas (with the same tags). Before doing so, though, I would like to know whether anyone has an idea for achieving this with existing nodes.

Hi @mpenalver -

This is an interesting question. I can’t think of a good way to do this with existing nodes (although maybe I am missing something @julian.bunzel?) so creating a custom node seems like a good solution.

It certainly would be nice to have an easy way to see which terms have been lemmatized!

In the end, it was not necessary to create a custom node. The simple workflow attached calls the Stanford library directly to obtain the lemma associated with each input term.

relating_terms_with_their_lemmas.knwf (19.2 KB)
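
For readers who can't open the workflow, here is a rough sketch of the idea, not the workflow's actual code. It lemmatizes each (term, tag) pair on its own row, so two terms that share a lemma still stay on separate rows. The class `TermLemmaMapper`, the `Row` type and the `toyLemma` function are hypothetical names made up for illustration. `toyLemma` is only a stand-in so the example runs without extra libraries; in a real setup you would presumably pass in a call to the Stanford library instead (for example `Morphology.lemma(word, tag)` from `edu.stanford.nlp.process`, which is my assumption about a suitable call, since the thread doesn't show which one the workflow uses).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

// Hypothetical helper, not part of KNIME or Stanford CoreNLP.
final class TermLemmaMapper {

    /** One output row: the original term, its POS tag, and the lemma. */
    static final class Row {
        final String term;
        final String tag;
        final String lemma;

        Row(String term, String tag, String lemma) {
            this.term = term;
            this.tag = tag;
            this.lemma = lemma;
        }
    }

    /**
     * Lemmatizes every {term, tag} pair independently and keeps input order,
     * so each original term stays paired with its own lemma.
     */
    static List<Row> mapTerms(List<String[]> taggedTerms, BinaryOperator<String> lemmatizer) {
        List<Row> rows = new ArrayList<>();
        for (String[] pair : taggedTerms) {
            String term = pair[0];
            String tag = pair[1];
            rows.add(new Row(term, tag, lemmatizer.apply(term, tag)));
        }
        return rows;
    }

    /**
     * Toy stand-in for a real lemmatizer (assumption: a real one would be
     * plugged in here). It only strips a plural "s" and a gerund "ing".
     */
    static String toyLemma(String word, String tag) {
        String w = word.toLowerCase();
        if (tag.equals("NNS") && w.endsWith("s")) {
            return w.substring(0, w.length() - 1);
        }
        if (tag.equals("VBG") && w.endsWith("ing")) {
            return w.substring(0, w.length() - 3);
        }
        return w;
    }

    public static void main(String[] args) {
        List<String[]> input = Arrays.asList(
                new String[] {"cats", "NNS"},
                new String[] {"cat", "NN"},
                new String[] {"walking", "VBG"});
        // Prints one line per input term, e.g. "cats[NNS] -> cat[NNS]".
        for (Row r : mapTerms(input, TermLemmaMapper::toyLemma)) {
            System.out.println(r.term + "[" + r.tag + "] -> " + r.lemma + "[" + r.tag + "]");
        }
    }
}
```

Because the mapping is built row by row rather than from a bag of words, "cats" and "cat" both keep their own row even though they end up with the same lemma.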


@mpenalver Thanks for posting your solution. A short snippet saves the day! :slight_smile:

Maybe you would consider packaging this functionality as a component and posting it on the KNIME Hub?

Thanks for the suggestion, @ScottF, but the solution seems too simple for a full-fledged component… The core of it is a single call to the Stanford NLP library.
