Creating features from tags and word position

I need to implement rules like: a named entity (e.g. Abner's PROTEIN) is within +/- 2 words of certain verbs.

Is it possible to get the position of a named entity and POS tag in order to achieve this?




I'll try to better explain this issue.

KNIME comes with several native nodes for performing different tagging tasks, like POS tagging or named entity recognition. To use the identified tags or terms, you can use the Bag of Words node, which produces *terms* (not words) and their associated tags. However, this approach does not show which tag is associated with each *word*, nor the order of the tags (or words).

Therefore, if you want to extract features like 'POS tags +/- N words with respect to the actual word' (e.g. a word window), how can you do it?

For example, for 'That city was New York', I would like KNIME to produce an ordered list like:
      <DT, NN, VBD, NN>
(where the last NN would be a named entity).
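To make the desired feature concrete, here is a minimal sketch in plain Python, outside KNIME. It assumes the ordered tag list already exists (which is exactly what is being asked for) and only shows how a +/- N window could be built from it. The function name `pos_window` and the `<PAD>` marker are made up for this illustration.

```python
from typing import List


def pos_window(tags: List[str], index: int, n: int = 2, pad: str = "<PAD>") -> List[str]:
    """Return the tags from position index-n to index+n, padding past the sentence edges."""
    return [
        tags[i] if 0 <= i < len(tags) else pad
        for i in range(index - n, index + n + 1)
    ]


# Hand-written input for "That city was New York", with "New York" as one token.
tokens = ["That", "city", "was", "New York"]
tags = ["DT", "NN", "VBD", "NN"]

# One feature row per word: the word plus the tags of its neighbours (N = 1).
features = [(tok, pos_window(tags, i, n=1)) for i, tok in enumerate(tokens)]
```

Each entry of `features` pairs a word with the tags around it, so the window for "was" with N = 1 would hold the tags of "city", "was" and "New York".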

I don’t see the difference you are drawing between word and term. After all, the text representation is called a Bag of Words. Or are you hinting at New York being one word instead of two? Is your question then how to tag compound words such as New York?

A word is a single occurrence in the text, and a term stands for all occurrences of the same word. For example:

With "Time goes slowly from time to time" in a BOW, you'd get only one term "time", while the text has two such words (actually three: Time, time, time, but the first one is kept separate because it starts with a capital letter).
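The distinction can be sketched in a few lines of plain Python. This is only an illustration of the word/term idea using whitespace splitting, not a reproduction of what the Bag of Words node does internally.

```python
from collections import Counter

sentence = "Time goes slowly from time to time"

# Words: every occurrence, in order, so positions are preserved.
words = sentence.split()

# Terms: distinct strings with their counts; order and positions are lost.
term_counts = Counter(words)

# Positions of the word "time" in the sentence (0-based).
time_positions = [i for i, w in enumerate(words) if w == "time"]
```

The list `words` keeps the information needed for window features, while `term_counts` only says how often each term occurs.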

These are the "Term to String" results after BOW:


These are the POS tags (you can see this with Document Viewer):

Time [POS(NNP)] goes [POS(VBZ)] by [POS(IN)] from [POS(IN)] time [POS(NN)] to [POS(TO)] time [POS(NN)] . [POS(SYM)]

But how do you get the individual POS tag for each word (not term), into a structure that you can process? 
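As a rough illustration of the structure I have in mind, the sketch below parses a string in the `word [POS(TAG)]` layout shown above into an ordered list of (word, tag) pairs. It assumes the text has been copied out by hand in exactly that layout; it is not a KNIME export format or API, and the helper name `parse_tagged` is made up.

```python
import re
from typing import List, Tuple

# Matches "word [POS(TAG)]" with optional whitespace between word and bracket.
TAGGED_WORD = re.compile(r"(\S+?)\s*\[POS\(([^)]+)\)\]")


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Return (word, tag) pairs in the order they appear in the text."""
    return TAGGED_WORD.findall(text)


viewer_text = (
    "Time [POS(NNP)] goes [POS(VBZ)] by [POS(IN)] from [POS(IN)] "
    "time [POS(NN)] to [POS(TO)] time[POS(NN)] .[POS(SYM)]"
)

pairs = parse_tagged(viewer_text)
ordered_tags = [tag for _, tag in pairs]
```

With the pairs in sentence order, every word keeps its own tag and its position is simply its index in the list.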







I’m facing the same problem. Dictionary Tagger tells me whether the terms I’m interested in appear in, e.g., a sentence, but not where. So I’m unable to know how far apart they are. Access to the term number (its position) would be most appreciated.
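If the term positions were available, the distance rule from the original question would be easy to express. The sketch below is plain Python under that assumption: the sentence, the protein names, the verb list and the function `entities_near_verbs` are all invented for illustration, and nothing here uses a KNIME node or API.

```python
from typing import FrozenSet, List, Tuple


def entities_near_verbs(
    tagged: List[Tuple[str, str]],
    entity_tag: str = "PROTEIN",
    verbs: FrozenSet[str] = frozenset({"binds", "inhibits"}),
    max_dist: int = 2,
) -> List[Tuple[int, int]]:
    """Return (entity_index, verb_index) pairs at most max_dist words apart."""
    entity_idx = [i for i, (_, tag) in enumerate(tagged) if tag == entity_tag]
    verb_idx = [i for i, (word, _) in enumerate(tagged) if word.lower() in verbs]
    return [(e, v) for e in entity_idx for v in verb_idx if abs(e - v) <= max_dist]


# Made-up sentence with hand-assigned tags, one (word, tag) pair per position.
sentence = [
    ("ProtA", "PROTEIN"), ("strongly", "RB"), ("binds", "VBZ"),
    ("ProtB", "PROTEIN"), ("in", "IN"), ("cells", "NNS"),
]

within_two = entities_near_verbs(sentence)              # window of +/- 2 words
within_one = entities_near_verbs(sentence, max_dist=1)  # window of +/- 1 word
```

The only input this needs is the per-word position, which is the piece of information missing from the tagger output described above.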