I have a tagged Document for which I need to access each individual token and its related annotation (POS, NER, etc.).
I don't see how this is possible with KNIME named nodes, but I believe I could access Document (and tags) structure with Python (preferred) or Java scripting.
How can I do this? Where is documentation regarding Document and its tags?
Unfortunately, that's not possible: terms are only available as string types in the Java Snippet node, so you cannot retrieve or manipulate any tag information there.
If it suits you, you could use the Tags to String node to get a String column containing the tags, but I guess this is probably not a satisfactory workaround, since you can't merge the manipulated strings back onto the term column.
Well, it would exclude the use of the text processing nodes for direct text manipulation, but it would be interesting to see first what the objective of the analysis is… after all, there is the powerful string manipulation node with its regex implementation…
Suppose that you want to create window-based features.
E.g., if the sentence is: w1 w2 ... w(i) ... w(n)
then for word w(i), I'd like to get the POS features for w(i-1), w(i) and w(i+1).
The Document Viewer shows the POS labels, so there should be a way to read this information from the Document, but I could not find it.
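To make the request concrete, here is a minimal sketch in plain Python of the window-based features I have in mind. It assumes the POS tags have already been extracted from the Document as a list aligned with the words, which is exactly the step I cannot do yet. The function name `pos_window_features` and the `<PAD>` placeholder are my own illustrative choices, not anything from KNIME.

```python
from typing import Dict, List

# Hypothetical placeholder for positions that fall outside the sentence.
PAD = "<PAD>"


def pos_window_features(pos_tags: List[str], size: int = 1) -> List[Dict[str, str]]:
    """For each position i, collect the POS tags of w(i-size) .. w(i+size)."""
    n = len(pos_tags)
    features = []
    for i in range(n):
        row = {}
        for offset in range(-size, size + 1):
            j = i + offset
            # Use the padding symbol when the window runs past either end.
            row[f"pos[{offset:+d}]"] = pos_tags[j] if 0 <= j < n else PAD
        features.append(row)
    return features


# Example sentence with made-up tags, one tag per word.
words = ["The", "cat", "sleeps"]
tags = ["DT", "NN", "VBZ"]
for word, feats in zip(words, pos_window_features(tags)):
    print(word, feats)
```

With `size=1` each word gets three features: the tag before it, its own tag, and the tag after it.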
Thanks Julian. The problem with that workaround is that it refers to terms, not individual words. To work with window-based features, I need the tag(s) of each word.
Yes, I see what you mean, but absent a data structure that lets you keep the entire sentence structure (words vs. terms), my best guess is that, in their current state, the text processing nodes would not suit your needs.
You could still use Julian's workaround (BoW), convert tags to strings, and instead of using BoW to continue your analysis, you'd tokenize the documents yourself (after having converted them back to strings). Then join the tags with your tokenized words. Cell Splitter and Unpivot could be used for such tokenizing, or, as previously mentioned, R.
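The join step could be sketched roughly as follows in Python. This assumes you have already exported a term-to-tag lookup (for example from the Tags to String output) and tokenized the sentence yourself by whitespace; the function name `tag_tokens` and the dictionary format are hypothetical, and a term with several tags would need a richer structure than this.

```python
from typing import Dict, List, Optional, Tuple


def tag_tokens(tokens: List[str],
               term_tags: Dict[str, str]) -> List[Tuple[str, Optional[str]]]:
    """Give every token the tag of the longest term that matches at its position."""
    # Split terms into word tuples so multi-word terms can be matched too.
    split_terms = {tuple(term.split()): tag for term, tag in term_tags.items()}
    max_len = max([len(words) for words in split_terms] + [1])

    tagged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shrink the span.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + length])
            if candidate in split_terms:
                tag = split_terms[candidate]
                tagged.extend((tok, tag) for tok in candidate)
                i += length
                break
        else:
            # No term starts here, so the token stays untagged.
            tagged.append((tokens[i], None))
            i += 1
    return tagged


# Made-up example: one two-word term and one single-word term.
tokens = "Barack Obama visited Paris".split()
term_tags = {"Barack Obama": "PERSON", "Paris": "LOCATION"}
print(tag_tokens(tokens, term_tags))
```

Each word of a multi-word term inherits that term's tag, which gives you one tag per word to build the window features from.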
Documents contain all the sentence, term, word and tag information, and it can be accessed at each level. However, this requires Java coding: you would need to implement your own custom node. At the Java level you can use the Document class, which contains all of this information. You can iterate over sentences, terms and words, and also access their tag information. Is Java coding an option for you? I am happy to help and guide you through the Document class.