Accessing tagged Document structure

I have a tagged Document for which I need to access each individual token and its related annotation (POS, NER, etc.).

I don't see how to do this with KNIME's built-in nodes, but I believe I could access the Document structure (and its tags) with Python (preferred) or Java scripting.

How can I do this? Where is the documentation for Document and its tags?

Thanks

 

 

Do you have an example of such a document, so we can visualise its structure?

Hey peleitor,

Unfortunately it's not possible, since terms are only available as string types in the Java Snippet node. That means you cannot retrieve or manipulate any tag information via the Java Snippet node.

If it suits you, you could use the Tags to String node to get a String column containing the tags, but I guess this is probably not a satisfactory workaround, since you can't merge the manipulated strings back onto the term column.

Cheers,

Julian

Well, it would exclude the use of the text processing nodes for direct text manipulation, but it would be interesting to see first what the objective of the analysis is… after all, there is the powerful string manipulation node with its regex implementation…

Suppose that you want to create window-based features.

E.g., if the sentence is:    w1 w2 ... w(i) ... w(n)

then for word w(i), I'd like to get the POS features for w(i-1), w(i) and w(i+1).
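
To make the idea concrete, here is a minimal Python sketch of what I mean, assuming the (word, POS) pairs of one sentence were already available as a plain list. The function name `window_pos_features`, the `<PAD>` placeholder and the feature key names are made up for illustration; none of this is a KNIME API.

```python
from typing import Dict, List, Tuple

PAD = "<PAD>"  # hypothetical placeholder for positions outside the sentence


def window_pos_features(
    tagged: List[Tuple[str, str]], size: int = 1
) -> List[Dict[str, str]]:
    """Return one feature dict per word, holding the POS tags at offsets -size..+size."""
    tags = [pos for _, pos in tagged]
    rows = []
    for i, (word, _) in enumerate(tagged):
        feats = {"word": word}
        for off in range(-size, size + 1):
            j = i + off
            # Positions before the first or after the last word get the padding value.
            feats[f"pos[{off:+d}]"] = tags[j] if 0 <= j < len(tags) else PAD
        rows.append(feats)
    return rows


# Toy sentence with Penn Treebank style tags (example data, not from a real Document).
sentence = [("The", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]
features = window_pos_features(sentence)
```

With `size=1` each word gets three POS features: its own tag plus the tags of its left and right neighbours, padded at the sentence boundaries.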

 

The Document Viewer shows the POS labels, so there might be a way to read this information from the Document, but I could not find it.

 

Cheers,

Fernando

 

Have you checked for such functionality in R, using KNIME's R integration nodes (with packages such as NLP, qdap, openNLP, tm, tidytext, etc.)?

@Geo, thanks for the tip and links, I'll take it into consideration. Anyway, I still need to access the Document's per-word tag structure in KNIME.

 

Cheers,

Fernando

Thanks Julian. The problem with that workaround is that it refers to terms, not individual words. To work with window-based features, I need the tag(s) of each word.

Regards

 

Yes, I see what you mean, but absent a data structure that lets you keep the entire sentence structure (words vs. terms), my best guess is that, in their current state, the text processing nodes will not suit your needs.

You could still use Julian's workaround (BoW), convert tags to strings, and instead of using BoW to continue your analysis, you'd tokenize the documents yourself (after having converted them back to strings). Then join the tags with your tokenized words. Cell Splitter and Unpivot could be used for such tokenizing, or, as previously mentioned, R.
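
A rough Python sketch of that join, assuming you already have the document as a plain string and the term/tag pairs as a lookup table (for instance from the Tags to String output). The function name `join_tags`, the `"O"` default and the sample data are hypothetical. Note that a lookup by word string cannot tell apart two occurrences of the same word with different tags, which is a limit of this workaround.

```python
from typing import Dict, List, Tuple


def join_tags(
    text: str, term_tags: Dict[str, str], missing: str = "O"
) -> List[Tuple[str, str]]:
    """Tokenize on whitespace and attach to each token the tag of the term it belongs to."""
    # Explode multi-word terms so that every word inherits the tag of its term.
    word_tags: Dict[str, str] = {}
    for term, tag in term_tags.items():
        for word in term.split():
            word_tags.setdefault(word, tag)  # first tag seen for a word wins
    # Tokens keep their sentence order; untagged tokens get the `missing` value.
    return [(tok, word_tags.get(tok, missing)) for tok in text.split()]


# Example data: two tagged terms, one of them spanning two words.
term_tags = {"Alice": "PERSON", "New York": "LOCATION"}
pairs = join_tags("Alice visited New York", term_tags)
```

The result keeps the original word order, which is what a window-based approach needs, at the cost of losing any context-dependent tagging.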

Good point Geo, thanks. It's a pity KNIME nodes cannot handle this.

Regards

 

Well, I'd keep it more optimistic: for now, it does not ;-) unless Kilian has yet another idea.

I'm sure it will! Just a matter of time. For now, I will fill the gap with an external approach.

Cheers

 

 

Hi Peleitor,

Documents contain all the sentence, term, word and tag information, and it can be accessed on each level. However, this requires Java coding: you would need to implement your own custom node. On the Java level you can make use of the Document class, which contains all the information. You can iterate over sentences, terms and words, and also access their tag information. Is Java coding an option for you? I am happy to help and guide you through the Document class.
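
To illustrate the kind of traversal involved, here is a small sketch in Python (since that was your preferred language). The classes `Document`, `Sentence` and `Term` below are simplified stand-ins written for this example only; they are not the KNIME Java classes and do not reproduce their method names. The sketch also assumes tags are attached to terms, as discussed earlier in this thread.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Term:
    words: List[str]                      # a term may span several words
    tags: List[str] = field(default_factory=list)


@dataclass
class Sentence:
    terms: List[Term]


@dataclass
class Document:
    sentences: List[Sentence]


def iter_word_tags(doc: Document) -> Iterator[Tuple[int, str, Tuple[str, ...]]]:
    """Yield (sentence index, word, tags), each word inheriting the tags of its term."""
    for s_idx, sentence in enumerate(doc.sentences):
        for term in sentence.terms:
            for word in term.words:
                yield s_idx, word, tuple(term.tags)


# Toy document: one sentence, with "New York" as a single two-word term.
doc = Document([Sentence([
    Term(["New", "York"], ["LOCATION"]),
    Term(["is"], ["VBZ"]),
    Term(["big"], ["JJ"]),
])])
rows = list(iter_word_tags(doc))
```

Flattening the hierarchy this way gives one row per word, in sentence order, which is the shape needed for window-based features.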

Cheers, Kilian

 
