Accessing tagged Document structure

peleitor · April 1, 2017, 2:47pm

I have a tagged Document for which I need to access each individual token and its related annotation (POS, NER, etc.).

I don't see how this is possible with the existing KNIME nodes, but I believe I could access the Document (and tag) structure with Python (preferred) or Java scripting.

How can I do this? Where is the documentation for Document and its tags?

Thanks

Geo · April 2, 2017, 12:32pm

Do you have an example of such a document, to visualise its structure?

julian.bunzel · April 3, 2017, 12:27pm

Hey peleitor,

Unfortunately, that's not possible, since terms are only available as string types in the Java Snippet node. So there is no way to retrieve or manipulate any tag information via the Java Snippet node.

If it suits you, you could use the Tags to String node to get a String column containing the tags, but I guess this is probably not a satisfactory workaround, since you can't merge the manipulated strings back onto the term column.

Cheers,

Julian

Geo · April 3, 2017, 8:48pm

Well, it would exclude the use of the text processing nodes for direct text manipulation, but it would be interesting to see first what the objective of the analysis is… after all, there is the powerful string manipulation node with its regex implementation…

peleitor · April 4, 2017, 3:18am

Suppose that you want to create window-based features.

E.g., if the sentence is: w1 w2 ... w(i) ... w(n)

then for word w(i), I'd like to get POS features for w(i-1), w(i) and w(i+1).
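To make the target concrete, here is a minimal Python sketch of the window features, assuming the per-word POS tags were already available as a list of `(word, tag)` pairs (which is exactly the part that is missing in KNIME). The function name `window_pos_features`, the `<PAD>` placeholder and the feature key format are made up for illustration.

```python
from typing import Dict, List, Tuple

PAD = "<PAD>"  # hypothetical placeholder for positions outside the sentence


def window_pos_features(
    tagged: List[Tuple[str, str]], size: int = 1
) -> List[Dict[str, str]]:
    """Return one feature dict per word holding the POS tags of its window."""
    tags = [pos for _, pos in tagged]
    features = []
    for i, (word, _) in enumerate(tagged):
        row = {"word": word}
        # Offsets -size..+size cover w(i-size) through w(i+size).
        for offset in range(-size, size + 1):
            j = i + offset
            row[f"pos[{offset:+d}]"] = tags[j] if 0 <= j < len(tags) else PAD
        features.append(row)
    return features


if __name__ == "__main__":
    sentence = [("The", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]
    for row in window_pos_features(sentence):
        print(row)
```

With `size=1`, each row carries the tags for w(i-1), w(i) and w(i+1); words at the sentence boundary get the placeholder for the missing neighbour.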

When you use the Document Viewer, you can see the POS labels, so there should be a way to read this information from the Document. But I could not find it.

Cheers,

Fernando

Geo · April 5, 2017, 12:28am

Have you checked for such functionality in R, using KNIME's R integration nodes (with packages such as NLP, qdap, openNLP, tm, tidytext, etc.)?

peleitor · April 5, 2017, 3:36pm

@Geo, thanks for the tip and links, I'll take it into consideration. Still, I need to access the Document's per-word tag structure in KNIME.

Cheers,

Fernando

peleitor · April 5, 2017, 3:38pm

Thanks Julian. The problem with that workaround is that it refers to terms, not individual words. In order to work with window-based features, I need the tag(s) of each word.

Regards

Geo · April 5, 2017, 11:12pm

Yes, I see what you mean, but absent a data structure that would let you keep the entire sentence structure (words vs. terms), my best guess is that, in their current state, the text processing nodes would not suit your needs.

You could still use Julian's workaround (BoW), convert tags to strings, and instead of using BoW to continue your analysis, you'd tokenize the documents yourself (after having converted them back to strings). Then join the tags with your tokenized words. Cell Splitter and Unpivot could be used for such tokenizing, or, as previously mentioned, R.
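As a rough sketch of that join step in Python (for instance inside a Python scripting node): it assumes the term and tag columns have been exported as plain `(term, tag)` string pairs and that whitespace tokenization is good enough. The function names and the `UNKNOWN` marker are hypothetical. Note that a lookup by word string loses position, so a word that was tagged differently in different places ends up with several candidate tags.

```python
from typing import Dict, List, Set, Tuple


def build_tag_lookup(term_rows: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map every word of every term to the set of tags seen for that term."""
    lookup: Dict[str, Set[str]] = {}
    for term, tag in term_rows:
        # A term may span several words ("New York"); each word inherits its tag.
        for word in term.split():
            lookup.setdefault(word, set()).add(tag)
    return lookup


def tag_tokens(
    text: str, lookup: Dict[str, Set[str]], missing: str = "UNKNOWN"
) -> List[Tuple[str, List[str]]]:
    """Tokenize on whitespace and attach the candidate tags of each token."""
    return [
        (token, sorted(lookup[token]) if token in lookup else [missing])
        for token in text.split()
    ]


if __name__ == "__main__":
    rows = [("New York", "LOCATION"), ("visited", "VBD"), ("she", "PRP")]
    for token, tags in tag_tokens("she visited New York", build_tag_lookup(rows)):
        print(token, tags)
```

Because the tokens stay in sentence order, the result can feed a window-based feature step directly, as long as the ambiguity caveat above is acceptable.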

peleitor · April 5, 2017, 11:56pm

Good point Geo, thanks. It's a pity KNIME nodes cannot handle this.

Regards

Geo · April 6, 2017, 5:06pm

Well, I'd keep it more optimistic: for now, it does not ;-) unless Kilian has yet another idea.

peleitor · April 7, 2017, 2:55pm

I'm sure it will! Just a matter of time. For now, I will fill the gap with an external approach.

Cheers

kilian.thiel · May 16, 2017, 12:08pm

Hi Peleitor,

Documents contain all the sentence, term, word and tag information, and it can be accessed on each level. However, this requires Java coding: you would need to implement your own custom node. On the Java level you can make use of the Document class, which contains all of this information. You can iterate over sentences, terms and words, and also access their tag information. Is Java coding an option for you? I am happy to help and guide you through the Document class.
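To picture the hierarchy described here (document, sentences, terms, words, tags), the following is a small Python model of it. The classes are hypothetical stand-ins written for illustration only; they are not KNIME's Document class and do not mirror its method names.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Tag:
    tag_type: str  # e.g. "POS" or "NE" (illustrative values)
    value: str


@dataclass
class Term:
    words: List[str]
    tags: List[Tag] = field(default_factory=list)


@dataclass
class Sentence:
    terms: List[Term]


@dataclass
class Document:
    sentences: List[Sentence]


def iter_word_tags(doc: Document) -> Iterator[Tuple[int, str, List[str]]]:
    """Yield (sentence index, word, tags), copying each term's tags to its words."""
    for s_idx, sentence in enumerate(doc.sentences):
        for term in sentence.terms:
            values = [f"{t.tag_type}:{t.value}" for t in term.tags]
            for word in term.words:
                yield s_idx, word, values


if __name__ == "__main__":
    doc = Document([
        Sentence([
            Term(["She"], [Tag("POS", "PRP")]),
            Term(["visited"], [Tag("POS", "VBD")]),
            Term(["New", "York"], [Tag("NE", "LOCATION")]),
        ])
    ])
    for row in iter_word_tags(doc):
        print(row)
```

Walking the structure this way keeps sentence boundaries and word order, which is what window-based features need.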

Cheers, Kilian
