Retrieving sequences of POS tags

AngusVeitch · March 9, 2020, 5:03am

I am wondering if there is a convenient way to analyse sequences of words in terms of the POS tags rather than just their lexical meanings. For instance, I want to know how often certain grammatical constructions appear, such as “pronoun adverb” or “verb adjective”. Basically, I want something like the ngram creator, but with the option of analysing the tags instead of (or as well as) the terms.

Another approach would be to modify the document terms so that they include the POS tag as part of the word. Then I could use the existing ngram creator and separate the terms and tags myself using string manipulation rules.

Actually I am currently simulating the latter approach by splitting my text into sentences, tagging the POS, then using the bag of words output together with the dictionary replacer to append the POS tags to the terms (for example, ‘strange’ would become ‘strange_JJ’). But this method has obvious limitations, the main one being that you have to guess which POS tags match up with which instances if terms occur multiple times in a document. It also involves a lot of extra processing. It is not a proper solution.
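For illustration, here is a minimal sketch of the term_TAG idea outside KNIME, in plain Python. It assumes the tagger output is available as an ordered list of (term, tag) pairs; the sample sentence and the `merge` and `split` helper names are hypothetical. Because each tag is attached by position rather than looked up from a bag of words, a term that occurs twice keeps its own tag each time.

```python
# Hypothetical hand-tagged sentence; in practice the (term, tag) pairs would
# come from whatever POS tagger is in use, in document order.
tagged = [
    ("the", "DT"), ("strange", "JJ"), ("light", "NN"),
    ("will", "MD"), ("light", "VB"), ("the", "DT"), ("room", "NN"),
]


def merge(pairs, sep="_"):
    """Join each term with its own tag, position by position."""
    return [f"{term}{sep}{tag}" for term, tag in pairs]


def split(tokens, sep="_"):
    """Recover (term, tag) pairs; rsplit splits on the last separator only,
    so a separator inside the term itself is left alone."""
    return [tuple(token.rsplit(sep, 1)) for token in tokens]


merged = merge(tagged)
print(merged)          # both occurrences of "light" carry different tags
print(split(merged))   # round-trips back to the original pairs
```

The merged tokens could then be fed to any ordinary n-gram counter and split apart again afterwards.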

I note that another user asked a similar question a few years back here and here. That specific problem might now be solved with the term neighbourhood extractor, but that is only helpful if you are interested in a few specific terms rather than recurring generic patterns.

It seems a shame not to be able to get the full value out of the tagged text. Or is there a feature I have missed?

Thanks.

ScottF · March 10, 2020, 9:20pm

Hi @AngusVeitch -

I don’t think there is a catch-all node that will do what you’re looking for, at least right now. But you have some good ideas here that I can pass on.

I did take your idea of using the Term Neighborhood Extractor and played around with that a bit, directly after a POS Tagger.

I used a series of Tags To String and Column Rename nodes to extract tags, combine them together with a String Manipulation node, and then group / sort to look for commonly recurring patterns, and plot them:

I don’t know if this is useful in any way. It’s definitely extra processing in any case. But maybe it gives you some ideas? If you like I can clean up the workflow and post it.
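As a rough sketch of the same extract, combine, group and sort steps outside KNIME, the following Python counts n-grams of tags per sentence and keeps one example phrase for each pattern. This is not the workflow itself: the input format (one list of (term, tag) pairs per sentence), the sample data and the `tag_ngrams` name are all assumptions for illustration.

```python
from collections import Counter


def tag_ngrams(sentences, n=2):
    """Count n-grams of POS tags, without crossing sentence boundaries.

    Returns the counts plus the first term sequence seen for each pattern.
    """
    counts = Counter()
    examples = {}
    for sentence in sentences:
        for i in range(len(sentence) - n + 1):
            window = sentence[i:i + n]
            pattern = " ".join(tag for _, tag in window)
            counts[pattern] += 1
            # Keep only the first matching phrase as an example.
            examples.setdefault(pattern, " ".join(term for term, _ in window))
    return counts, examples


# Hypothetical tagged sentences standing in for real tagger output.
sentences = [
    [("she", "PRP"), ("quickly", "RB"), ("left", "VBD")],
    [("he", "PRP"), ("never", "RB"), ("seemed", "VBD"), ("strange", "JJ")],
]

counts, examples = tag_ngrams(sentences, n=2)
# Sort patterns from most to least frequent, as in the group / sort step.
for pattern, freq in counts.most_common():
    print(freq, pattern, "e.g.", examples[pattern])
```

Changing `n` gives longer patterns, and the counts can be passed to any plotting tool.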

AngusVeitch · March 10, 2020, 10:16pm

Thanks for that reply, @ScottF. I would be keen to see your workflow, so I’d be grateful if you could share it. And I’ll maintain some hope that one day we might be able to dig into the tagged text a little more easily.

Cheers.

ScottF · March 11, 2020, 4:14pm

Hi @AngusVeitch -

Here’s the workflow:

Cheers!
