Including document meta information in term co-occurrence counts

I've successfully used the Term co-occurrence counter node to analyse co-occurrence between terms within documents. What I'd also like to do is calculate co-occurrence between terms and document metadata, such as the author or source category.

I gather from my efforts thus far that the co-occurrence counter is not designed to do this: it seems to count co-occurrence either within the document body text or within the document title (I haven't actually tried within-title counting, but I assume that's what the title-level co-occurrence option is designed to count). I note that there is an option to skip meta information sections, but unchecking this doesn't have any apparent effect.

Is there any way to achieve what I am describing here? Basically, I want a higher level among the co-occurrence counter's options: one that considers the document text and the meta information together.

One option that occurs to me would be to artificially 'inject' the desired meta information into the document text itself. I suppose appending this information using the Groupby or String manipulation nodes while the documents are in string form could achieve this. This is far from ideal, though.
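To make the injection idea concrete, here is a minimal Python sketch of what the string manipulation would amount to. It runs outside KNIME, and the function name, the field names, and the `field__value` token format are all assumptions chosen for illustration. Whether a given tokenizer keeps such a token as a single term would need checking.

```python
def inject_metadata(text, metadata):
    """Append one pseudo-term per metadata field to the document text."""
    tags = []
    for field, value in sorted(metadata.items()):
        # Join multi-word values with underscores so each value stays one token,
        # and prefix the field name so the tag cannot collide with a real term.
        tags.append(f"{field}__{'_'.join(str(value).split())}")
    return text + " " + " ".join(tags)


doc = inject_metadata(
    "Rates rise as inflation persists",
    {"author": "Jane Doe", "category": "economy"},
)
print(doc)
```

After this step the metadata tags would be counted like any other term in the document, which is also why the approach is far from ideal: the tags end up in the text for every later processing step too.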

Another option would be to construct a different method of calculating co-occurrence that can incorporate additional data fields. But I don't know how to do this anywhere near as efficiently as the Term co-occurrence counter node does.
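For reference, the naive version of such a method is easy to write down. The Python sketch below counts, for each (term, metadata value) pair, the number of documents in which they appear together. The dictionary layout, the whitespace tokenization, and the function name are assumptions for illustration, and this is not how the KNIME node works internally.

```python
from collections import Counter


def term_meta_cooccurrence(docs, field):
    """Count documents in which a term co-occurs with a metadata value."""
    counts = Counter()
    for doc in docs:
        # Use a set so a term repeated within one document is counted once.
        terms = set(doc["text"].lower().split())
        for term in terms:
            counts[(term, doc[field])] += 1
    return counts


docs = [
    {"text": "rates rise again", "author": "Doe"},
    {"text": "rates fall", "author": "Doe"},
    {"text": "rates rise", "author": "Smith"},
]
counts = term_meta_cooccurrence(docs, "author")
print(counts[("rates", "Doe")])
```

This only covers document-level co-occurrence between terms and a single metadata field, so it does not replace the sentence-level or neighbour-level options of the node.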

Any suggestions would be appreciated.

Hi,

You are right: the Term co-occurrence counter is not designed to take meta information into account. Instead of injecting the meta info into the document text, you could try Frequent Item Set Mining. Think of the authors as items and the documents as transactions, then start with a low minimum support to get many frequent sets.
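As a rough illustration of the idea, the Python sketch below treats each document as a transaction. It assumes that the document's terms are added as items alongside the author (with an `author:` prefix to keep them apart), which goes slightly beyond the description above. It enumerates itemsets by brute force up to a small size rather than using Apriori or FP-growth, so it is only suitable for toy data; all names here are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return itemsets whose support (fraction of transactions) is >= min_support."""
    n = len(transactions)
    counts = Counter()
    for items in transactions:
        # Sort so the same itemset always yields the same tuple key.
        ordered = sorted(items)
        for size in range(1, max_size + 1):
            for itemset in combinations(ordered, size):
                counts[itemset] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


# One transaction per document: the author item plus the document's terms.
transactions = [
    {"author:Doe", "rates", "rise"},
    {"author:Doe", "rates", "fall"},
    {"author:Smith", "rates", "rise"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
for itemset, support in sorted(frequent.items()):
    print(itemset, round(support, 2))
```

Itemsets that contain both an `author:` item and a term are then the term-metadata co-occurrences of interest, and lowering `min_support` keeps more of them.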

Cheers, Kilian

That's a totally new technique to me! If I get around to trying it out and have any success I'll let you know. Thanks.
