Generate complete co-occurrence matrix

I want to generate co-occurrence data for terms in a large collection of documents, but I'm not sure that the term co-occurrence counter can provide what I need.

What I want is a complete co-occurrence matrix (or pivoted equivalent). I want to be able to index the co-occurrence of every term against every other. In matrix form, this would mean a full square (symmetrical) matrix. In pivoted form (which is actually what I want), every term would be listed n-1 times (n being the number of unique terms).
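To make the two layouts concrete, here is a minimal pandas sketch (outside KNIME, purely illustrative). The terms, the counts, and the column names `term_1`, `term_2` and `cooccurrence` are invented for the example.

```python
import pandas as pd

# Hypothetical terms and counts, for illustration only.
terms = ["apple", "banana", "cherry"]

# Matrix form: a full square, symmetric table of co-occurrence counts.
matrix = pd.DataFrame(
    [[0, 2, 1],
     [2, 0, 3],
     [1, 3, 0]],
    index=terms,
    columns=terms,
)

# Pivoted (long) form: one row per ordered pair of distinct terms.
long_form = (
    matrix.rename_axis("term_1")
    .reset_index()
    .melt(id_vars="term_1", var_name="term_2", value_name="cooccurrence")
)
# Drop the diagonal so that each term is listed n-1 times.
long_form = long_form[long_form["term_1"] != long_form["term_2"]].reset_index(drop=True)

print(long_form)
```

With n = 3 terms the long form has n × (n-1) = 6 rows, and each term appears twice in the `term_1` column.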

The term co-occurrence counter node does something along these lines, but as far as I can tell, it does not package the results in a way that can be readily converted to the complete format. The distance matrix calculate node creates just the format I want, but I do not want to calculate distances; I want to record co-occurrence.

Is there some way I can do this? I don't care whether my input data needs to be in the form of documents or a table of counts, as I can easily convert between the two. Or perhaps the solution is to manipulate the co-occurrence counter outputs in some way... but so far I can only picture very complicated solutions along these lines. I've also seen references to the Statistics node producing an occurrences table, but I don't know how to make it provide the co-occurrence information that I am after.

Any help would be appreciated!

I'm pretty sure I've solved it. I realised I could achieve the desired outcome by unpivoting my data (the rows were documents, the columns were terms), filtering out the rows (which were now term-document pairs) with zero counts, and joining the original pivoted columns. Grouping the joined table by term (collapsing the documents) yielded the completed matrix that I was looking for.

I think, anyway! At least it looks good so far.
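For anyone who wants to check the logic outside KNIME, the unpivot, filter, join and group steps described above can be sketched in pandas. This is only a rough equivalent under assumptions: the document-term table is made up, and the aggregation (counting the documents in which both terms appear) is one possible choice, since the post above does not say which aggregation was used.

```python
import pandas as pd

# Hypothetical document-term counts: rows are documents, columns are terms.
doc_term = pd.DataFrame(
    {"apple": [1, 0, 2], "banana": [1, 1, 0], "cherry": [0, 1, 1]},
    index=["doc1", "doc2", "doc3"],
)
wide = doc_term.rename_axis("document").reset_index()

# 1. Unpivot: one row per (document, term) pair.
pairs = wide.melt(id_vars="document", var_name="term", value_name="count")

# 2. Filter out pairs where the term does not occur in the document.
pairs = pairs[pairs["count"] > 0]

# 3. Join the original term columns back on, matching by document.
joined = pairs.merge(wide, on="document")

# 4. Group by term, collapsing the documents. Assumption: a co-occurrence is
#    counted once per document in which both terms are present.
presence = joined[doc_term.columns].gt(0).astype(int)
cooc = presence.groupby(joined["term"]).sum()

print(cooc)
```

In this sketch the diagonal holds each term's document frequency; it can be ignored or zeroed if only term-to-term pairs matter.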

Dear sugna,

I am trying to generate a matrix view of a table generated by Knime's term co-occurrence counter module ... I'd like it to be square (i.e. all unique terms displayed on both the X and Y axes).

I think I have to manipulate the term co-occurrence table in some way, but I'm having trouble for now. Is it possible for you to post a simple example workflow showing how you did it? Thank you!




You can create a full term co-occurrence matrix by simply using the Pivoting node after the Term Co-occurrence Counter node. Group by the term 1 column and use the term 2 column as the pivot. As the value, you can use the sum of document co-occurrences. Attached is an example workflow.
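The same pivot can be approximated in pandas as a rough illustration of what the Pivoting node does here. The column names `term_1`, `term_2` and `doc_cooccurrence` and the counts are placeholders, not the node's actual output schema. The mirroring step is an assumption that covers the case where each unordered pair is listed only once; skip it if both orderings are already present.

```python
import pandas as pd

# Hypothetical pair table; column names and counts are placeholders.
pairs = pd.DataFrame({
    "term_1": ["apple", "apple", "banana"],
    "term_2": ["banana", "cherry", "cherry"],
    "doc_cooccurrence": [2, 1, 3],
})

# Assumption: each unordered pair appears once, so add the swapped copy
# to fill both halves of the square matrix.
mirrored = pairs.rename(columns={"term_1": "term_2", "term_2": "term_1"})
both = pd.concat([pairs, mirrored], ignore_index=True)

# Group by term 1, pivot on term 2, and sum the document co-occurrences.
matrix = both.pivot_table(
    index="term_1",
    columns="term_2",
    values="doc_cooccurrence",
    aggfunc="sum",
    fill_value=0,
)

# Use the same term order on both axes so the result is square.
terms = sorted(set(both["term_1"]))
matrix = matrix.reindex(index=terms, columns=terms, fill_value=0)

print(matrix)
```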

Cheers, Kilian