Term co-occurrence count

“Statistics of document-wise co-occurrence may be collected in two different ways. In the first case, f_{ww′} = f_{w′w} is simply the number of documents that contain both w and w′. Alternatively, we may want to treat each instance of w′ in a document that contains an instance of w to be a co-occurrence event. Therefore if w′ appears three times in a document that contains two instances of w, the former method counts it as one co-occurrence, while the latter as six co-occurrences.”
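To make the two counting methods in the quote concrete, here is a minimal Python sketch. It is not how the KNIME node is implemented; the function names and the toy document are made up for illustration, and the second method multiplies the per-doc term frequencies because that is what reproduces the quote's worked example (2 × 3 = 6).

```python
from collections import Counter


def document_cooccurrence(docs, term_a, term_b):
    """Method 1: number of documents that contain both terms."""
    return sum(1 for tokens in docs if term_a in tokens and term_b in tokens)


def event_cooccurrence(docs, term_a, term_b):
    """Method 2: pair every instance of term_a with every instance of
    term_b inside the same document and sum over documents."""
    total = 0
    for tokens in docs:
        counts = Counter(tokens)  # missing terms count as 0
        total += counts[term_a] * counts[term_b]
    return total


# Hypothetical tokenized doc: "disease" appears twice, "compound" three times.
docs = [["disease", "compound", "compound", "disease", "compound"]]

print(document_cooccurrence(docs, "disease", "compound"))  # 1
print(event_cooccurrence(docs, "disease", "compound"))     # 6
```

The second number is the one I am after for a per-doc relevance metric.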

So far, I have counted the number of docs in which two terms co-occur, but now I want to count the co-occurrence events per doc as part of a doc-relevance metric. The Term Co-Occurrence Counter node, however, reports only two co-occurrences for the example above, as if the term that appears three times appeared only twice. This is a problematic result, as shown next.

Imagine a doc describing the mechanism by which a chemical compound could induce a disease. The name of the disease might appear just a couple of times, whereas the compound might be referred to on multiple occasions as its effect on various bodily tissues is described. In this case, the node’s undercount of disease-compound co-occurrences would not reflect the prominent role their relationship plays in the doc.

Hi @mpenalver -

So, if I’m understanding you correctly, you’d like the Term Co-Occurrence Counter node to take into account frequencies, in addition to just listing whether terms co-occur at a particular level?

I’m not a specialist in pharma or drug applications, but I wonder if this could be gleaned from some combination of the NGram Creator and TF nodes with a little finagling… I’ll have to think about that (maybe I’m oversimplifying).

Thank you, @ScottF. Term Co-Occurrence Counter does provide the count of co-occurrences, but not the value I would expect.

Let me give you a simple example. If term1 appears once in a doc and term2 appears twice, the node reports one co-occurrence of the two terms instead of two, i.e. the single instance of term1 co-occurring with each of the two instances of term2. I would like to understand the motivation behind that choice.
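To spell out which pairs I mean, here is a small Python sketch that lists them by token position. Again, this is only an illustration of the count I would expect, not a description of what the node does internally; the function name and the sample tokens are hypothetical.

```python
from itertools import product


def cooccurrence_pairs(tokens, term_a, term_b):
    """Return every (position of term_a, position of term_b) pair in one doc."""
    pos_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    return list(product(pos_a, pos_b))


# term1 appears once (position 0), term2 twice (positions 2 and 4).
tokens = ["term1", "affects", "term2", "and", "term2"]
pairs = cooccurrence_pairs(tokens, "term1", "term2")

print(pairs)       # [(0, 2), (0, 4)]
print(len(pairs))  # 2
```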