Hey everyone,

today I made an interesting observation about term frequencies in KNIME.

I am investigating a corpus of texts. Terms consisting of 2-5 capital letters, optionally followed by a hyphen and a 1-2 digit number, are selected with the RegEx //[A-Z]{2,5}[-]?[0-9]{0,2}// applied to a "Term to String" column using the row filter (with “case sensitive” checked). Before the filtering I run a number of preprocessing steps, “BoW” and “Term to String” among them.
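To make clear which strings the filter keeps, here is a minimal Python sketch of the pattern. It is not KNIME code: it assumes the row filter has to match the whole cell value (hence `fullmatch`), and the candidate terms are made up for illustration.

```python
import re

# Pattern from the post. fullmatch() mimics a filter that must match the
# entire cell value (an assumption about how the row filter applies it).
TERM_PATTERN = re.compile(r"[A-Z]{2,5}[-]?[0-9]{0,2}")

def keep_term(term: str) -> bool:
    """Return True if the whole term matches the pattern (case sensitive)."""
    return TERM_PATTERN.fullmatch(term) is not None

# Hypothetical example terms, not taken from the actual corpus.
candidates = ["DNA", "IL-6", "CD4", "HTTP", "Dna", "ABCDEF", "A1", "TNF-", "RNA-123"]
kept = [t for t in candidates if keep_term(t)]
print(kept)  # ['DNA', 'IL-6', 'CD4', 'HTTP', 'TNF-']
```

Note that the pattern also accepts a trailing hyphen without digits (`TNF-`), because both the hyphen and the digits are optional independently of each other.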

This works fine so far and I obtain a number of terms. Then I apply the TF node to compute absolute frequencies of these previously filtered terms, which leads to an interesting result:

**Some of the previously filtered terms have an absolute frequency of 0, although these terms originate from this very document (corpus).**

I can imagine getting a very low number when computing a **relative** frequency, which might eventually become 0 due to internal rounding errors (though even that would be satisfactory to my mind). However, I have no idea how the absolute frequency of a term that was previously extracted from the corpus can be 0. It must be in there somewhere, so the minimum **absolute** frequency should be 1, right? (Ultimately I’m looking at the term frequency as the sum of TF(abs) grouped by term.)
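My expectation can be sketched in plain Python as follows. This is not how the TF node is implemented: the two toy documents are invented, and splitting on whitespace is only an assumed stand-in for whatever tokenization KNIME applies.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
documents = {
    "doc1": "DNA and RNA were extracted , DNA was sequenced",
    "doc2": "IL-6 levels rose while DNA stayed stable",
}

def absolute_tf(text: str) -> Counter:
    """Absolute term frequency per document: raw occurrence counts.
    Whitespace tokenization is an assumption, not KNIME's tokenizer."""
    return Counter(text.split())

# TF(abs) for every (document, term) pair.
tf_abs = {doc_id: absolute_tf(text) for doc_id, text in documents.items()}

# Sum of TF(abs) grouped by term across all documents.
total = Counter()
for counts in tf_abs.values():
    total.update(counts)

print(total["DNA"])   # 3
print(total["IL-6"])  # 1
```

With exact string matching like this, every term taken from the corpus ends up with a summed count of at least 1, which is why a 0 surprises me.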

Is the observed 0-frequency due to the internal definition of a “term”?

Could it be that there are certain non-whitespace characters in the text that interfere with the TF(abs) calculation?

Does anyone have an explanation for this strange frequency observation? I'm thankful for any input.