Term frequency calculation issue - term and TF(abs) = 0

 

Hey everyone,

today I made an interesting observation about term frequencies in KNIME.
I am investigating a corpus of texts. Terms consisting of 2-5 capital letters with an optional 1-2 digit number attached are extracted by the RegEx //[A-Z]{2,5}[-]?[0-9]{0,2}// applied to a "Term to String" column using the Row Filter node. Before that I run a bunch of preprocessing steps, “BoW” and “Term to String” among them; the filtering is performed with “case sensitive” checked.
This works fine so far and I obtain a number of terms. Then I apply the TF node to compute absolute frequencies of these previously filtered terms, which leads to an interesting result:

some of the previously extracted terms have an absolute frequency of 0, although these terms originate from this document (corpus).
I could imagine getting a very low value when computing a relative frequency, which might eventually become 0 due to internal rounding errors (even this would be satisfactory to my mind). However, I have no idea how the absolute frequency of a term that was previously extracted from a corpus can be 0. It must be in there somewhere, so the minimum absolute frequency should be 1, right? (Ultimately I’m looking at the term frequency as the sum of TF(abs) grouped by term.)
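
To make the expectation concrete, here is a rough Python sketch of the logic (not KNIME code). It assumes plain whitespace tokenization and that the Row Filter matches the whole cell, both of which are my assumptions; the function name absolute_tf and the sample sentences are made up for illustration.

```python
import re
from collections import Counter

# Same pattern as in the Row Filter. fullmatch() mimics a whole-cell,
# case-sensitive match (an assumption about how the filter behaves).
PATTERN = re.compile(r"[A-Z]{2,5}[-]?[0-9]{0,2}")

def absolute_tf(documents):
    """Sum the absolute frequency of every matching term over all documents.

    Tokenization is a plain whitespace split, which is an assumption and
    may differ from the tokenizer used by the text processing nodes.
    """
    counts = Counter()
    for text in documents:
        for token in text.split():
            if PATTERN.fullmatch(token):
                counts[token] += 1
    return counts

# Made-up sample documents.
docs = [
    "The HPLC run used ABC-12 and HPLC again",
    "abc is lower case, XY is not",
]
tf = absolute_tf(docs)
print(tf)
```

Under these assumptions every term that shows up in the result was counted at least once, so a value of 0 cannot occur.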

Is the observed 0-frequency due to the internal definition of a “term”?
Could it be that there are certain non-whitespace characters in the text that interfere with the TF(abs) calculation?
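
One way to check the second question outside KNIME would be to export the affected terms and look for characters that are not plain printable ASCII. This is only a sketch in Python; the helper name suspicious_chars is hypothetical, and whether such characters actually influence the TF node is exactly what I don't know.

```python
import unicodedata

def suspicious_chars(term):
    """Return (code point, Unicode name) for every character in the term
    that is not a printable ASCII character other than the space."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in term
        if not (32 < ord(ch) < 127)
    ]

# A clean term versus one containing a no-break space (made-up examples).
print(suspicious_chars("ABC-12"))
print(suspicious_chars("ABC\u00a012"))
```

If the terms with a frequency of 0 return a non-empty list here while the others do not, hidden characters would be a plausible suspect.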

Does anyone have an explanation for this strange frequency observation? I'm thankful for any input.

Hi Tim,

indeed, every term contained in a document should have an absolute frequency of at least 1 by definition. Could you please explain your workflow in detail, or better, attach the workflow including a small subset of the data you are using (or artificial data)?

I tried to reproduce the problem using the attached workflow, but I was not able to create terms with a frequency of 0. What preprocessing steps are you using? What version of KNIME are you using?

Cheers, Kilian