Text Processing: TF Node counts incorrectly?

mwiegand · May 24, 2024, 7:16am

Hi,

Maybe I am making a mistake, as I am not that familiar with the text processing nodes, but while providing some help in:

I happened to notice a word count that is way too high when using the TF node. To be precise, the count is twice as high as expected.

Edit: Here is the test workflow

Best
Mike

k10shetty1 · May 28, 2024, 11:49am

Hi @mwiegand,

Thanks for bringing this up.

You’ve likely assigned the same column to both the ‘Title’ and ‘Full Text’ fields in the Strings to Document node. The TF node counts words in both of these fields. To double-check, use the Document Viewer node.

To fix this, either create an empty column named “fulltext” and assign it to Full Text in the Strings to Document node, or use the Math node to divide the term frequency by 2.
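To illustrate why the count doubles, here is a minimal Python sketch. It is not the TF node's actual implementation; the function `term_counts` and the whitespace tokenization are assumptions made for illustration only. It simply counts terms over the title and the full text together.

```python
from collections import Counter


def term_counts(title: str, full_text: str) -> Counter:
    """Count terms over title and full text together.

    Hypothetical stand-in for the behavior described above,
    using naive lowercase whitespace tokenization.
    """
    tokens = f"{title} {full_text}".lower().split()
    return Counter(tokens)


text = "the cat sat on the mat"

# Same column assigned to both Title and Full Text: every term is seen twice.
doubled = term_counts(title=text, full_text=text)

# Empty column assigned to Full Text: counts match the source text.
single = term_counts(title=text, full_text="")

print(doubled["the"], single["the"])
```

In this toy model every absolute count in `doubled` is exactly twice the count in `single`, which is why assigning an empty column or dividing by 2 both restore the expected numbers.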

Best,
Keerthan

mwiegand · June 1, 2024, 5:41am

Good morning @k10shetty1,

You are right, but doesn’t this raise the question of whether either the node’s description or its behavior is wrong:

Computes the relative term frequency (tf) of each term according to each document and adds

Apparently the title belongs to the document, so maybe a good compromise would be to add an option to the TF node to choose whether to count the whole document or only parts of it:

Title
Text
Meta: Whole, Source, Categories, Authors

Interestingly, the meta information is not taken into account by the TF node currently.
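As a rough sketch of what such an option could look like, here is a small Python example. Everything in it is hypothetical: the `Document` class, the `parts` parameter and the whitespace tokenization are made up for illustration and do not reflect how KNIME documents or the TF node are implemented.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical document with a title, a text body and meta information."""
    title: str = ""
    text: str = ""
    meta: dict = field(default_factory=dict)


def term_counts(doc: Document, parts=("title", "text")) -> Counter:
    """Count terms only in the selected parts of the document."""
    tokens = []
    for part in parts:
        if part == "meta":
            # Treat all meta values (source, categories, authors, ...) as text.
            content = " ".join(str(value) for value in doc.meta.values())
        else:
            content = getattr(doc, part)
        tokens += content.lower().split()
    return Counter(tokens)


doc = Document(title="cat facts", text="the cat sat", meta={"authors": "Mike"})

title_and_text = term_counts(doc)                      # title and text only
text_only = term_counts(doc, parts=("text",))          # title excluded
with_meta = term_counts(doc, parts=("title", "text", "meta"))

print(title_and_text["cat"], text_only["cat"], with_meta["mike"])
```

With the default selection, "cat" is counted in both the title and the text, while the meta information is ignored unless it is selected explicitly.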

I updated the test workflow for reference in case anyone stumbles across that subject.

Best
Mike