Absolute Term Frequency (TF) issue - Doubling count

gustavo.velho · April 6, 2016, 7:43pm

Hi all,

I've been using KNIME for the past couple of days for some text analysis on social media posts, mostly to identify frequent keywords and topics. I've been handling it pretty well, but I ran into an issue when using the TF node with absolute values. Although a document contains a specific word only once or twice, TF shows double that number: when the word occurs once it shows 2, when it occurs twice it shows 4, and so on. Any tips?

Thanks!

Gustavo Velho

Fab72 · April 18, 2016, 5:54pm

Dear all,
I'm experiencing the same problem as Gustavo.
Thanks for your feedback Kilian :-)

Fabrice Latchurie

Geo · April 19, 2016, 12:15am

You've probably assigned the same column to Title and Full Text (because the Strings To Document node forces you to choose a column for each).

The TF node counts the words in both Title and Full Text. You can see why using the Document Viewer node: double-click on any row.

A workaround is to create an empty column, named e.g. fulltext, before the Strings To Document node and assign it to Full Text in that node.

An alternative workaround is to divide the term frequency by 2 using the Math Formula node.
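To illustrate the effect outside KNIME, here is a minimal Python sketch. It is not KNIME's actual implementation: the `absolute_tf` helper and the simple whitespace tokenizer are assumptions made only for illustration. It counts terms over the title and the full text together, as described above.

```python
from collections import Counter


def absolute_tf(title: str, full_text: str) -> Counter:
    """Hypothetical helper: count each term across title and full text."""
    tokens = (title + " " + full_text).lower().split()
    return Counter(tokens)


post = "knime makes text mining easy"

# The same string is used for both Title and Full Text,
# so every term is seen twice and its count doubles.
doubled = absolute_tf(post, post)
print(doubled["knime"])
```

Here "knime" occurs once in the post, but the count comes out as 2 because the post is read once as the title and once as the full text.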
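As a rough comparison of the two workarounds, here is a small Python sketch (again not KNIME code; `absolute_tf` is a hypothetical stand-in for counting over Title plus Full Text). Note that halving only gives the right result when Title and Full Text hold exactly the same text.

```python
from collections import Counter


def absolute_tf(title: str, full_text: str) -> Counter:
    """Hypothetical helper: count each term across title and full text."""
    return Counter((title + " " + full_text).lower().split())


post = "good tool good docs"

# Workaround 1: assign an empty column to Full Text.
with_empty_fulltext = dict(absolute_tf(post, ""))

# Workaround 2: keep the duplicated column and halve every count.
halved = {term: count // 2 for term, count in absolute_tf(post, post).items()}

# Both give the true per-document counts when title == full text.
print(with_empty_fulltext == halved)
```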

kilian.thiel · April 21, 2016, 1:11pm

Geo raised exactly the point I was going to ask about. The title is also considered when counting words. Are you using the same string column for title and text?

Cheers, Kilian

Fab72 · May 9, 2016, 12:58pm

Many thanks for your comments!
You are right, it was due to the Title and the Full Text.

Fabrice

gustavo.velho · May 20, 2016, 8:23pm

Hi all,

Thanks for your help on this! I figured that out later, after I went on vacation (which is why I'm only getting back to the community now :) ), and recently I started playing with KNIME again.

Again, thanks! This tool is really awesome.

Gustavo Velho
