Absolute Term Frequency (TF) issue - Doubling count

Hi all,

I've been using KNIME for the past couple of days for some text analysis on social media posts, mostly to identify frequent keywords and topics. It's been going pretty well, but I ran into an issue when using the TF node with absolute values. Even when a document contains a specific word only once or twice, TF shows double that number: a word that appears once shows 2, a word that appears twice shows 4, and so on. Any tips?

Thanks!

Gustavo Velho

Dear all,
I experience the same problem as Gustavo.
Thanks for your feedback Kilian :-)

Fabrice Latchurie

You've probably allocated the same column to both Title and Full Text (because the Strings to Document node forces you to choose a column for each).

The TF node counts the words in both Title and Full Text. You can see why using the Document Viewer node: double-click on any line.

A workaround is to create an empty column before the Strings to Document node, named e.g. fulltext, and allocate it to Full Text in the Strings to Document node.

An alternative workaround is to divide the term frequency by 2 using a Math node.
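
To make the doubling concrete, here is a minimal Python sketch of the effect. It is not KNIME's implementation: the `absolute_tf` function and the sample post are made up for illustration, and it assumes plain whitespace tokenization. It only shows what happens when the same text is counted once as the title and once as the full text.

```python
from collections import Counter


def absolute_tf(title, full_text):
    """Hypothetical helper: count terms across title AND full text of one document."""
    tokens = (title + " " + full_text).lower().split()
    return Counter(tokens)


post = "great phone great price"

# Same column mapped to both Title and Full Text: every count is doubled.
doubled = absolute_tf(title=post, full_text=post)

# Workaround 1: map an empty column to Full Text instead.
fixed = absolute_tf(title=post, full_text="")

# Workaround 2: keep the doubled counts and halve them afterwards.
halved = {term: count // 2 for term, count in doubled.items()}

print(doubled["great"], fixed["great"], halved["great"])
```

Both workarounds give the same counts here. Halving is only safe when title and full text are exactly the same string for every document.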

Geo mentioned exactly what I was going to ask about. The title is also considered when counting words. Are you using the same string column for title and text?

Cheers, Kilian

Many thanks for your comments!
You are right, it was due to the Title and the Full Text.

Fabrice

Hi all,

Thanks for your help on this! I figured it out later, after I went on vacation (which is why I'm only getting back to the community now :) ), and I recently started playing with KNIME again.

Again, thanks! This tool is really awesome.

Gustavo Velho