I've been using Knime for the past couple of days for some text analysis on social media posts, mostly to identify frequent keywords and topics. It has been going pretty well, but I ran into an issue when using the TF node with absolute values. Even when a document contains a specific word only once or twice, TF shows double the number: when the word appears once, it shows 2; when it appears twice, it shows 4; and so on. Any tips?
You've probably assigned the same column to both Title and Full Text (because the Strings To Document node forces you to choose a column for each).
The TF node counts the words in both Title and Full Text. You can see this using the Document Viewer node: double-click on any line.
A workaround: before the Strings To Document node, create an empty column named e.g. fulltext and assign it to Full Text in the Strings To Document node.
An alternative workaround is to divide the term frequency by 2 using the Math Formula node.
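To make the doubling concrete, here is a rough Python sketch of the effect. It is not how KNIME implements the TF node; `absolute_tf` and the sample post are hypothetical, and the whitespace tokenization is a simplifying assumption. It only illustrates that counting over Title plus Full Text doubles every count when both hold the same string, and that either workaround brings the counts back.

```python
from collections import Counter


def absolute_tf(title: str, full_text: str) -> Counter:
    """Hypothetical helper: count terms over title and full text together.

    This is an illustration only, not KNIME's actual TF implementation.
    """
    tokens = (title + " " + full_text).lower().split()
    return Counter(tokens)


post = "great coffee and great service"

# Same column used for Title and Full Text: every count is doubled.
doubled = absolute_tf(title=post, full_text=post)

# Workaround 1: assign an empty column to Full Text.
fixed = absolute_tf(title=post, full_text="")

# Workaround 2: keep the doubled counts and divide them by 2.
halved = {term: count // 2 for term, count in doubled.items()}

print(doubled["great"], fixed["great"], halved["great"])
```

Under these assumptions both workarounds give the same counts. Dividing by 2 is only valid while Title and Full Text are exact copies of each other.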
Geo made exactly the point I was going to ask about. The title is also included when counting words. Are you using the same string column for title and text?
Thanks for your help on this! I figured that out later, after I went on vacation (which is why I'm only getting back to the community now :) ), and I recently started playing with Knime again.