When creating a BoW and vectorizing tweets, which frequency methods are preferred? There are some suggestions to use absolute term frequency (TF) for tweets, since the documents will have similar lengths. I'm also wondering whether a bit vector should be applied before partitioning and training a model to do classification.
Hi @Huseyin, sorry for the delayed reply here.
Your first question about frequency is an interesting one. I think the most common metric used in practice is actually TF-IDF, which contains information not just about frequency but also some measure of how “important” a particular term is. I think I would try both and see if there is a significant effect on your model results. Note that there is not an explicit TF-IDF node available in KNIME yet - you have to calculate the value using a Math Formula node, but that’s pretty easy to do.
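To make the TF-IDF calculation concrete, here is a minimal, tool-agnostic Python sketch of the standard formula (the toy tweets and function names are illustrative, not from KNIME or this thread):

```python
import math

# Toy corpus: each "document" is a short tweet, already tokenized.
docs = [
    ["great", "match", "today"],
    ["great", "weather"],
    ["match", "postponed", "today"],
]

def tf(term, doc):
    # Absolute term frequency within one document.
    return doc.count(term)

def idf(term, docs):
    # Inverse document frequency: log of (N / number of docs containing term).
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    # The weight combines local frequency with corpus-level "importance":
    # rare terms get a higher IDF, very common terms get one near zero.
    return tf(term, doc) * idf(term, docs)

# "great" appears in 2 of the 3 docs, so its IDF is log(3/2).
score = tf_idf("great", docs[0], docs)
```

This is exactly the kind of expression you can reproduce in a Math Formula node from the TF and document-frequency columns.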
Regardless of whether you use a bitvector or frequency in your term-document matrix, you definitely want to do this upstream of partitioning for classification.
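To illustrate the bitvector-vs-frequency distinction, a tiny Python sketch (the toy documents are my own) building both kinds of term-document matrix from the same corpus:

```python
# Two representations of the same corpus: raw counts vs. a bit vector (0/1).
docs = [
    ["good", "good", "game"],
    ["bad", "game"],
]
vocab = sorted({t for d in docs for t in d})  # ['bad', 'game', 'good']

count_matrix = [[d.count(t) for t in vocab] for d in docs]      # frequencies
bit_matrix   = [[1 if t in d else 0 for t in vocab] for d in docs]  # presence only
```

Building the matrix before partitioning, as recommended above, guarantees that the training and test partitions share the same vocabulary columns.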
Hope this helps!
Thank you for your reply! I tried both methods, with and without TF-IDF, but there was not much improvement in accuracy. I ended up using TF-IDF anyway, since it's a more "reliable" measure in that it gives more weight to infrequent terms.
My other question is about using the Snowball Stemmer in my preprocessing step. When I don't include stemming in my workflow, there is almost a 5% increase in my model accuracy (especially with SVM). The downside is increased training time caused by the larger feature space. So, is stemming necessary, or is it possible to do the analysis without it?
Hi @Huseyin -
Regarding stemming, it’s an optional treatment most often used to reduce complexity in the feature space, as you have noted. If your model run time is acceptable without it and your accuracy improves, there’s no need to implement it. Whether that’s a worthwhile tradeoff is for you to judge.
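To see why stemming shrinks the feature space, here is a toy Python sketch. The crude suffix stripper below is only a stand-in for the real Snowball algorithm, just to show the vocabulary collapsing:

```python
def toy_stem(token):
    # Very rough suffix stripping -- NOT Snowball, just an illustration
    # of how stemming merges inflected forms into one feature.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = ["play", "plays", "played", "playing", "win", "wins"]
raw_vocab     = set(tokens)                    # 6 distinct features
stemmed_vocab = {toy_stem(t) for t in tokens}  # collapses to 2 features
```

Fewer features means faster training, but merged forms can also blur distinctions the model was using, which is consistent with the accuracy drop you observed.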
On cross validation, you're usually going to implement that if you're not sure about how "stable" your model is - that is, whether its performance is roughly constant across folds. If it is, then you can be fairly sure regular partitioning is going to work fine. If it's not, you might want to look more closely at the distributions of key features in your dataset.
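As a rough illustration of what "stable across folds" means, here is a Python sketch with a simple fold splitter; the per-fold accuracies are made up for the example:

```python
import statistics

def k_fold_indices(n, k):
    # Split indices 0..n-1 into k contiguous folds (no shuffling, for brevity).
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# Hypothetical per-fold accuracies from two models:
stable   = [0.81, 0.80, 0.82, 0.81, 0.80]
unstable = [0.92, 0.70, 0.85, 0.60, 0.95]

# A small spread suggests a single train/test partition will behave similarly;
# a large spread is the cue to inspect your feature distributions.
spread_stable   = statistics.stdev(stable)
spread_unstable = statistics.stdev(unstable)
```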
About F1, you can read more at this older forum post, or at our blog:
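For quick reference, F1 is just the harmonic mean of precision and recall; a minimal Python sketch:

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f1_from_counts(tp, fp, fn):
    # Same score computed directly from true/false positives and false negatives.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return f1_score(precision, recall)
```

Because it is a harmonic mean, F1 is dragged down by whichever of precision or recall is weaker, which is why it's often preferred over plain accuracy for imbalanced classes.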
Hope that helps!
Thank you @ScottF for your help!