Hi, I am looking at the example workflow for lexicon-based sentiment analysis. I noticed that the cut-off in the Rule Engine node that classifies a score as positive is the average sentiment score rather than zero. Why shouldn't it be zero? And since I have only limited knowledge of sentiment analysis, is there a beginner-friendly reference you could suggest that explains the basics of this classification rule?
Secondly, in the example workflow given in KNIME, I believe the actual class has already been predetermined, and that this class column is extracted using the Category to Class node (if I understand it correctly). If I construct the confusion matrix from my own data, would it suffice for my manually classified dataset to comprise only a portion of the total number of documents? For example, if I want to study 1,000 online reviews, is it acceptable for my confusion-matrix data to cover only 10% of the whole study?
Thank you in advance!
I think the threshold you choose depends on what you want to achieve. In the example workflow, the goal may be to produce approximately equally sized groups based on sentiment (although using the median here instead of the mean might make more sense for that). A zero cut-off would make the prediction mean "more positive than negative words". However, there may be some bias in the corpus or in the lists of negative and positive words; for example, some of them might simply be more frequent in general. That is probably why the mean was chosen instead of zero.
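To see how the choice of cut-off changes the class balance, here is a minimal sketch with made-up sentiment scores (the values are hypothetical, skewed positive to mimic a biased lexicon or corpus); it is not the workflow's actual data:

```python
import statistics

# Hypothetical per-document sentiment scores, skewed toward positive
# to illustrate a biased lexicon or corpus.
scores = [0.9, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, -0.1, -0.3]

def classify(scores, cutoff):
    """Label each score 'positive' if it exceeds the cut-off."""
    return ["positive" if s > cutoff else "negative" for s in scores]

for name, cutoff in [("zero", 0.0),
                     ("mean", statistics.mean(scores)),
                     ("median", statistics.median(scores))]:
    labels = classify(scores, cutoff)
    print(f"{name:>6} cut-off {cutoff:+.3f}: "
          f"{labels.count('positive')} positive / "
          f"{labels.count('negative')} negative")
```

With these numbers, a zero cut-off labels 8 of 10 documents positive, while the mean or median cut-off splits them 5/5 — which is exactly the "equally sized groups" effect described above.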
In my opinion you could use a fraction of your data to construct the confusion matrix, but then, unless you use the whole test dataset for some other quality measures, why not shrink your test dataset and enlarge your training dataset instead? For example, if you use 500 of your 1,000 documents for training and only 100 for your confusion matrix, you are essentially throwing away 400 documents that could have made your model better.