I have a sentiment analysis project that has to categorize sentiments into three classes, say ‘positif’, ‘negatif’, and ‘netral’. At first, I tested the workflow with a small amount of data and it worked perfectly. Unfortunately, when I try to use the real data, the Bayesian predictor cannot classify the data into the right classes. It just categorizes all of the data as ‘negatif’. I am pretty sure that I have followed the right steps.
Here is the workflow that I made:
senti_akakom.knwf (55.6 KB)
And here is the data (in the ‘pre-processed’ sheet):
sentimen_akakom.xlsx (76.6 KB)
I have to use the Naive Bayes classifier and am not allowed to change the classification method.
Thank you in advance for the help!
Hi @ajengayu and welcome to the forum.
I downloaded your workflow to look at it - I didn’t have your stopword list so I just removed that node. I’m seeing the same behavior you are for Naive Bayes. One of our other data scientists pointed out that NB assumes independence between the features of your model, which in the case of text is never going to be true. So from the start you might expect some strange behavior.
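To spell out what that independence assumption means (this is the textbook form of the classifier, not a description of any specific node's internals): for a document with terms $t_1, \dots, t_n$, NB picks the class

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(t_i \mid c)
```

Each term contributes its own factor $P(t_i \mid c)$, as if it appeared independently of every other term once the class is known.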
A few things for you to consider:
- You’re using the English tokenizer in the Strings to Document node, but maybe the whitespace tokenizer makes more sense here.
- Your classes are fairly imbalanced, so you may want to look at methods to deal with that (e.g., SMOTE).
- Stratified sampling in your partitioning node is another thing to think about.
Using the 3 changes above, combined with a Random Forest Learner, I was able to generate a model with 80% accuracy, although I know you mentioned you are restricted to NB. Is there a reason why you absolutely must use NB?
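In case it helps to see what stratified sampling actually does, here is a minimal sketch in plain Python rather than with the partitioning node. The function name `stratified_split`, the 70/30 split, and the toy label counts are all made up for illustration; they are not taken from your spreadsheet.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, labels, train_fraction=0.7, seed=0):
    """Split rows into train/test so each class keeps roughly the same share."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append((row, label))

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        items = by_class[label]
        rng.shuffle(items)                       # shuffle within the class only
        cut = round(len(items) * train_fraction)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test


# Made-up, imbalanced toy labels (hypothetical, not the thread's data).
labels = ["negatif"] * 70 + ["netral"] * 20 + ["positif"] * 10
rows = [f"doc{i}" for i in range(len(labels))]

train, test = stratified_split(rows, labels)
train_counts = Counter(label for _, label in train)
test_counts = Counter(label for _, label in test)
print(train_counts)
print(test_counts)
```

The point is only that the split is done per class, so a rare class cannot end up almost entirely in one partition by chance.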
Sorry that I forgot to upload the stopword list and thank you for your response.
I have to use NB because it is for my thesis.
You are right, my data is imbalanced. I then tried the workflow with balanced data, but it gave the same kind of output (everything classified as ‘netral’). From this point, my questions are:
- Could I combine SMOTE with NB by applying SMOTE after the NB Learner node?
- Why does it give the same output even though I use balanced data?
I tried the whitespace tokenizer, SMOTE (after partitioning), and stratified sampling. The prediction is still ‘negatif’ for all of the test data.
First off, the suggestions I made above are really just nibbling around the edges of your problem. They are probably worth correcting since this is a thesis assignment, but aren’t going to fix things on their own. Sorry for not being more clear about that.
What is likely to be the big issue here arises from a few things - you have imbalanced classes, and you have a relatively small dataset. Because of that, it is very likely that you have multiple cases where a particular feature/term is only present in one class. Naive Bayes isn’t going to handle that well at all.
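To make the "term only present in one class" problem concrete, here is a small sketch of a generic multinomial Naive Bayes score in plain Python. I'm not claiming this is what the KNIME learner does internally; the toy documents, the helper name `class_score`, and the `alpha` smoothing parameter are all assumptions for illustration.

```python
from collections import Counter

# Made-up toy training documents (hypothetical, not the thread's data).
train = [
    (["bagus", "cepat"], "positif"),
    (["bagus", "ramah"], "positif"),
    (["lambat", "mahal"], "negatif"),
    (["lambat", "rusak"], "negatif"),
]
vocab = sorted({term for terms, _ in train for term in terms})


def class_score(doc, label, alpha):
    """Unnormalised P(label) * prod P(term | label) with add-alpha smoothing."""
    docs = [terms for terms, lab in train if lab == label]
    counts = Counter(term for terms in docs for term in terms)
    total = sum(counts.values())
    score = len(docs) / len(train)               # class prior
    for term in doc:
        score *= (counts[term] + alpha) / (total + alpha * len(vocab))
    return score


# Mostly "positif" words plus one word seen only in "negatif" documents.
doc = ["bagus", "cepat", "mahal"]
classes = ("positif", "negatif")

raw = {c: class_score(doc, c, alpha=0.0) for c in classes}
smoothed = {c: class_score(doc, c, alpha=1.0) for c in classes}
print(raw)       # every class hits a zero factor
print(smoothed)  # smoothing keeps the scores comparable
```

Without smoothing, a single term that never occurred in a class multiplies that class's score by zero, and with many class-exclusive terms every class can end up at zero. The more terms you feed in from a small corpus, the more often that happens.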
If you play around with the number of terms allowed as input to the model, you can see that the model is quite sensitive. For example, you are using a Row Filter to reduce the number of allowed terms to 329 from 1093 available in the corpus. If you filter this WAY down - I used a Top k Selector node to include only the top 10 most frequent terms - you can build a model that will predict something other than “negatif”. But increase much beyond that and you immediately run into problems.
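Outside of KNIME, the same "keep only the top k terms" idea looks roughly like the sketch below. This is plain Python, not the Top k Selector node; the helper names `top_k_terms` and `restrict` are made up, the toy documents are invented, and ranking by document frequency is my assumption about what "most frequent" means here.

```python
from collections import Counter


def top_k_terms(documents, k):
    """Return the k terms that occur in the largest number of documents."""
    doc_freq = Counter()
    for terms in documents:
        doc_freq.update(set(terms))      # count each term once per document
    return [term for term, _ in doc_freq.most_common(k)]


def restrict(documents, keep):
    """Drop every term that is not in the allowed set."""
    keep = set(keep)
    return [[term for term in terms if term in keep] for terms in documents]


# Made-up toy documents (hypothetical, not the thread's data).
docs = [
    ["bagus", "cepat"],
    ["bagus", "ramah"],
    ["lambat", "mahal"],
    ["lambat", "bagus"],
]

keep = top_k_terms(docs, k=2)
filtered = restrict(docs, keep)
print(keep)
print(filtered)
```

Rerunning the model while varying `k` is a quick way to see how sensitive the predictions are to the size of the term list.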
I would highly recommend using a different algorithm if at all possible. Playing around with XGBoost, I was able to get close to 90% accuracy (granted, without doing much about the class imbalance). Because of the nature of your data, I think NB is going to be challenging to get working.
I learned a lot from your explanation, and I will take the advice into consideration.
Thank you, @ScottF!