I’m currently working on a sentiment analysis project, where my goal is to perform sentiment analysis with ML on tweets with good acc score. I collected around 18 000 tweets and labeled them into four categories; positive, negative, neutral, and mixed. Basically a multiclass classification problem. But I have bad acc score with SVM, Decision Tree, and XGBoost. I tried using SMOTE for minority classes for my training data, but still does not perform well.
Does anyone have a good suggestion on what I can do to improve my scores or any tip regarding other tools? My acc is between 0.6-0.65
Thanks in advance!
With the amount of information given, it’s quite difficult to give any constructive advice. Here’s some questions for further reflection and some hints:
How’s the categories distributed?
What’s the baseline accuracy based on that distribution? What accuracy can you achieve by just predicting the most dominant class?
How does your confusion matrix currently look like?
“Neutral” and “mixed” are probably hard to classify – how’s the results if you limit to positive / negative only? Maybe you can reduce this to binary classification and use the classification probability to deduct neutral/mixed.
What preprocessing do you apply? Tweets are quite special, typical NLP preprocessing will likely not get you very far
Additionally to what you have, definitely have a look at the Palladian Text Classifier which has various simple preprocessing methods (char and word n-grams) and a simple Naive-Bayes-based scoring. It can get you quite far (especially compared to the tuning which you have to invest in conventional text classification workflows)
- You say you have a multi class setup? How do you use the ROC curve?
Thanks for your fast reply. I’m still new to ML and Knime, and not sure how I can answer some of your questions. I apply these preprocessing methods:
I followed the Kaggle US airlines workflow to create a predictive model for supervised machine learning.
When it comes to the ROC curve, it was from an earlier test that I forgot to delete from my workflow. Do you have any suggestions on what I can add something similar for multiclass classification for visualization?
I will try to limit only to positive, negative, and neutral. If I limit only to pos/neg my dataset will be highly unbalanced since the majority of the tweets are negative. I’m also wondering if there is an easy way to filter out mixed category from my csv, since I have a header named sentiment and the rows represent the classes.
I forgot to mention that almost 10 000 tweets are labeled as Negative, 1000 as Positive, and 5000 as Neutral. My acc with three classes, pos, neg and neutral
I tried two different workflows, the IMDB dataset workflow with ML and US airlines workflow, and replaced them with my own dataset. Since I work with tweets, which one should I follow? The preprocessing methods are different. Most workflows also use loops on the ML part, what is the purpose of this and does it have any benefits? One last thing, when inspecting the scorer, correct classified and wrong classified sections don’t add up with my dataset. Shouldn’t the sum of these be around 18 000? My data is highly imbalanced
Please post your workflow with data if you encounter errors like the numbers not adding up.
Tweets are popular for analysis so there are many examples on the KNIME Hub for them:
Thank you! This is my workflow and the confusion matrix for the DT
As you can see from the scorer node, the correct classified and wrong classified don’t add up with the total number of tweets I got on my CSV file.
For machine learning, there is a train/test paradigm. The results are shown only for your test set which is why the number is much smaller.
Thank you for your feedback. Now it makes more sense to me.
This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.