When I plug your RegEx into regex101.com I’m getting a syntax error, so maybe check there. (I’m not a RegEx pro by any means, though many others here are very good with it.) What do you intend your RegEx to filter out?
I did notice that starting at your Number Filter node, you begin to append a new column to your dataset, instead of replacing the existing Document column as you did before. This is going to cause problems when you get to the Bag of Words because only some of the preprocessing steps will have been applied to the column that you end up selecting. Carefully go through your preprocessing nodes and make sure the replace column option is being applied consistently and correctly.
When it comes to labeling, do I have to label some of my tweets manually? Or can I, for example, use sentiment140 or the Kaggle Airline Review dataset to train an ML model and then apply this trained model to new tweets, which in this case are my tweets? I also see that a lot of the examples use the Category to Class node. Is this necessary when working with supervised learning?
If you don’t have labels for the individual tweets, one approach is to apply the positive and negative dictionaries to the documents, and then calculate a score based on how many positive versus negative words show up.
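As a rough illustration of that dictionary approach outside of KNIME, here is a minimal Python sketch. The two word sets are tiny hypothetical stand-ins for real positive and negative lexicons, and the normalized score and the `threshold` parameter are assumptions for the example, not something taken from a specific workflow.

```python
import re

# Hypothetical mini-dictionaries; a real setup would load full lexicons.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "sad"}


def sentiment_score(text):
    """Return (pos - neg) / (pos + neg), or 0.0 if no dictionary word occurs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


def label(text, threshold=0.0):
    """Map the score to a coarse label; the threshold is an assumed tuning knob."""
    score = sentiment_score(text)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"
```

A score of 1.0 means only positive dictionary words were found, -1.0 means only negative ones, and 0.0 means a tie or no matches at all.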
If you do have labels, then storing them in the document early using the Category in the Strings to Document node is useful, so you can pull them back out later with the Category to Class node prior to implementing your classification algorithm.
Thanks for the information @ScottF, it was very helpful. Is it also possible to use the Amazon Comprehend Sentiment Analysis Node, which labels the tweets into positive, negative, neutral and mixed sentiments and then implement classification algorithm?
Well, the Comprehend service is essentially doing the classification for you, so there would be no need to implement another classification model afterwards. Also, note that while the node itself is free, use of the Comprehend service is not.
Excuse me for asking so many questions, but can I train my model using tweets that are labeled from let’s say Comprehend service and then test with new unlabeled tweets with the same topic?
I also saw an example where the labeling process was handled by the Java Snippet node. The node was programmed to specify three different categories with relevant keywords for each category. Is it possible to label tweets in this way? Sorry in advance for asking many questions as I’m very new to ML and especially text processing.
I suppose it’s possible to use Comprehend in this way, I just don’t understand why you would want to. If you train a model in KNIME on tweets labeled by Comprehend, you are building a model on the output of another model. Why not just use Comprehend on its own?
On your question about labeling using Java Snippets, I’d have to see the example you’re talking about, but it sounds fairly simplistic.
By definition, when using the unsupervised approach, you don’t have the ground truth for your data. As a result, the Scorer doesn’t have anything to work with, because it compares ground truth (in this case, sentiment labels) with model predictions to create the summary statistics.
If you are determined to calculate accuracy metrics, you may have to dedicate some time to manually labeling a subset of the tweets yourself. Sometimes that’s unavoidable.
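To make concrete what such a comparison needs, here is a small Python sketch of the idea (not the KNIME Scorer itself): given a manually labeled subset and the predictions for the same tweets, it counts agreements and builds a confusion table. The function name and the label strings are hypothetical.

```python
from collections import Counter


def score(truth, predicted):
    """Compare ground-truth labels with predictions.

    Returns (accuracy, confusion), where confusion maps
    (true_label, predicted_label) pairs to counts.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    confusion = Counter(zip(truth, predicted))
    correct = sum(n for (t, p), n in confusion.items() if t == p)
    accuracy = correct / len(truth) if truth else 0.0
    return accuracy, confusion
```

Without the `truth` list there is simply nothing to pass in, which is why the manual labeling step cannot be skipped if accuracy figures are required.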
Thanks again for the clarification. Could you help me with a problem in my workflow? I don’t understand why terms like “afghanistandisast” are not showing up correctly and are tagged with positive sentiment. I also have terms like “http” and square symbols included in the BOW. I have tried to use the RegEx filter, but without any luck. Almost every tweet is tagged with positive sentiment, when it should be the opposite, since most of the tweets are negative.
Hi again, I’m sending my workflow. I simply don’t know why URLs, non-English strings, emojis and some white square symbols are still present in the BOW even after using RegEx, or why some terms are tagged incorrectly. I think I tried dozens of RegEx formulas but nothing seems to work. Could you please help me with this issue? Text Processing.knwf (2.8 MB)
I’m not a RegEx expert, but the reason things might not be working as you expect is that you’re doing a lot of your filtering last in the series of nodes. For example, this might cause your RegEx to fail if it’s looking for http:// since it won’t find the slashes, because they were already removed by the Punctuation Erasure node. So a simple thing you could try would be to move your filtering nodes ahead in your workflow.
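The ordering effect is easy to reproduce outside of KNIME. The following Python sketch is only an analogy to the nodes, not their implementation: the URL pattern, the use of `string.punctuation`, and the sample tweet are all assumptions made for illustration.

```python
import re
import string

# Assumed URL pattern for the example; it relies on the "://" still being present.
URL_PATTERN = re.compile(r"https?://\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_urls(text):
    return URL_PATTERN.sub("", text)


def strip_punctuation(text):
    return text.translate(PUNCT_TABLE)


def clean_urls_first(text):
    # URL filter runs while the slashes still exist, so the pattern matches.
    return " ".join(strip_punctuation(strip_urls(text)).split())


def clean_punctuation_first(text):
    # Punctuation is erased first, so the URL collapses into one token
    # that the pattern can no longer recognise.
    return " ".join(strip_urls(strip_punctuation(text)).split())


tweet = "Flood update http://t.co/abc123 #disaster"
print(clean_urls_first(tweet))         # Flood update disaster
print(clean_punctuation_first(tweet))  # Flood update httptcoabc123 disaster
```

In the second ordering the leftover token starts with "http", which matches the kind of residue reported in the bag of words.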
Apart from that, if you want to remove hashtags, I think that would require a separate RegEx. If you want to remove emojis you could try this component from @takbb prior to converting your tweets to documents: String Emoji Filter. I’m not sure why you would expect non-english strings to be removed, since as far as I can tell your workflow doesn’t do that.
It might also be worth stepping back and looking at this from a “10,000 foot view” - if your main consideration is sentiment classification on unlabeled data, then a lot of the extra mess you have (non-English words, URLs, etc.) isn’t usually going to affect the results much, since you would primarily be counting words of a particular sentiment type anyway.
Yes, using RegEx earlier in the workflow gave better filtering results. And in fact, the presence of some URLs and non-English words should not make much difference to my goal of sentiment analysis. Thanks for your patience! I’m very grateful for your help and support.
Is there a way to increase the accuracy of the Naive Bayes classifier, or any other classifier, for imbalanced data? My accuracy is 62% with Decision Tree and 60% with SVM, but 1.2% with the NB classifier. Even though the SVM and DT accuracies are not so good either, they are somehow acceptable. I’m doing multi-class classification btw. This is my NB workflow: Naive Bayes Classifier.knwf (2.8 MB)
I tried using SMOTE on the training set, but got slightly worse accuracy with the XGBoost Classifier. Maybe my problem is caused by too small a dataset? I also have a problem with the Snowball Stemmer, since some of the words are not stemmed correctly. Also, is there an alternative to the ROC curve for multi-class classification? Sentiment Analysis.knwf (3.1 MB)