I am doing a sentiment analysis on tweets about Windows 10. I collected nearly 1.5 million tweets by tracking terms like 'win 10', 'win10', 'windows 10', etc. The problem is that I now have many non-relevant tweets dealing with promotional competitions or games, like "Recovered poacher's rifle #vote IAPF to win $500k" and so on. Does anybody have a general idea how to filter out such non-relevant tweets?
Thank you very much in advance for the help!
I would simply define a blacklist of spammy terms and/or hashtags, which you derive from analyzing your dataset.
If you have further metadata such as information about the author of each Tweet, you might also be able to exploit that.
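To make the blacklist idea concrete, here is a minimal sketch in plain Python (outside of KNIME). The blacklist terms and the helper names `is_spam` and `filter_tweets` are made up for illustration; the real terms would have to come from looking at your own data.

```python
# Hypothetical blacklist: example terms only, derive the real ones from your dataset.
BLACKLIST = {"#vote", "giveaway", "sweepstakes", "$"}


def is_spam(tweet, blacklist=BLACKLIST):
    """Return True if the tweet contains any blacklisted term (case-insensitive)."""
    text = tweet.lower()
    return any(term in text for term in blacklist)


def filter_tweets(tweets, blacklist=BLACKLIST):
    """Keep only the tweets that contain no blacklisted term."""
    return [t for t in tweets if not is_spam(t, blacklist)]


tweets = [
    "Recovered poacher's rifle #vote IAPF to win $500k",
    "Just upgraded to Windows 10, loving the new start menu",
]
relevant = filter_tweets(tweets)
print(relevant)
```

This is plain substring matching, so a short term can also match inside longer words. Keeping the terms fairly specific reduces the risk of throwing away relevant tweets.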
Thank you very much Philipp for your answer.
Which node would you use to implement such filtering based on a blacklist?
I've tried it with the Row Filter node by typing *$* in the pattern matching field with "contains wild cards" checked.
The result is that all tweets which contain $ are filtered out.
It works! :)
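For anyone who wants to try the same wildcard idea outside of KNIME, here is a rough Python sketch. It only imitates the behaviour described above using the standard `fnmatch` module and is not how the Row Filter node is implemented; the helper name `exclude_matching` is made up for illustration.

```python
from fnmatch import fnmatchcase


def exclude_matching(rows, pattern):
    """Drop every row that matches the wildcard pattern ('*' = any characters)."""
    return [row for row in rows if not fnmatchcase(row, pattern)]


rows = [
    "Recovered poacher's rifle #vote IAPF to win $500k",
    "Windows 10 runs fine on my laptop",
]
# '*$*' matches any text that contains a literal dollar sign.
kept = exclude_matching(rows, "*$*")
print(kept)
```

One thing to keep in mind: a pattern like *$* also removes genuine tweets that mention a price, so it may be worth checking a sample of the removed rows.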