Analysing Twitter Data - Sentiment Analysis and Frequencies

Hello :)

First of all, I want to apologise for possibly stupid questions. I have no clue about programming and the like, but I have to do a project in business informatics about text mining, and I have been sitting in front of KNIME for weeks now trying to find a fitting workflow.

We want to analyse Twitter data on the World Cup and its sponsors. In particular, we would like to analyse whether the World Cup has an effect on the sponsors and their Twitter representation, by comparing the number and the sentiment of tweets about / mentioning the sponsors (Coca-Cola and adidas) from February till June 2014 (before the World Cup) with February till June 2013.

Secondly, we want to analyse who does it better (adidas or Coca-Cola), as in who gets more tweets relating the company to the World Cup, i.e. tweets mentioning their respective World Cup hashtag (#allin for adidas, #worldscup for Coca-Cola), or mentioning some World Cup hashtag while referring to the sponsor (e.g. a tweet including @adidasfootball and #WorldCup2014).

Now I get stuck at all the little things. First of all, the data, but I hope to get that through my supervisor at university. For now I am working on the "bundesliga" dataset provided via the KNIME example workflow for the Palladian websearcher.

Then I would like to define all the (for us interesting) hashtags and mentions as unmodifiable named entities (including the symbols; I'd like the tag cloud to show @adidasfootball etc. in the end). I'd do that with the Dictionary Tagger. However, looking at the Bag of Words, KNIME sometimes splits the hashtag symbol from the following word and sometimes doesn't, thus creating different terms I'd have to tag separately, which I do not want. How can I prevent this?
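To make clear what I'm after: a Twitter-aware tokenizer would keep the # or @ attached to the word. Outside of KNIME, a minimal Python sketch of the behaviour I want would look something like this (the pattern is just my own guess at how this could work):

```python
import re

# Keep #hashtags and @mentions as single tokens instead of
# splitting the symbol off (this is what I want the BoW to do).
TOKEN_PATTERN = re.compile(
    r"[#@]\w+"      # hashtags and mentions stay intact
    r"|[A-Za-z]+"   # ordinary words
)

def tokenize(tweet):
    return TOKEN_PATTERN.findall(tweet)

print(tokenize("Great match! @adidasfootball #allin #WorldCup2014"))
# ['Great', 'match', '@adidasfootball', '#allin', '#WorldCup2014']
```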

Secondly, I need to get rid of the URLs, which is a bit tricky as the BoW creator splits the "http" from the rest. I worked around this with two replacer nodes: one replacing "http" with "" (nothing), and one replacing the rest by substituting "://[^\s]+" with "" (I googled that and it worked all right). This seems rather unprofessional to me; is there a better way to do this?
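For the record, I think my two replacer steps collapse into a single pattern that matches the whole URL at once, applied before the BoW creator ever splits it. A small Python sketch of the idea (the example URL is made up):

```python
import re

# One pattern instead of two replacer nodes: matches the whole URL,
# leading http/https included, up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")

def strip_urls(tweet):
    return URL_PATTERN.sub("", tweet)

print(strip_urls("Goal! http://t.co/abc123 what a match"))
# 'Goal!  what a match'
```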

Then I need to get rid of dates, which in the sample dataset I am working on right now look like "2013-11-19t06:09:44.0" (these are also the document titles of the different tweets). Again, sometimes in the BoW output table they are split into two terms and sometimes they are not, which makes it difficult to exclude all of them with a single regular expression (if that is the right name for what I did with the URLs). I also worked around this with two replacers (as it seems the date is either kept as one term or split into at most two), but for the larger dataset I will be facing, this seems tedious.
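Again, as a sketch of what I mean: a single pattern matching the full timestamp, applied to the raw text before tokenisation, should catch it in one go regardless of how the BoW would later split it (the pattern is shaped after my sample data, so it may need adjusting):

```python
import re

# Matches timestamps shaped like the ones in my sample data,
# e.g. "2013-11-19t06:09:44.0", before any tokenizer splits them.
TIMESTAMP_PATTERN = re.compile(
    r"\d{4}-\d{2}-\d{2}[tT]\d{2}:\d{2}:\d{2}(?:\.\d+)?"
)

def strip_timestamps(text):
    return TIMESTAMP_PATTERN.sub("", text)

print(strip_timestamps("2013-11-19t06:09:44.0 some tweet text"))
# ' some tweet text'
```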

Also, after applying a few standard preprocessing nodes, the tag cloud is huge, with a lot of words that don't exist. I assume this might make a sentiment analysis difficult. I know there are a lot of abbreviations and half sentences and such on Twitter, so do you have any suggestion for a good way to tackle this problem?
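One idea I had, and please correct me if this is naive: most of the non-existent words are one-off typos and fragments, so a minimum-frequency cut-off should remove most of them. Sketched in Python (the threshold of 5 is a made-up number I would have to tune):

```python
from collections import Counter

MIN_FREQ = 5  # hypothetical cut-off, needs tuning on the real data

def frequent_terms(tokenized_tweets):
    """Keep only terms occurring at least MIN_FREQ times overall;
    one-off typos and mangled fragments mostly fall below this."""
    counts = Counter(t for tweet in tokenized_tweets for t in tweet)
    return {term for term, n in counts.items() if n >= MIN_FREQ}
```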

In particular with regards to defining positive and negative terms for the sentiment analysis, I reckon I need to get the super messy Twitter data clean somehow, but I am out of ideas.
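What I picture for the sentiment part is a simple dictionary approach: count positive minus negative terms per tweet. A toy Python sketch (the word lists here are made up; I would of course load a published sentiment lexicon instead):

```python
# Hypothetical mini-lexicons; in practice one would load a published
# sentiment word list instead of typing words by hand.
POSITIVE = {"great", "love", "win", "amazing"}
NEGATIVE = {"bad", "hate", "lose", "boring"}

def sentiment_score(tokens):
    """Positive minus negative term count; > 0 means a positive tweet."""
    pos = sum(1 for t in tokens if t.lower() in POSITIVE)
    neg = sum(1 for t in tokens if t.lower() in NEGATIVE)
    return pos - neg

print(sentiment_score(["What", "an", "amazing", "match"]))  # 1
```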

Sorry for all the questions, I am hopeless. I am certainly not stupid, but with zero IT background (I study business), this is rather complicated and frustrating.

P.S. Another task I just realised I will be facing is different languages. For simplicity's sake I would like to keep only English tweets (maybe German at a later stage of my seminar, if my supervisor wants it). Is this possible?

I just realised this might be part of the reason for the number of non-existent words: the preprocessing nodes have to process partly Spanish and Italian tweets without "being notified" that they're there.
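To illustrate what I mean by filtering: in Python one could guess the language per tweet with a detector library and keep only the English ones (assuming the langdetect package here; I don't know yet which KNIME node corresponds to this):

```python
# Assumes: pip install langdetect
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def keep_english(tweets):
    kept = []
    for t in tweets:
        try:
            if detect(t) == "en":
                kept.append(t)
        except LangDetectException:
            pass  # too short / no letters: detection fails, skip tweet
    return kept
```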
