Hypothesis Testing

Hey everybody,

I'm analysing Twitter data on World Cup sponsors and need to test whether there are more tweets about the sponsors during the World Cup than in the same time period last year.

I considered the paired t-test, but it compares two columns, whereas I will either have two datasets or have to split one dataset in KNIME according to the year or something. Is there any way to do what I want to do?



The paired t-test compares matched data points measured in one condition and then in another, so it sounds ideal for what you need. For example, if you have the number of tweets for 1st June 2013 and 1st June 2014, and so on up until the end of the World Cup (the sooner the better!), then the 1st will be your first row and the subsequent days will be your next rows of data. In the node, choose the 2013 data as the first column and the 2014 data as the second column. The test then gives you a p-value: the probability of seeing a difference at least this large if the null hypothesis (no difference between the two years) were true.

Does this help, or does the format of your data not lend itself to this?

Also, instead of each row being a date, it could be a country, for instance.
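
If it helps to see what the paired test does with that two-column layout, here is a minimal sketch in plain Python, outside KNIME. The tweet counts are made up for illustration and the function name `paired_t` is my own, not anything from KNIME. It only computes the t statistic and degrees of freedom; the p-value would come from the t distribution, which the KNIME node (or a statistics library) works out for you.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t statistic, degrees of freedom) for a paired t-test."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    # The paired test works on the row-by-row differences.
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical daily tweet counts, one row per matching date.
tweets_2013 = [120, 135, 128, 140, 150]
tweets_2014 = [180, 210, 190, 205, 230]

t_stat, df = paired_t(tweets_2013, tweets_2014)
print(f"t = {t_stat:.2f} with {df} degrees of freedom")
```

A large positive t here would mean the 2014 counts are consistently higher than the 2013 counts on the matching days.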



Thank you for your reply! Sounds reasonable to me but I am stuck at getting the information into the right format.

Currently I don't have the whole dataset yet so I am testing on a smaller sample. Tweets are from the 18th-20th of June, so I want to perform a t-test not conditional on the year but on the day. I have the tweets in rows with the string and "I" columns:  title=tweet text, Year, Month, Day (and some others I need at another point in my workflow).

So I want to see whether there are significantly more tweets on the 20th than on the 19th for example. So I am comparing by the "day" column. But then I don't quite know how to proceed. Choosing the "day" column for both columns in the test does not really give me a reasonable result ;)

Well, the more data points you have, the more power the test has to detect an effect that is really there.

Ideally, if you are comparing by day, you will want two day columns: a before-World-Cup day and a during-World-Cup day. So in the example you mention, you would ideally have Wednesday 11th and Wednesday 18th in row 1, Thursday 12th and Thursday 19th in row 2, Friday 13th and Friday 20th in row 3, and so on. Alternatively, you may want to forget about direct day-to-day comparisons and look at a before-and-during scenario using an independent t-test, as it might not be clear whether one Wednesday is a good comparison with another Wednesday. For the before-and-during scenario, you will need a column with a label to differentiate between "before" and "during".

So the format you want to get to first is: day, month, year, number of tweets. To do this, use the GroupBy node: group on all the date columns and, in the options tab, aggregate on the tweet text with Count as the method.
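
To make the target format concrete, here is a rough Python equivalent of that grouping step. This is only a sketch: the rows are invented, and the column names `title`, `Year`, `Month` and `Day` are taken from your description of the table.

```python
from collections import Counter

# Hypothetical tweets, one row per tweet, with the date split into columns.
tweets = [
    {"title": "sponsor tweet a", "Year": 2014, "Month": 6, "Day": 18},
    {"title": "sponsor tweet b", "Year": 2014, "Month": 6, "Day": 18},
    {"title": "sponsor tweet c", "Year": 2014, "Month": 6, "Day": 19},
    {"title": "sponsor tweet d", "Year": 2014, "Month": 6, "Day": 20},
    {"title": "sponsor tweet e", "Year": 2014, "Month": 6, "Day": 20},
    {"title": "sponsor tweet f", "Year": 2014, "Month": 6, "Day": 20},
]

# Group on all the date columns and count the tweets in each group.
daily_counts = Counter((t["Year"], t["Month"], t["Day"]) for t in tweets)

for (year, month, day), n in sorted(daily_counts.items()):
    print(day, month, year, n)
```

Each output row is one day with its number of tweets, which is the shape the t-test needs.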

Then, to add a label, use the Rule Engine node. Set up a date rule for the start of the World Cup: if the date is before the start date, add the label "Before"; otherwise add the label "During".

Now run the independent t-test on this data set, using the label column as the grouping column and picking the two labels as group one and group two. For the test column, choose the number-of-tweets column.
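
The labelling and the independent test can be sketched in plain Python as well. Everything here is an assumption for illustration: the daily counts are invented, 12 June 2014 is used as the start date, the start date itself is counted as "During", and `label` and `independent_t` are my own helper names. The statistic shown is the pooled-variance (Student's) version; the unequal-variance (Welch) version would differ slightly.

```python
from datetime import date
from math import sqrt
from statistics import mean, variance

# Assumed start date of the World Cup, used only for this illustration.
WORLD_CUP_START = date(2014, 6, 12)


def label(day):
    """Label a date as 'Before' or 'During' the World Cup."""
    return "Before" if day < WORLD_CUP_START else "During"


def independent_t(a, b):
    """Return (t statistic, degrees of freedom), pooled-variance version."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    standard_error = sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / standard_error, na + nb - 2


# Hypothetical number of tweets per day.
daily_counts = {
    date(2014, 6, 9): 110,
    date(2014, 6, 10): 125,
    date(2014, 6, 11): 118,
    date(2014, 6, 12): 190,
    date(2014, 6, 13): 205,
    date(2014, 6, 14): 220,
}

# Split the counts into the two groups using the label.
groups = {"Before": [], "During": []}
for day, n in daily_counts.items():
    groups[label(day)].append(n)

t_stat, df = independent_t(groups["During"], groups["Before"])
print(f"t = {t_stat:.2f} with {df} degrees of freedom")
```

Unlike the paired test, this does not need the rows to line up day for day, and the two groups can have different sizes.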

Does this help?