Hypothesis Testing nodes


Many thanks for the hypothesis testing nodes covering the different t-tests; these are absolutely great. They will be very useful for statistically validating ongoing hypotheses.

I do, however, have a minor point about the Independent groups t-test node. The node asks for the grouping column and then asks you to select the two groups in the dropdown boxes. However, if the grouping column contains more than two groups, the node fails. Why is this?

Can the node be designed to filter out the other groups first? It would make the usability of the node more straightforward and simple.

If the user must filter out any other groups first, then I'm unsure what the purpose is of being able to select the two groups in the node.



Hi Simon, 

Thanks for the positive feedback and for pointing out this weirdness with the independent groups t-test node.  Indeed, the dropdown boxes are extraneous and we will be looking to have this cleaned up in a future release.  

Hi Aaron,
Thanks for the prompt reply. My preference for the node would be to keep the dropdown boxes and have the node filter out any remaining groups.
There can be numerous cases where you have a data set with many groups but are only interested in seeing whether there is a significant difference between two specific groups.
Is it possible to get the node to filter for just the two specified groups? It would make the node somewhat sleeker and more user friendly.


I tend to agree with you, but we are also entertaining another idea. Since data with more than two categories is typically best dealt with via ANOVA and some sort of multiple comparisons analysis, it might be justifiable to drop the category selection altogether and simply require exactly two categories of data in the column. This has the added benefit of making the UI for the node simpler, in addition to putting a small roadblock in front of what I would guess is a fairly tempting and easy mistake to make.

The reason I think I agree with you is that sometimes it is possible to have entries in a table which do not correspond to either category of data in the comparison. One example I can imagine is that some of the rows in your table are from category 1, some are from category 2, and some others belong to neither group but are some sort of instrument calibration or control measurements. Obviously you wouldn't want these included, but on the other hand it should be fairly easy to filter them out beforehand. Is this the type of situation you are considering? Do you have anything else that might push this one way or another?

We'd be very happy to hear from others in the community as well.



I am no expert on t-tests and ANOVA, but this is what I understand from my usage of them:

ANOVA will only tell you whether the categories as a whole differ significantly at a given confidence level. It will not tell you specifically whether Category A is significantly different from Category B. You may have 10 categories for which ANOVA shows a significant difference, but you may then want to see whether two specific categories differ, which is where the independent t-test would be useful.

For example, you may be looking at the average weight of an animal, with categories of Cat, Dog, Rat, Mouse, Guinea Pig, Bird and Bat and 100 animals chosen in each category. I am sure ANOVA will say there is a significant difference between the categories, but you may then be interested to know whether there is a significant difference between Rat and Guinea Pig only, which is what you would use the independent t-test for.

I know you can do some row filtering beforehand, but it would be cumbersome to keep going back and changing the row filter nodes every time you want to compare two different animals.
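To make the request concrete, here is a minimal Python sketch of "pick two groups from a column with many groups, then test only those". It is not how the KNIME node works internally; the function name `welch_t`, the row layout and the weights are all made up for illustration. It uses Welch's unequal-variance form of the t statistic and only computes the statistic and its approximate degrees of freedom, since the p-value would need a t-distribution function that the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(rows, group_a, group_b):
    """Welch's t statistic and approximate degrees of freedom.

    rows is a list of (group, value) pairs (a hypothetical layout).
    Rows belonging to any other group are simply ignored.
    """
    a = [value for group, value in rows if group == group_a]
    b = [value for group, value in rows if group == group_b]
    # Squared standard error contributed by each group.
    se_a = variance(a) / len(a)
    se_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


# Made-up weights in kg, purely illustrative.
rows = [
    ("Rat", 0.30), ("Rat", 0.35), ("Rat", 0.32), ("Rat", 0.28),
    ("Guinea Pig", 0.90), ("Guinea Pig", 1.00),
    ("Guinea Pig", 0.95), ("Guinea Pig", 1.05),
    ("Cat", 4.1), ("Cat", 3.8), ("Cat", 4.5),
]

t, df = welch_t(rows, "Rat", "Guinea Pig")
print(f"t = {t:.2f}, df = {df:.1f}")
```

Changing the two group names in the call is all it takes to compare a different pair, which is the convenience being asked for from the dropdown boxes.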



I'm not an expert either, and I haven't looked at these new nodes, but you need to be careful about running multiple pairwise comparisons and then looking at the 'significant' p-values. If you do 20 tests at the 0.05 level, then on average one of them will show significance by chance alone, even if there are no real differences.
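The arithmetic behind this warning can be sketched in a few lines of Python. This assumes the tests are independent and that every null hypothesis is true, which is a simplification (pairwise comparisons on the same data are not independent); the function name is made up for illustration.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive among num_tests tests.

    Assumes independent tests and that no real effect exists.
    """
    return 1 - (1 - alpha) ** num_tests


for num_tests in (1, 5, 20):
    rate = familywise_error_rate(num_tests)
    expected = num_tests * 0.05  # expected number of false positives
    print(f"{num_tests:2d} tests: P(at least one) = {rate:.3f}, "
          f"expected false positives = {expected:.2f}")
```

Under those assumptions, 20 tests give roughly a 64% chance of at least one spurious 'significant' result.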

Tukey's test can be used after ANOVA to do the pairwise comparisons systematically while correcting for multiple comparisons. Maybe this could be a node, or an option in the ANOVA node.

I guess, regarding Simon's request, it's sort of a difference between pre- and post-hoc testing. If you know in advance that you are interested in the difference between two groups (Cats and Dogs), then just do a t-test on those. If you start looping over all possible pairs post hoc, then the p-values from the t-tests will find the most different pair, but quoting significance is hazardous and you need to consider multiple comparison corrections.
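As a rough illustration of how quickly "all possible pairs" grows, here is a small Python sketch using the animal categories from earlier in the thread. It applies a Bonferroni correction, which is a simpler and more conservative stand-in and is not Tukey's test; it just divides the significance level by the number of comparisons.

```python
from itertools import combinations

animals = ["Cat", "Dog", "Rat", "Mouse", "Guinea Pig", "Bird", "Bat"]

# Every unordered pair of categories that a post-hoc loop would test.
pairs = list(combinations(animals, 2))

alpha = 0.05
# Bonferroni: each individual test must pass a stricter threshold.
bonferroni_alpha = alpha / len(pairs)

print(f"{len(pairs)} pairwise comparisons")
print(f"per-test threshold: {bonferroni_alpha:.5f}")
```

With 7 categories there are 21 pairs, so each individual t-test would need a p-value below about 0.0024 rather than 0.05. A single comparison planned in advance does not need this adjustment.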

After writing this I realize I'm a bit hazy on the details, so if anyone can say it better than me I'd be interested in that discussion!


That's a good point, Dave. As the number of categories increases, the chance of at least one pairwise comparison showing significance at the 95% level gets higher, even when the null hypothesis is true.

I don't know what the best hypothesis tool is for dealing with this scenario. Maybe the Tukey's test you mention, but I am unfamiliar with it.


Thanks indeed, Dave. I did some reading on the wiki about this, as I also hadn't looked at it in a while. The articles on post-hoc analysis and multiple comparisons were quite useful. It seems there are a number of different approaches one can take to account for this effect, and some are better than others depending on the type of data you are looking to analyze.

A multiple comparisons node is starting to sound better and better.

This is great information. Not all of us are experts in these matters of working out confidence intervals and what's statistically significant, and post-hoc analysis.

More hypothesis nodes taking these different scenarios into account would be really powerful for KNIME and its users. It would be important, however, to explain well in the node description when to use each type of hypothesis node, so we don't all need to be statistics experts.




Sorry to revisit an old thread, but I'm wondering if a Tukey Test was introduced? I'm looking to use it following One-Way Analysis of Variance.

Also, is there an implementation of the Kruskal-Wallis test as an alternative to ANOVA for non-parametric data? While ANOVA is pretty robust, I've used Kruskal-Wallis for data that's clearly too far from parametric for ANOVA to be suitable.




Not yet, sorry.  Is this something you would consider doing with an R snippet node?
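For anyone going the scripting route in the meantime, the Kruskal-Wallis H statistic itself is short enough to sketch by hand. The version below is in Python rather than R, the function name is made up, and it omits the tie correction, so treat it as an illustration of the calculation rather than a replacement for a statistics package. Turning H into a p-value would additionally need a chi-square distribution with (number of groups - 1) degrees of freedom, which is not included here.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction).

    groups is a list of lists of numbers, one list per category.
    """
    pooled = sorted(value for group in groups for value in group)
    n = len(pooled)

    # Assign each distinct value its average 1-based rank in the pooled data.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    # Sum over groups of (rank sum squared) / (group size).
    rank_term = sum(
        sum(rank_of[value] for value in group) ** 2 / len(group)
        for group in groups
    )
    return 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)


# Made-up data: three completely separated groups.
print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Because the test works on ranks rather than raw values, it makes no assumption that the data in each group are normally distributed.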