Hypothesis node for calculating p-values of regression models

When using a linear regression model to predict a value, you are often using numerous data columns in the model.

The more data columns that are used relative to the number of data points (i.e. the fewer the degrees of freedom), the higher the likelihood of getting a chance correlation.

Please could you design a hypothesis node which will take two columns for comparison (one the actual data, the other the predicted data), compute the correlation coefficient, and then, using a user input for the number of data columns used to build the prediction, calculate a p-value.
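In the meantime, the calculation itself is straightforward outside KNIME. The sketch below (Python with SciPy; the function name and example numbers are illustrative, not an existing node) turns an observed correlation r between the actual and predicted columns, the number of data points, and the number of X columns into an overall F-test p-value:

```python
from scipy import stats

def regression_p_value(r, n_points, n_predictors):
    """P-value for an observed correlation r between predicted and actual
    values, given n_points observations and n_predictors X columns used
    to build the model (overall F-test for the regression)."""
    r2 = r ** 2
    df_model = n_predictors                 # numerator degrees of freedom
    df_resid = n_points - n_predictors - 1  # denominator degrees of freedom
    f_stat = (r2 / df_model) / ((1 - r2) / df_resid)
    return stats.f.sf(f_stat, df_model, df_resid)

# Example: r = 0.84 (R2 ~ 0.7), 20 data points, 3 predictor columns
p = regression_p_value(0.84, 20, 3)
```

Note how the same r gives a larger (worse) p-value as the predictor count rises, which is exactly the "chance correlation" penalty described above.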



I would instinctively want something like this to go directly into the predictor node, but am guessing that this will be a bit of work to implement. For example, we would probably need to update our PMML from the learner to include the values needed to calculate these statistics. It seems possible, anyway.

Before we go too far down that road, however, please have a look at the linked website and let me know if these are the sorts of things that you are looking for. I would expect that a confidence interval for each of the predicted values, along with a table containing a summary and another with the coefficients, as listed in the linked site, would be everything you always wanted, no?


If we can get the desired behavior described in a bit more detail, I can log it in our system as a feature request.

For now, as always, if you can't do it directly in KNIME, at least you can probably do it in KNIME in R :)


I'm not sure what a chance correlation means here. Good numerical correlation does not imply causation, but it does indicate that the dependent variable can be predicted from the correlated independent variable.

Would this not be possible using the correlation node in the following two ways:

1. Identify all linearly correlated X columns without looking at Y.

2. Identify all X columns above a given correlation threshold (say 0.7) with Y.

Keep all columns that satisfy 2., and then, among the X columns left over, look at their mutual correlation and choose those with the greater correlation with Y.

Some numerical and row/column filter nodes, and possibly the Transpose node, may be required.

PCA can also be used to find columns with high weights in principal components 1-3.
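Steps 1 and 2 above can be sketched in a few lines of pandas (the column names, toy data, and the 0.7 threshold are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy data: x2 is nearly a copy of x1, x3 is independent of both
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 2 + rng.normal(scale=0.01, size=100),
    "x3": rng.normal(size=100),
})
y = pd.Series(x1 + rng.normal(scale=0.1, size=100), name="y")

# Step 2: keep X columns whose |correlation| with Y exceeds the threshold
corr_with_y = df.corrwith(y).abs()
kept = corr_with_y[corr_with_y > 0.7].index.tolist()

# Step 1 / mutual correlation: among the kept columns, drop the member
# of each highly correlated pair that correlates less with Y
mutual = df[kept].corr().abs()
to_drop = set()
for i, a in enumerate(kept):
    for b in kept[i + 1:]:
        if mutual.loc[a, b] > 0.7:
            to_drop.add(a if corr_with_y[a] < corr_with_y[b] else b)
selected = [c for c in kept if c not in to_drop]
```

In KNIME itself the same logic maps onto the Linear Correlation node plus the filter nodes mentioned above.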

Hi in silico. The trouble with using a large number of X columns to build a multiple linear regression model, even with an r of 0.7, is that you could end up with a model which appears to be very predictive but is in fact just a chance correlation: with so many variables, it becomes possible to get the X columns to fit and "predict" the Y column by luck alone.

It would be very useful to quantify what that chance correlation is (i.e. what the p-value is) for the given model correlation (r), considering how many data points there are and how many X columns were used.

I.e. an MLR model which uses 3 data columns with an R2 of 0.7 is a much better model than an MLR which uses 6 data columns with an R2 of 0.7, as the risk of a chance correlation is considerably lower. So what is the p-value for each?
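To put numbers on this exact comparison, the overall F-test can be computed directly from R2, the number of data points, and the number of predictors (the sample size of 20 below is chosen purely for illustration):

```python
from scipy.stats import f

def model_p_value(r2, n, k):
    """Overall F-test p-value for a model with coefficient of
    determination r2, n data points and k predictor columns."""
    f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
    return f.sf(f_stat, k, n - k - 1)

n = 20
p_3cols = model_p_value(0.7, n, 3)   # 3 X columns
p_6cols = model_p_value(0.7, n, 6)   # 6 X columns, same R2
# With equal R2, the 6-column model has the larger (worse) p-value
```

This quantifies the claim above: at the same R2, every extra X column eats a degree of freedom and weakens the statistical evidence that the fit is genuine.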


Hi Aaron, getting the statistics into the predictor nodes would be the best-case scenario, but I realise this may be a lot of extra work. The reported p-values on each independent variable would be very useful, and the F value and its associated p-value for the model would be very desirable, as would the R2 value. This link also explains the same thing, but I understood this one a bit better: http://dss.wikidot.com/multiple-regression. But yes, the output contents in your link would be perfect!
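For reference, everything in that sort of summary table (per-coefficient t-tests, the overall F-test, and R2) can be computed from first principles. A rough NumPy/SciPy sketch on made-up data, not an existing KNIME node:

```python
import numpy as np
from scipy import stats

# Made-up data: y depends on the first X column only
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = 3.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=30)

Xd = np.column_stack([np.ones(len(y)), X])   # add intercept column
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ beta
n, p = Xd.shape
df_resid = n - p

# Per-coefficient standard errors, t-statistics and p-values
sigma2 = resid @ resid / df_resid
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Xd.T @ Xd)))
t_vals = beta / se
p_vals = 2 * stats.t.sf(np.abs(t_vals), df_resid)

# Overall R2, F-statistic and model p-value
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - (resid @ resid) / ss_tot
f_stat = (r2 / (p - 1)) / ((1 - r2) / df_resid)
f_p = stats.f.sf(f_stat, p - 1, df_resid)
```

These are the same quantities a regression summary table reports, so a future node (or an R snippet) would just need to expose them.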

Thank you very much for looking into this. As we use the modelling facilities more and more, it's becoming increasingly important to know the models are statistically significant.

Unfortunately I don't do R (I'm allergic to scripting), so I will have to look forward to a future implementation :-)


Hi Simon,

You are really describing a generic situation with building predictive models in general and regression in particular.

P-values would tell us whether the null hypothesis (that the expected values of the predicted and known data are the same, given that the data are normal and the variance ratio follows an F-distribution) is true.

But I can't see how that is in any way related to choosing the correct features/descriptors when building models. The assumption, obviously, is that your choice of descriptors is such that it captures the essential features of the problem, e.g. TPSA for BBB.

You can certainly overfit a model with a limited dataset and still get a high R. The same thing can happen when a higher-order polynomial is used for curve fitting. This certainly results in a tight fit to the data but is very poor at generalization, i.e. prediction of newer values. Neural nets and regression methods are also better at interpolation than extrapolation, so that may also give a false impression of a good fit.
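The polynomial point is easy to demonstrate; in the sketch below (the degree, data, and evaluation point are arbitrary choices for illustration), a degree-8 polynomial passes essentially exactly through nine noisy points drawn from a simple linear trend, yet misbehaves just outside the training range:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Nine noisy training points from a linear trend y = 2x
rng = np.random.default_rng(2)
x_train = np.linspace(0, 1, 9)
y_train = 2 * x_train + rng.normal(scale=0.2, size=9)

# Degree-8 polynomial: enough terms to interpolate all nine points
overfit = Polynomial.fit(x_train, y_train, deg=8)
train_err = np.max(np.abs(overfit(x_train) - y_train))

# ...but extrapolation beyond the training range goes badly wrong
extrap_err = abs(overfit(1.2) - 2 * 1.2)
```

The near-zero training error is the "tight fit" above; the large error at x = 1.2 is the poor generalization.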

I am sure that you are aware of the above pitfalls and of the need for diverse training datapoints (proper sampling), no big gaps in Y values, descriptor/property selection algorithms based on influence on the Y value (Weka in its original form has many such), cross-validation, different kinds of error measures (RMS, MAE), and the removal of correlated X columns.

So if these data-mining/statistical techniques are followed in KNIME/Weka, it should really address your concern about "chance predictions", which in my view is really a question about proper feature/descriptor selection and avoiding overfitting.



Many thanks for the detailed feedback, in silico.

I am certainly no expert on modelling and statistical analysis, but do my best to understand as much as reasonably possible.

I generally follow the points you outlined in your post to try and make sure I don't overfit the model, i.e. generating a brilliant model with the supplied data but one that is hopeless with new data points.

I think my main reason for asking for some quantifier of reliability is for when you present findings on a good model and colleagues, surprised at how good it is, question its reliability. It would be useful to provide a piece of statistical data quantifying the chance of seeing the observed correlation when the null hypothesis is in fact true (i.e. there is no genuine correlation). Otherwise it's difficult to provide a reassuring answer other than to say that the best protocols for model generation have been followed.