I have been having problems performing a simple linear regression on a set of data.
The data has thousands of rows, and the independent numerical values are z-score/Gaussian normalized across each row - that is to say, if there are 4 independent columns (A, B, C, D), then for every row the mean of those 4 values is 0 and their standard deviation is 1.
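To make the setup concrete, here's a small sketch (Python/NumPy, not my actual KNIME workflow) of the kind of row-wise normalization I mean:

```python
import numpy as np

# Hypothetical stand-in for my data: 5 rows, 4 independent columns A-D.
rng = np.random.default_rng(0)
data = rng.normal(size=(5, 4))

# z-score normalize ACROSS each row (not down each column).
row_mean = data.mean(axis=1, keepdims=True)
row_std = data.std(axis=1, keepdims=True)
normalized = (data - row_mean) / row_std

print(normalized.mean(axis=1))  # ~0 for every row
print(normalized.std(axis=1))   # 1 for every row
```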
When I try to perform a linear regression (using the Linear Regression Learner node), including all of the independent columns produces an invalid model, while removing any single column gives a sensible model. Using the WEKA Linear Regression (3.7) node, I can straightforwardly produce a linear regression model which gives me sensible coefficients, so this does seem to be an issue with the Linear Regression Learner node rather than a problem in what I’m actually trying to achieve. Though happy to be proven wrong!
I’ve created an example workflow, below, which demonstrates the problem: if all random columns (A-I) are used to predict J, the learner gives an invalid model - take just one of them out, and the results are fine. Any ideas what’s going on? Linear Regression Normalization Issue.knwf (27.8 KB)
Starting with your sample workflow, regardless of whether I use the KNIME Linear Regression Learner, the Weka node, or an R Snippet, I’m producing a model with coefficients of -1 and an intercept of essentially zero (it’s a 10^-16 value) when I predict J (Normalized).
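For reference, here is roughly the same check sketched outside of KNIME with NumPy on synthetic row-normalized data (not your attached workflow, but the same setup as I understand it):

```python
import numpy as np

# 10 random columns A-J, z-score normalized across each row.
rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 10))
data = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)

# Ordinary least squares: predict J (last column) from A-I plus an intercept.
X = np.column_stack([np.ones(len(data)), data[:, :9]])
y = data[:, 9]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(coef)  # intercept ~1e-16, all nine slope coefficients ~-1
```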
Are you seeing something different? And just for clarity, what version of KNIME and what OS are you using?
Are you also predicting one of the normalized columns in your original use case?
I ran the workflow you provided and got the same result as Scott, which should be mathematically correct.
Let me illustrate how it works:
Let’s assume you have n values v_1 to v_n. Since z-score normalization forces their mean to be 0, the following holds after normalization: v_1 + v_2 + … + v_n = 0
If you now try to predict e.g. v_n from the other values, you can immediately see that v_n = -v_1 - v_2 - … - v_{n-1}
(I put the index of the last value in curly braces to indicate that the -1 is part of the index).
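Written out slightly more formally, this is just the same argument per row:

```latex
% z-score normalization forces the per-row mean to zero:
\bar{v} = \frac{1}{n}\sum_{i=1}^{n} v_i = 0
\quad\Longrightarrow\quad
\sum_{i=1}^{n} v_i = 0
\quad\Longrightarrow\quad
v_n = -\sum_{i=1}^{n-1} v_i = -v_1 - v_2 - \dots - v_{n-1}
```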
Thanks for the helpful responses. @nemad - your explanation makes a lot of sense. I can see how, for an individual row, it seems nonsensical, but it wasn’t obvious to me that the conclusion generalizes to an entire dataset. JASP mentions the need for a positive-definite variance-covariance matrix to perform a linear regression.

In the real process I’m actually using the normalized columns with a dependent variable which is not part of the normalization, but as soon as I include all of the normalized columns in the Linear Regression node, the model breaks. However, it does seem to me that the WEKA node is able to produce a model, even when predicting one of the normalized columns? Also, I’m able to see the linear correlations between the normalized variables - how is that working?

See the attached workflow, which produces a viable (for random data!) set of coefficients with the WEKA node, but not with the KNIME one, and also shows the linear correlations. Oddly, when piped into the predictor, it doesn’t appear to be using the very large negative coefficients stated in the output of the learner node, and it produces sensible predictions…? It’s this (limited) success that makes me feel that what I’m expecting is in some way reasonable - if every approach had failed I might have given up long ago!
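To convince myself of what’s going on, I sketched the second scenario in NumPy (an assumed setup with synthetic data, not the actual attached workflow): all of the row-normalized columns are used as predictors for a target that was not part of the normalization. The predictor columns then sum to zero in every row, so the design matrix is rank deficient; an SVD-based solver can still return one of the infinitely many least-squares solutions, which I assume is why some tools report coefficients (and sensible predictions) while others report an invalid model.

```python
import numpy as np

# 10 random columns A-J, z-score normalized across each row.
rng = np.random.default_rng(2)
data = rng.normal(size=(1000, 10))
data = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)

# Dependent variable that is NOT part of the row-wise normalization.
target = rng.normal(size=1000)

# Intercept plus all 10 normalized columns as predictors.
X = np.column_stack([np.ones(1000), data])

# Every row of A-J sums to 0, so the predictor columns are exactly linearly
# dependent and the design matrix loses a rank.
print(np.linalg.matrix_rank(X))  # 10, not 11
print(np.linalg.cond(X.T @ X))   # enormous: X'X is (numerically) singular

# An SVD-based solver still returns the minimum-norm least-squares solution.
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
print(coef)
```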