# Linear Regression Learner breaking with row-normalized data

I have been having problems performing a simple linear regression on a set of data.

In the data, there are thousands of rows, but the independent numerical values are z-score/Gaussian normalized across each row - that is to say, if there are 4 independent columns (A, B, C, D), the mean of those 4 values for each row is 0, and the standard deviation of those 4 values is 1 for each row.

When I try to perform a linear regression (using the Linear Regression Learner node), including all of the independent columns produces an invalid model. Removing any single column gives a sensible model. Using the WEKA Linear Regression (3.7) node, I can straightforwardly produce a linear regression model which gives me sensible coefficients, so this does seem to be an issue with the Linear Regression Learner node, not a problem with what I’m actually trying to achieve. Though happy to be proven wrong!

I’ve created an example workflow, below, which demonstrates the problem: if all random columns (A-I) are used to predict J, the learner gives an invalid model. Take just one of them out, and the results are fine. Any ideas what’s going on?

Linear Regression Normalization Issue.knwf (27.8 KB)

Hi @decates -

Starting with your sample workflow, regardless of whether I use the KNIME Linear Regression Learner, the Weka node, or an R Snippet, I’m producing a model with coefficients of -1 and an intercept of essentially zero (it’s a 10^-16 value) when I predict J (Normalized).

Are you seeing something different? And just for clarity, what version of KNIME and what OS are you using?

Hi @decates,

are you also predicting one of the normalized columns in your original use-case?
I ran the workflow you provided and got the same result as Scott, which should be mathematically correct.
Let me illustrate how it works:
Let’s assume each row has n values v_1 to v_n. After row-wise z-score normalization the mean of those values is 0, so their sum is 0 as well:
v_1 + v_2 + … + v_n = 0
If you now try to predict, e.g., v_n from the other values, you can immediately see that
v_n = -v_1 - v_2 - … - v_{n-1}
(I put the index of the last value in curly braces to indicate that the -1 is part of the index).
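
The identity can be checked numerically. The following is a minimal sketch in Python with NumPy rather than a reproduction of the KNIME workflow: the random data, the seed, the choice of 4 columns, and all variable names are assumptions made for illustration. It z-score normalizes each row, then fits an ordinary least-squares model that predicts the last column from the others.

```python
import numpy as np

# Hypothetical data: 1000 rows, 4 columns of random values (not the workflow's data).
rng = np.random.default_rng(0)
raw = rng.normal(size=(1000, 4))

# Row-wise z-score: every row ends up with mean 0 and standard deviation 1.
# (The sum-to-zero property only needs the mean to be 0, so the ddof choice
# used for the standard deviation does not matter here.)
row_mean = raw.mean(axis=1, keepdims=True)
row_std = raw.std(axis=1, keepdims=True)
z = (raw - row_mean) / row_std

# Predict the last column from the remaining ones, with an intercept term.
X = z[:, :-1]
y = z[:, -1]
design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares via NumPy's least-squares solver.
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, slopes = coef[0], coef[1:]

print("intercept:", intercept)
print("slopes:", slopes)
```

Under these assumptions every slope comes out as -1 and the intercept is zero up to floating-point error, which is consistent with the coefficients reported earlier in the thread.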

Kind regards,