I am trying to execute a logistic regression model with a binary response variable, 16 categorical variables with different numbers of levels, and 1 numerical variable, which is normalized. The data has 850,000 rows. I have only added the 17 variables that produced a proper model in the other software. I have checked that there are no missing values in the data.
Errors and warnings: the node executed to 91% and then failed with the following warning and error:
WARN Logistic Regression Learner 0:5 The covariance matrix could not be calculated because the observed fisher information matrix was singular. Did you properly normalize the numerical features?
ERROR Logistic Regression Learner 0:5 Execute failed: Cell count in row “Row1” is not equal to length of column names array: 6 vs. 3
There is just one numeric variable, and it has been normalized. I am not sure how to decode the error message. Thanks a lot for the help.
As the warning states, the observed Fisher information matrix is singular. Unfortunately, there are different scenarios that can lead to this degenerate case, each of which requires a different strategy to resolve:
The Stochastic Average Gradient (SAG) algorithm didn't find a good solution: SAG is roughly based on gradient descent, and as such it is sensitive to the gradient of the loss function, which in turn is sensitive to the scale of the features (hence the hint about proper normalization). Another possible culprit is too large a step size (it can be changed in the Advanced tab), although I have not yet encountered this problem with the learning rate at its default value.
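To illustrate the scale sensitivity, here is a small sketch using scikit-learn's SAG solver as a stand-in for KNIME's implementation (the data and parameters are made up for the demo):

```python
# Illustration with scikit-learn's SAG solver (a stand-in for KNIME's):
# gradient-based solvers struggle when one feature is on a vastly larger
# scale than the others.
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings("ignore", category=ConvergenceWarning)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
X[:, 1] *= 1e6                                   # one feature on a huge scale
y = (X[:, 0] + X[:, 1] / 1e6 > 0).astype(int)    # label depends on both features

# Unscaled: SAG typically exhausts the iteration budget without converging well.
raw = LogisticRegression(solver="sag", max_iter=50).fit(X, y)

# Normalized: the same solver converges quickly to an accurate model.
Xs = StandardScaler().fit_transform(X)
scaled = LogisticRegression(solver="sag", max_iter=50).fit(Xs, y)
print("iterations:", raw.n_iter_[0], "vs", scaled.n_iter_[0])
```

The only difference between the two fits is the normalization step, which is exactly what the warning message is hinting at.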
You could actually circumvent the problem by deactivating the statistics calculation in the Advanced tab, but if SAG failed to converge, the resulting model will hardly be of any use.
Note that the alternative solver (iteratively reweighted least squares, IRLS) is less sensitive to scale and doesn't require a learning rate, so it is always worth trying as an alternative.
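For intuition, IRLS is essentially Newton's method applied to the logistic log-likelihood: each step solves a weighted least-squares problem, so no learning rate appears anywhere. A minimal numpy sketch (my own illustration on synthetic data, not the KNIME implementation):

```python
# Compact IRLS (Newton's method) for logistic regression: each iteration
# solves a weighted least-squares system, so there is no step size to tune.
import numpy as np

def irls(X, y, n_iter=25):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # current probabilities
        w = p * (1.0 - p)                        # IRLS weights
        H = X.T @ (w[:, None] * X)               # observed Fisher information
        # Newton step: solve (X^T W X) delta = X^T (y - p)
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    return beta

rng = np.random.default_rng(2)
X = np.hstack([np.ones((500, 1)), rng.normal(size=(500, 1))])  # intercept + 1 feature
true_beta = np.array([-0.5, 2.0])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
b = irls(X, y)
print(b)   # estimates should land near [-0.5, 2.0]
```

Note the `np.linalg.solve(H, ...)` call: this is also where a singular Fisher information matrix makes IRLS break down, which ties back to the warning above.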
Your features are highly correlated or your data is linearly separable: In this scenario, you can add regularization in the form of a prior on the weights (e.g. a Gauss prior, which corresponds to L2 regularization) to ensure that the Fisher information matrix is invertible. The respective settings can be found in the Advanced tab. Note that regularization is currently only implemented for the SAG solver and not for the IRLS solver, which therefore won't work in this scenario.
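A small numpy sketch of why the prior helps (my own illustration, not KNIME code): with perfectly correlated features the observed Fisher information X^T W X is rank-deficient, but adding the Gauss-prior (L2) term lambda * I restores invertibility:

```python
# With an exactly duplicated feature, the observed Fisher information
# X^T W X is singular; adding the L2 (Gauss prior) term lambda * I fixes it.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 1))
X = np.hstack([x, 2 * x])                # second column perfectly correlated

w = np.full(100, 0.25)                   # p * (1 - p) weights at p = 0.5
fisher = X.T @ (w[:, None] * X)          # observed Fisher information

lam = 1.0                                # strength (precision) of the Gauss prior
print(np.linalg.matrix_rank(fisher))                     # 1 -> singular
print(np.linalg.matrix_rank(fisher + lam * np.eye(2)))   # 2 -> invertible
```

Any lam > 0 makes the regularized matrix strictly positive definite, which is why the prior guarantees that the covariance calculation can succeed.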
For completeness: this issue is also present whenever the number of features exceeds the number of rows, because the data is then always linearly separable. Please note that the number of features is not necessarily the number of columns, because categorical columns are mapped onto multiple features by one-hot encoding.
In your particular case it is unlikely that this is the issue, given your large dataset, but I wanted to include the point in case someone else runs into this problem.
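To see how the feature count can outgrow the column count, here is a hypothetical pandas example (the encoding is analogous to what the learner does internally with categorical columns; the data is invented):

```python
# Two categorical columns expand into many more features once one-hot encoded.
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "blue", "red"],
    "size":  ["S", "M", "L", "XL"],
})
encoded = pd.get_dummies(df)             # one dummy feature per distinct level
print(df.shape[1], "columns ->", encoded.shape[1], "features")  # 2 -> 7
```

With 16 categorical columns, the effective feature count grows with the total number of distinct levels, so it can be much larger than 17 even though the column count stays small.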
Finally, may I ask which KNIME version you are using? The error you encountered should be fixed in current versions (3.5.3 and 3.6.0).