I am trying to build a linear regression learner. One of my fields originally had too many distinct values, so I received the error “Column “XXX” has too many different values - will be ignored during training”. To overcome this, I used a One to Many node to create a separate column for each value in the original field, and then I started to receive this error instead:
The following columns are redundant and will not contribute to the model: ABCDE. Coefficient statistics will not be accurate and contain missing information.
When I ran the regression, the results were very screwy: the intercept had a value of 8,600,000 and all of the individual coefficients had a value of -8,600,000. (The predicted results of this regression should be in the single digits.)
Here is a sanitized copy of the data. As you can see, the field I am trying to predict is called “Life Expectancy”. The field called “Species” has over 300 values and is tripping up the regression. (Before anyone asks: no, it cannot be aggregated into groups of species; each species has to remain separate.)
A few things come to mind by glancing at your dataset:
1. I wonder if you have enough observations in your data set? Is the sanitized version the full data set? You have many groups that only have 1 - 3 observations, and a regression won’t give good results for those. (I.e. a line cannot be fitted through a single point.)
2. Your One to Many output should actually contain one fewer column. My understanding of how dummy variables work is that you need n-1 of them. To clarify, if you have a column for every species (e.g. 327), each row has a 1 in exactly one of those columns and a 0 in the rest, so the columns always add up to 1 and duplicate the constant term. That would explain both the “redundant columns” message and the huge intercept with offsetting coefficients. With one fewer column, an observation with all 0s is the baseline and is covered by the constant. Wikipedia has a decent explanation of the dummy variable trap: https://en.wikipedia.org/wiki/Dummy_variable_(statistics)
3. You should create dummy variables for Tier and Level.
Also, how important do you think Level is to the overall model? I would try to impute the missing values, or get rid of the variable altogether.
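For anyone who wants to see the dummy variable trap outside of KNIME, here is a minimal Python sketch. It is only an illustration under assumptions: the species labels are made-up toy data (not the poster's dataset), and `one_hot`, `design_matrix` and `matrix_rank` are hypothetical helper names written for this example, not KNIME functionality. It builds the intercept plus dummy columns both ways and checks whether the columns are linearly independent.

```python
from fractions import Fraction


def one_hot(values, drop_first=False):
    """Return (category_names, rows) of 0/1 dummies for a list of labels."""
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]  # the dropped category becomes the baseline
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def design_matrix(values, drop_first):
    """Prepend an intercept column of 1s to the dummy columns."""
    names, dummies = one_hot(values, drop_first)
    return ["intercept"] + names, [[1] + row for row in dummies]


def matrix_rank(rows):
    """Rank via exact Gaussian elimination (Fractions avoid rounding issues)."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # this column depends on the earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical toy data: three species, two observations each.
species = ["cat", "dog", "owl", "cat", "dog", "owl"]

full_names, full_X = design_matrix(species, drop_first=False)
reduced_names, reduced_X = design_matrix(species, drop_first=True)

full_rank = matrix_rank(full_X)
reduced_rank = matrix_rank(reduced_X)

# One dummy per species plus intercept: 4 columns but only rank 3 (redundant).
print(len(full_names), full_rank)
# One dummy dropped: 3 columns and rank 3, so the coefficients are identifiable.
print(len(reduced_names), reduced_rank)
```

With every species as its own column, the rank is one less than the number of columns, so there is no unique set of coefficients; dropping a single dummy column makes the rank match the column count. In KNIME terms this corresponds to excluding one of the One to Many output columns before the learner.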
@Snowy, thank you, your point #2 solved the issue. I removed one of the variables from the species columns and got the results that I expected. Thank you very much!