Currently I'm using the XGBoost nodes of KNIME to train an XGBoost model on 11,000 data points with different columns. In some of the categorical columns it can happen that a new value is present in new data. For example: the model is trained on column A with categories a, b and c.
Later on, when I want to make predictions, it happens that there is a new category d, but it can also be e, f, or whatever.
Now I want to use the xgboost predictor on new data, but I always get the error "Execute failed: Unknown categorical value ‘adnwf’ "
How can I tell the XGBoost Predictor that it should ignore those rows and not predict them, or handle them as if they were empty?
I can't alter all these entries via the Rule Engine, because there are too many, and every time I make predictions there will be new categorical values.
If new rows keep having new values in this column, is it really a good variable to build a model from?
Yes, it is the best variable in the model. The variable has around 80 classes right now. The model is retrained from time to time, so the new classes get included. Ignoring the data points with new classes is also an option, but I don't know how to do this. Right now I always get these errors, which makes the XGBoost nodes unusable for me.
When training the model, store a table with all the classes in it.
When predicting, read this "classes table" in and do a Reference Row Filter against the data you want to predict. That way you can remove the rows with new classes.
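Outside of KNIME, the same idea can be sketched in plain Python (a minimal illustration, not the KNIME node's actual implementation; the column name "A" and the sample values are made up):

```python
# At training time: record every class seen in the categorical column "A".
train_rows = [
    {"A": "a", "y": 0},
    {"A": "b", "y": 1},
    {"A": "c", "y": 0},
]
known_classes = {row["A"] for row in train_rows}

# At prediction time: split off rows whose class was never seen in training,
# so only the rows the model can handle are passed to the predictor.
new_rows = [{"A": "a"}, {"A": "d"}, {"A": "b"}, {"A": "adnwf"}]
to_predict = [r for r in new_rows if r["A"] in known_classes]
skipped = [r for r in new_rows if r["A"] not in known_classes]

print([r["A"] for r in to_predict])  # ['a', 'b']
print([r["A"] for r in skipped])     # ['d', 'adnwf']
```

The "classes table" written at training time plays the role of `known_classes` here, and the Reference Row Filter plays the role of the two list comprehensions.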
Given your description of the issue if this variable is important, not predicting the rows with a new value in it is the only reasonable choice.
Note that XGBoost doesn't really support categorical variables. The KNIME node is probably doing either a one-hot encoding or a label encoding, and hence you get the error. So in theory the node could be extended with an option to ignore such rows.
On top of that, it would actually be interesting to know whether it does a one-hot or a label encoding. One-hot encoding potentially creates a lot of additional columns, while label encoding is problematic especially with XGBoost, which internally treats everything as numeric (as in regression). That means the distance between the labels matters: whatever has label 1 will be considered closer to label 2 than to label 3. This is obviously often not the case, as categorical features usually don't have a measurable distance between them. Hence one-hot encoding should be preferred.
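The difference between the two encodings can be shown in a few lines of Python (the category names are made up for illustration):

```python
categories = ["red", "green", "blue"]

# Label encoding: each category becomes one integer. "red" (0) now looks
# numerically closer to "green" (1) than to "blue" (2), although no such
# distance exists between the colours.
label_enc = {c: i for i, c in enumerate(categories)}

# One-hot encoding: each category becomes its own 0/1 column, so all
# categories are equidistant, at the cost of one column per class.
one_hot = {c: [int(c == other) for other in categories] for c in categories}

print(label_enc["green"])  # 1
print(one_hot["green"])    # [0, 1, 0]
```

With 80 classes, one-hot encoding means 80 extra columns, which is the trade-off mentioned above.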
You could do the one-hot encoding yourself, which would also solve the problem of the invalid rows (but you should be wary of how these rows get predicted).
@ArminFan, the node currently doesn't offer an option that would allow for your use-case directly.
However, since it supports missing values, I created a ticket to optionally treat unknown values as missing values.
For now, this strategy also gives you an idea of how to work around the issue. Instead of completely removing the rows with unknown values, you could split them from your data using @beginner's suggestion and replace the feature in those rows with missing values. Then the predictor should accept them.
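This workaround can be sketched in Python as well (again a toy illustration with a made-up column name, not KNIME's internals):

```python
# Instead of dropping rows with unknown classes, blank out the feature so
# the predictor treats it as a missing value, which XGBoost can handle.
known_classes = {"a", "b", "c"}  # the classes seen at training time

new_rows = [{"A": "a"}, {"A": "adnwf"}, {"A": "b"}]
for row in new_rows:
    if row["A"] not in known_classes:
        row["A"] = None  # None stands in for a KNIME missing value

print(new_rows)  # [{'A': 'a'}, {'A': None}, {'A': 'b'}]
```

In a KNIME workflow the same effect could be achieved by splitting the rows, setting the column to missing for the unknown-value branch, and concatenating again before the predictor.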
Regarding @beginner’s question: KNIME does a one-hot encoding for categorical variables for the reason you outlined.
Thanks for your answer. It worked for me, and for now it is good enough because there are not that many features.
But in general this is not really practical, because you need to do this for each categorical predictor, which is pretty tedious.
Would be nice to have this feature included in the XGBoost Predictor node, as in the Random Forest Predictor node.
I am having the same issue as described above. I am managing a model factory with many variables and models, and sometimes a variable can get a new value, causing all of the models using this variable to fail.
@nemad, has the solution you mentioned been implemented? I am currently using version 4.2.3.
This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.