Hello! I trained a decision tree model in KNIME. One of the categorical variables has four values: a, b, c, and d. Say values a, b, and d showed up in the training set, but c did not.
When I try to use the trained model on the test set, I get an error saying that the test data contains a value that was not accounted for in the model. How should I handle this in KNIME?
If I recall correctly, in Weka there is a way to specify all values of a variable at the beginning of the ARFF file. But I am not sure how to do something similar in KNIME. Any hint would be great!
Maybe you should perform stratified sampling when dividing the data into learner and test sets. That will ensure that each category is always represented.
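For anyone who wants to see the idea outside KNIME, here is a rough plain-Python sketch of a stratified split. It is not what any KNIME node does internally; the function name `stratified_split`, the column name `category`, and the "at least one row per category goes to training" rule are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, column, train_fraction=0.7, seed=0):
    """Split rows into (train, test), sampling each value of `column` separately."""
    rng = random.Random(seed)

    # Group the rows by their category value.
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)

    train, test = [], []
    for value in sorted(groups):
        members = list(groups[value])
        rng.shuffle(members)
        # Assumption for this sketch: every category contributes
        # at least one row to the training side.
        cut = max(1, round(len(members) * train_fraction))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Hypothetical toy data: value "c" is rare but still ends up in training.
rows = [{"id": i, "category": v} for i, v in enumerate("aaaabbbbccdddd")]
train, test = stratified_split(rows, "category")
```

Note that this only helps when every category occurs in the data you split. A value that appears solely in data that arrives later would still be unknown to the model.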
Thanks Geo. That is what I did, but I still have the problem. Any other hints?
Another thing to do before using Weka nodes is to use the Domain Calculator node and apply it to the categorical variable. More specifically, you'll first have to concatenate the test and training sets (be sure to mark them before concatenation), apply Domain Calculator to the variable of interest, and then split the data set back into test and training sets using Row Splitter (based on the marker you added before concatenation). That should be it.
Not sure whether you can apply Domain Calculator before splitting into training and test sets. Probably depends on the kind of preprocessing that you have to perform prior to classification. E.g. if automatic variable selection is performed based on the train (learner) set, you have to split the data during the preprocessing step already, thus requiring you to concatenate them again later for Domain Calculation, then splitting again...
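The mark, concatenate, compute domain, split sequence described above can be sketched in plain Python to show the logic. This is only an illustration under assumptions: the `partition` marker column and the helper `with_marker` are made up here, and in KNIME the domain is kept as table metadata by the nodes rather than as a Python list.

```python
def with_marker(rows, marker):
    """Return copies of the rows with a 'partition' marker column added."""
    return [dict(row, partition=marker) for row in rows]


# Hypothetical data: "c" only occurs in the test set.
train = [{"category": "a"}, {"category": "b"}, {"category": "d"}]
test = [{"category": "c"}, {"category": "a"}]

# 1. Mark both sets, then concatenate them.
combined = with_marker(train, "train") + with_marker(test, "test")

# 2. Compute the domain (all possible values) on the combined table.
domain = sorted({row["category"] for row in combined})

# 3. Split again using the marker added in step 1.
train_again = [row for row in combined if row["partition"] == "train"]
test_again = [row for row in combined if row["partition"] == "test"]
```

After step 2 the domain covers all four values, even though the training rows recovered in step 3 still contain only a, b, and d.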
If I recall correctly, in Weka there is a way to specify all values of a variable at the beginning of the ARFF file. But I am not sure how to do something similar in KNIME.
Yes, there is something like that. In the category "Manipulation -> Column -> Convert & Replace" you will find the nodes "Edit Numeric Domain" and "Edit Nominal Domain".
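To illustrate why declaring all possible values up front helps, here is a small Python sketch. It is not how the KNIME nodes are implemented; the names `DOMAIN` and `one_hot` are hypothetical. The point is that an encoding built from a fixed, declared domain already has a slot for c, even if c never appeared in the training rows.

```python
# Assumption: the full set of possible values is known and declared by hand.
DOMAIN = ["a", "b", "c", "d"]


def one_hot(value, domain=DOMAIN):
    """Encode a nominal value as a 0/1 vector over the declared domain."""
    if value not in domain:
        raise ValueError(f"value {value!r} is not in the declared domain")
    return [1 if value == candidate else 0 for candidate in domain]


# "c" was never seen in training, but it still has a defined encoding.
encoded_c = one_hot("c")
```

A value outside the declared domain still raises an error, which is usually what you want: it signals that the declaration itself is incomplete.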