Thresholds within the Decision Tree Learner

I have created a text classification model. Initially, I used the Naive Bayes Learner. Unfortunately, it only lets me include 20 prediction classes, and I have 56. Does that rule it out?

So, I turned to the Decision Tree Learner. It has a high predictive accuracy on the training set. However, on the unseen set, it is making a lot of predictions which really should be NULL. The unseen set contains a lot of job titles that are not in the training set. I had hoped that the presence of words previously seen in the training set would allow for an accurate prediction. Is there a way for me to not classify a title if the score is low? I would prefer no classification to an incorrect one. That said, the model is also making a lot of accurate classifications in the unseen set. Is there a way to apply a threshold?


Thanks in advance.


Hi James,

What are you trying to predict? You said that you have 56 different classes in the target variable. What do those classes refer to? You say that some of the labels in the test set are not present in the training set. It probably makes sense to do stratified sampling on the target variable. Stratification is the process of dividing members of the population into homogeneous subgroups before sampling. You can set this type of partitioning in the Partitioning node before using the model for prediction.
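
To make the idea concrete outside of KNIME, here is a minimal sketch in plain Python of what a stratified split does. This is only an illustration, not how the Partitioning node is implemented; the function name stratified_split, its parameters, and the sample rows are all made up for the example.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, train_fraction=0.8, seed=0):
    """Split rows so that every class keeps roughly the same share in both parts."""
    rng = random.Random(seed)

    # Group the rows by their class label (the "strata").
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        # Sample within each class; keep at least one example for training.
        cut = max(1, round(len(members) * train_fraction))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# Hypothetical data: (job title, class) pairs.
rows = [("senior java developer", "IT")] * 10 + [("staff nurse", "Health")] * 5
train, test = stratified_split(rows, label_of=lambda r: r[1])
```

Because the sampling is done per class, every class that appears in the test part also appears in the training part. It does not, by itself, help with job titles whose words the model has never seen.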

Hope that helps,