I am using decision trees to predict a rare event (True at 0.5%). I have increased the proportion of the minority class in my dataset, and I would also like to add misclassification costs to the predictions to further encourage predictions of the minority class.
In my former life, I used IBM SPSS Modeler to do this by requesting this option when I built the models. I had the ability to set different penalties based on the type of misclassification, and then I could experiment to see if any produced decent models in validation.
Is there a way I can add this functionality to a KNIME Decision Tree, Random Forest, Tree Ensemble, or other model?
The KNIME native tree-based models currently don't support this, but you can do it with the H2O nodes. Please have a look at this forum post, which gives you more information about the different options with the H2O nodes:
In addition to @Kathrin's great answer, there is also the XGBoost Tree Learner in KNIME Labs, which supports:
Scale positive weight
Controls the balance of positive and negative weights, which is useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances).
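To make that formula concrete, here is a minimal plain-Python sketch of computing that ratio for a class distribution like the one in the question (0.5% positives); the label column is made up for illustration:

```python
# Hypothetical labels: 5 positives out of 1000 rows (0.5%), as in the question.
labels = [1] * 5 + [0] * 995

n_pos = sum(1 for y in labels if y == 1)
n_neg = sum(1 for y in labels if y == 0)

# Scale positive weight = sum(negative instances) / sum(positive instances).
scale_pos_weight = n_neg / n_pos
print(scale_pos_weight)  # 199.0
```

With a class this rare, the suggested starting value ends up large (here 199), which is exactly what pushes the learner toward predicting the minority class.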
Of course you could also do some over-/undersampling, SMOTE and so on. But in your case, where the minority class is so rare, you could also think about anomaly detection (DBSCAN, Isolation Forest, Autoencoder).
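For reference, random oversampling is simple enough to sketch in plain Python; the toy rows and labels below are made up, and real workflows would use the KNIME SMOTE node or equivalent rather than this by hand:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Toy unbalanced dataset: (row id, class label), 200 majority vs. 2 minority rows.
data = [("row%d" % i, 0) for i in range(200)] + [("rare%d" % i, 1) for i in range(2)]

minority = [r for r in data if r[1] == 1]
majority = [r for r in data if r[1] == 0]

# Random oversampling: draw minority rows with replacement
# until both classes are the same size (a 50/50 blend).
oversampled_minority = [random.choice(minority) for _ in range(len(majority))]
balanced = majority + oversampled_minority
```

SMOTE differs from this in that it synthesizes new minority points by interpolating between neighbors instead of duplicating rows, but the balancing idea is the same.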
If you could share the data + workflow we could give it a try.
Is there a license fee to use H2O? Or do you happen to know if it is available in a free version? Does it link to KNIME with a graphical user interface?
Interesting perspective. My incidence rate is less than 1%, so I suppose it could be a good candidate for anomaly detection! I like getting some discussion from peers in the forums for just this reason. When I am in the thick of it on my project, it's sometimes harder to see just around the corner to other useful perspectives.
Yes, I have tried oversampling to 50/50 and other oversampling proportions, undersampling, and SMOTE. But I still have a much higher misclassification rate than I would like. So I continue to look for other ways to squeeze more signal out of the data. Currently, we are getting better results from a blend of 3:1 majority to minority class. In other words, a 50/50 split produces worse results on this data than allowing the majority class to be represented as a clear majority in training.
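That 3:1 blend amounts to undersampling the majority class to three times the minority count. A minimal sketch, with made-up row counts, of what that selection looks like:

```python
import random

random.seed(0)  # fixed seed so the sampled subset is reproducible

# Stand-ins for majority- and minority-class row indices.
majority_rows = list(range(3000))
minority_rows = list(range(3000, 3050))  # 50 rare-event rows

ratio = 3  # the 3:1 majority-to-minority blend that worked best here
kept_majority = random.sample(majority_rows, ratio * len(minority_rows))
training_rows = kept_majority + minority_rows
```

Keeping some genuine majority dominance (3:1 rather than 50/50) preserves more of the real decision boundary, which may be why it outperformed the fully balanced blend on this data.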
I wish I could share the data, but it’s not that kind of a project!
@BruceJohnson for further ideas you could check out my collection about machine learning, especially the entries marked as unbalanced.
This thread was also about unbalanced data, with multiple classes, but it might still give you some ideas.
One approach could be: use H2O automated machine learning restricted to XGBoost (or GBM if you are on a Windows machine) with AUCPR as the metric.
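AUCPR makes sense here because, with 0.5% positives, accuracy and even AUROC can look good while the rare class is missed. A minimal plain-Python sketch of average precision, one common estimate of AUCPR (the scores and labels below are made up):

```python
def average_precision(scores_labels):
    """Estimate AUCPR as the mean precision at the rank of each true positive."""
    ranked = sorted(scores_labels, key=lambda sl: -sl[0])  # highest score first
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at this positive's rank
    return sum(precisions) / hits if hits else 0.0

# Toy scored predictions: (model score, true label).
ap = average_precision([(0.9, 1), (0.8, 0), (0.7, 1), (0.1, 0)])
```

Unlike accuracy, this metric only rewards the model for ranking the rare positives near the top, which is the behavior you actually want from the learner.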
You might also check your data for special anomalies, or see whether you can prepare it further. The collection also has links about that, but we would need to know more about the data.
Thanks for asking. In typical hurry-up-and-wait fashion, I was under tight pressure, and now we've moved on to other things. Thanks for your support, and if anything turns up that I can discuss publicly, I will let you know!