I am using decision trees to predict a rare event (True at 0.5%). I have increased the proportion of the minority class in my dataset, and I would also like to add misclassification costs to the predictions to further encourage predictions of the minority class.
In my former life, I used IBM SPSS Modeler to do this by requesting this option when I built the models. I had the ability to set different penalties based on the type of misclassification, and then I could experiment to see if any produced decent models in validation.
Is there a way I can add this functionality to a KNIME Decision Tree, Random Forest, Tree Ensemble, or other model?
The KNIME native tree-based models currently don't support this, but you can do it with the H2O nodes. Please have a look at this forum post, which gives you more information about the different options with the H2O nodes:
In addition to @Kathrin's great answer, there is also the XGBoost Tree Learner in KNIME Labs, which supports:
Scale positive weight
Controls the balance of positive and negative weights, which is useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances).
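To make that formula concrete, here is a minimal plain-Python sketch of computing that ratio for a class distribution like the one in the question (0.5% positives); the label column is made up for illustration:

```python
# Hypothetical labels: 5 positives out of 1000 rows (0.5%), as in the question.
labels = [1] * 5 + [0] * 995

n_pos = sum(1 for y in labels if y == 1)
n_neg = sum(1 for y in labels if y == 0)

# Scale positive weight = sum(negative instances) / sum(positive instances).
scale_pos_weight = n_neg / n_pos
print(scale_pos_weight)  # 199.0
```

With a class this rare, the suggested starting value ends up large (here 199), which is exactly what pushes the learner toward predicting the minority class.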
Of course you could also do some over-/undersampling, SMOTE and so on. But in your case, where the minority class is so rare, you could also think about anomaly detection (DBSCAN, Isolation Forest, Autoencoder).
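For reference, random oversampling is simple enough to sketch in plain Python; the toy rows and labels below are made up, and real workflows would use the KNIME SMOTE node or equivalent rather than this by hand:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Toy unbalanced dataset: (row id, class label), 200 majority vs. 2 minority rows.
data = [("row%d" % i, 0) for i in range(200)] + [("rare%d" % i, 1) for i in range(2)]

minority = [r for r in data if r[1] == 1]
majority = [r for r in data if r[1] == 0]

# Random oversampling: draw minority rows with replacement
# until both classes are the same size (a 50/50 blend).
oversampled_minority = [random.choice(minority) for _ in range(len(majority))]
balanced = majority + oversampled_minority
```

SMOTE differs from this in that it synthesizes new minority points by interpolating between neighbors instead of duplicating rows, but the balancing idea is the same.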
If you could share the data + workflow we could give it a try.
Is there a license fee to use H2O? Or do you happen to know if it is available in a free version? Does it link to KNIME with a graphical user interface?
Interesting perspective. My incidence rate is less than 1%, so I suppose it could be a good candidate for anomaly detection! I like getting some discussion from peers in the forums for just this reason. When I am in the thick of it on my project, it's sometimes harder to see just around the corner to other useful perspectives.
Yes, I have tried oversampling to 50/50 and other oversampling proportions, undersampling, and SMOTE. But I still have a much higher misclassification rate than I would like. So I continue to look for other ways to squeeze more signal out of the data. Currently, we are getting better results from a blend of 3:1 majority to minority class. In other words, a 50/50 split produces worse results on this data than allowing the majority class to be represented as a clear majority in training.
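That 3:1 blend amounts to undersampling the majority class to three times the minority count. A minimal sketch, with made-up row counts, of what that selection looks like:

```python
import random

random.seed(0)  # fixed seed so the sampled subset is reproducible

# Stand-ins for majority- and minority-class row indices.
majority_rows = list(range(3000))
minority_rows = list(range(3000, 3050))  # 50 rare-event rows

ratio = 3  # the 3:1 majority-to-minority blend that worked best here
kept_majority = random.sample(majority_rows, ratio * len(minority_rows))
training_rows = kept_majority + minority_rows
```

Keeping some genuine majority dominance (3:1 rather than 50/50) preserves more of the real decision boundary, which may be why it outperformed the fully balanced blend on this data.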
I wish I could share the data, but it’s not that kind of a project!
@BruceJohnson for further ideas you could check out my collection about machine learning, especially the entries marked as unbalanced.
This thread was also about unbalanced data, with multiple classes, but it might still give you some ideas.
One approach could be: use H2O automated machine learning restricted to XGBoost (or GBM if you are on a Windows machine) with AUCPR as the metric.
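AUCPR makes sense here because, with 0.5% positives, accuracy and even AUROC can look good while the rare class is missed. A minimal plain-Python sketch of average precision, one common estimate of AUCPR (the scores and labels below are made up):

```python
def average_precision(scores_labels):
    """Estimate AUCPR as the mean precision at the rank of each true positive."""
    ranked = sorted(scores_labels, key=lambda sl: -sl[0])  # highest score first
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at this positive's rank
    return sum(precisions) / hits if hits else 0.0

# Toy scored predictions: (model score, true label).
ap = average_precision([(0.9, 1), (0.8, 0), (0.7, 1), (0.1, 0)])
```

Unlike accuracy, this metric only rewards the model for ranking the rare positives near the top, which is the behavior you actually want from the learner.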
You might also check your data for special anomalies, or see whether you can prepare it further. The collection also has links about that, but we would need to know more about the data.
Thanks for asking. In typical hurry-up-and-wait fashion, I was under tight pressure, and now we've moved on to other things. Thanks for your support, and if anything turns up that I can discuss publicly, I will let you know!