Large Unbalanced Data

User10000 · February 12, 2022, 12:21am

Hi I have been trying to build a model with a very large set of unbalanced data. It is confidential data and cannot be posted. I have cleaned the data and implemented SMOTE to balance all the minority categories in the model. The target column consists of a large set of values (about 15) so I have also used stratified row sampling to be able to work with the data.

What would be a recommended node to use for a predictive model in this situation? I have been working with XGBoosted trees and random forest and have not been able to optimize the parameters for those models in this situation. Does anyone have examples that are similar to this or suggested model/settings on the model to use?

mlauber71 · February 12, 2022, 6:36am

@User10000 welcome to the KNIME forum. If I understand correctly you have a large dataset with a multiclass task (15? possible targets). So one class you do not want (number 16?).

This might be tough if not impossible, I fear. Things you could still try:

combine some of the targets so the classes get reduced
work with missclassification costs (you will have to try a few things. Maybe try to configure this weka node). Check this article though it does not cover imbalanced multi class
Try to weight the classes (also this)
try logloss as your metric. While I fear with your setting it will not help very much. Try combining it with the other efforts

The H2O automl node offers several techniques to handle imbalanced data, read about them and try a few things:

Read about unbalanced debates in the KNIME forum. Try to individually predict your 15 target values as a 0/1 binary model, maybe using AUCPR as the metric (if this brings results you will have to normalize the resulting scores and try to make a decision which prediction is the right one).

Most likely this will mean that smote will not help you very much.

Read the links marked unbalanced:

Having said all this. It is entirely possible that your setting will not have enough signals to solve the problem. So you might have to go back to the business side and try to figure out other strategies:

get more and better data
discuss what might be done with ‘weak’ predictions. Could you put them on some sort of watchlist

More often than not businesses would not want to invest in good old and boring data collection, cleaning and understanding and then hope for some magic AI solution. And even harder than collecting data: think about what to do with processes that fancy AI should serve.

Maybe you can tell us some more about you task without spelling any secrets. If you could find or construct a similar example we might try a few things.

kienerj · February 14, 2022, 6:21am

For xgboost instead of SMOTE (which I really don’t like as even when using it correctly I doubt it is actually useful) use weights.

One option is to create 15 binary models for each class. Maybe that way it works better or you find you can at least identify 1 or 2 of the classes “good enough”.

User10000 · February 14, 2022, 6:48am

This was very helpful, thank you! I think sticking to XGBoost and allowing that to handle the weight of the categories (without SMOTE) is the way to go. Using SMOTE and Random Forest is also giving me good results, but not as much as XGBoost.

User10000 · February 14, 2022, 6:48am

I will try both. Thank you!

mlauber71 · February 14, 2022, 8:02am

@User10000 XGBoost is currently a type of model that works for a lot of tasks and is widely used. Waht you also could do is try “H2O AutoML Learner” and only let it run XGBoost models so you will get an idea where one of the most advanced environments would take you. Currently H2O’s XGBooost is only available on MacOS and Linux based systems (not Windows) unfortunately. GBM might come close.

For directly trying to predict your multiclass you might go for LogLoss metric.

If you try that keep in mind that when comparing your individual models they might have very different baselines of % and therefor different levels of initial scores (that is % of expected targets). SO if you want to compare them later you might have to normalize them, maybe transfer the score into a rank between 0 and 100.

If you try SMOTE there also is ROSE, though I do not have a wider variety of experience with that

system · May 15, 2022, 8:03am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.