Hi I have been trying to build a model with a very large set of unbalanced data. It is confidential data and cannot be posted. I have cleaned the data and implemented SMOTE to balance all the minority categories in the model. The target column consists of a large set of values (about 15) so I have also used stratified row sampling to be able to work with the data.
What would be a recommended node to use for a predictive model in this situation? I have been working with XGBoosted trees and random forest and have not been able to optimize the parameters for those models in this situation. Does anyone have examples that are similar to this or suggested model/settings on the model to use?
try logloss as your metric. While I fear with your setting it will not help very much. Try combining it with the other efforts
The H2O automl node offers several techniques to handle imbalanced data, read about them and try a few things:
Read about unbalanced debates in the KNIME forum. Try to individually predict your 15 target values as a 0/1 binary model, maybe using AUCPR as the metric (if this brings results you will have to normalize the resulting scores and try to make a decision which prediction is the right one).
Most likely this will mean that smote will not help you very much.
Read the links marked unbalanced:
Having said all this. It is entirely possible that your setting will not have enough signals to solve the problem. So you might have to go back to the business side and try to figure out other strategies:
get more and better data
discuss what might be done with ‘weak’ predictions. Could you put them on some sort of watchlist
More often than not businesses would not want to invest in good old and boring data collection, cleaning and understanding and then hope for some magic AI solution. And even harder than collecting data: think about what to do with processes that fancy AI should serve.
Maybe you can tell us some more about you task without spelling any secrets. If you could find or construct a similar example we might try a few things.
This was very helpful, thank you! I think sticking to XGBoost and allowing that to handle the weight of the categories (without SMOTE) is the way to go. Using SMOTE and Random Forest is also giving me good results, but not as much as XGBoost.
@User10000 XGBoost is currently a type of model that works for a lot of tasks and is widely used. Waht you also could do is try “H2O AutoML Learner” and only let it run XGBoost models so you will get an idea where one of the most advanced environments would take you. Currently H2O’s XGBooost is only available on MacOS and Linux based systems (not Windows) unfortunately. GBM might come close.
For directly trying to predict your multiclass you might go for LogLoss metric.
If you try that keep in mind that when comparing your individual models they might have very different baselines of % and therefor different levels of initial scores (that is % of expected targets). SO if you want to compare them later you might have to normalize them, maybe transfer the score into a rank between 0 and 100.
If you try SMOTE there also is ROSE, though I do not have a wider variety of experience with that