Large Unbalanced Data

@User10000 welcome to the KNIME forum. If I understand correctly, you have a large dataset with a multiclass task (15 possible targets?) plus one class you do not want (number 16?).

This might be tough if not impossible, I fear. Things you could still try:

  • combine some of the targets so the number of classes is reduced
  • work with misclassification costs (you will have to try a few things; maybe try to configure this Weka node). Check this article, though it does not cover imbalanced multiclass problems
  • try to weight the classes (also this)
  • try logloss as your metric, although I fear that in your setting it will not help very much on its own. Try combining it with the other efforts
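To make the class weighting and logloss ideas a bit more concrete, here is a minimal sketch in plain Python. It assumes the common "balanced" heuristic (weight = n_samples / (n_classes * class_count)); the function names and the toy labels are made up for illustration and are not taken from any KNIME node.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get a weight above 1, frequent classes a weight below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_log_loss(labels, probabilities, weights, eps=1e-15):
    """Multiclass log loss where each row counts with the weight of its true class.

    probabilities: one dict per row, mapping class -> predicted probability.
    """
    total, weight_sum = 0.0, 0.0
    for y, probs in zip(labels, probabilities):
        # Clip so that log(0) cannot occur.
        p = min(max(probs.get(y, 0.0), eps), 1.0 - eps)
        total += -weights[y] * math.log(p)
        weight_sum += weights[y]
    return total / weight_sum


# Toy data (hypothetical): 8 rows of class "a", 2 rows of class "b".
labels = ["a"] * 8 + ["b"] * 2
weights = balanced_class_weights(labels)

# A lazy model that always favours the majority class.
lazy_predictions = [{"a": 0.9, "b": 0.1} for _ in labels]

uniform = {cls: 1.0 for cls in weights}
unweighted = weighted_log_loss(labels, lazy_predictions, uniform)
weighted = weighted_log_loss(labels, lazy_predictions, weights)
print(weights, unweighted, weighted)
```

With the weights applied, the lazy majority-class model is punished more for its mistakes on the rare class, which is the effect you want when comparing models on imbalanced data.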

The H2O AutoML node offers several techniques to handle imbalanced data. Read about them and try a few things.

Read the discussions about unbalanced data in the KNIME forum. Try to predict each of your 15 target values individually as a 0/1 binary model, maybe using AUCPR as the metric (if this brings results, you will have to normalize the resulting scores and decide which prediction is the right one).
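A rough sketch of what the one-model-per-target idea could look like once you have the binary scores, again in plain Python. The average precision here is a step-wise approximation of AUCPR, and dividing by the row sum is just one simple way to normalize (it assumes non-negative scores). All names and numbers are hypothetical.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve for one 0/1 model."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank of this positive
    return total / n_pos


def normalize_scores(raw_scores):
    """Rescale one row's per-class scores so they sum to 1.

    Assumption: scores are non-negative (e.g. predicted probabilities).
    """
    score_sum = sum(raw_scores.values())
    if score_sum == 0:
        return {cls: 1.0 / len(raw_scores) for cls in raw_scores}
    return {cls: value / score_sum for cls, value in raw_scores.items()}


def decide(raw_scores):
    """Pick the class with the highest normalized score for one row."""
    normalized = normalize_scores(raw_scores)
    best = max(normalized, key=normalized.get)
    return best, normalized[best]


# Evaluate one hypothetical binary model (target vs. rest).
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])

# Combine the scores of three hypothetical binary models for a single row.
winner, confidence = decide({"class_1": 0.2, "class_2": 0.6, "class_3": 0.2})
print(ap, winner, confidence)
```

If the winning normalized score is low, that row is a candidate for the "weak prediction" handling rather than a hard decision.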

Most likely this will mean that SMOTE will not help you very much.

Read the links marked unbalanced:

Having said all this, it is entirely possible that your setting does not have enough signal to solve the problem. So you might have to go back to the business side and try to figure out other strategies:

  • get more and better data
  • discuss what might be done with ‘weak’ predictions. Could you put them on some sort of watchlist?

More often than not, businesses do not want to invest in good old, boring data collection, cleaning and understanding, and instead hope for some magic AI solution. And even harder than collecting data is thinking about what to do with the processes that the fancy AI should serve.

Maybe you can tell us some more about your task without spilling any secrets. If you could find or construct a similar example, we might try a few things.
