Unbalanced Dataset

mlauber71 · March 30, 2021, 7:41am

@w0rdz this really sounds like an unbalanced dataset, and it could be difficult to come up with a perfect solution. It may well depend on the strength of the ‘signals’ within the data and how they are correlated with what you want to predict (unfortunately, in a lot of cases the signals are not that strong; otherwise some simple rule might work).

You could check out the [unbalanced] section of this machine learning collection with some links and discussions.

Another idea would be to explore Kaggle cases about fraud detection, which are typically highly imbalanced. SMOTE is one method for addressing such a problem, but discussions in this forum have found it to be insufficient at times.
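
To give a rough idea of what SMOTE does, here is a minimal sketch in plain Python. It is a simplified illustration of the core idea (new minority rows are interpolated between an existing minority row and one of its nearest minority neighbours) and not the implementation from any library. The function name `smote_like_oversample`, its parameters and the toy values are made up for this example.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows.

    minority: list of equal-length numeric feature lists (minority class only).
    This is a simplified sketch of the SMOTE idea, not a library implementation.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority row as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours (excluding the row itself).
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new row somewhere on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class, e.g. three "fraud" rows (values are invented).
fraud = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2]]
new_rows = smote_like_oversample(fraud, n_new=5)
print(len(new_rows))
```

Since the synthetic rows always lie between existing minority rows, this adds no new information about the rare class. That may be one reason why it does not help much when the signals in the data are weak.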