Unbalanced Dataset


I am trying to build a predictive model using a random forest for classification. Unfortunately, my data is very unbalanced: the minority class has 400 records vs. 140,000 records in the majority class. I tried using SMOTE to oversample the minority class, but because I have categorical variables, I cannot use SMOTE.

Is this a lost cause at this point or is there something else I could do?

I thought of removing records from the majority class, but I’m assuming this is not a good idea.
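For reference, removing records from the majority class (random undersampling) is a common baseline and can be sketched as below. This is a minimal, dependency-free illustration, not a recommendation: the function name `undersample_majority`, the `ratio` parameter, and the toy rows are all hypothetical, and only the class sizes (400 vs. 140,000) come from the question. The obvious cost is that most majority records are thrown away.

```python
import random


def undersample_majority(rows, labels, majority_label, ratio=1.0, seed=0):
    """Keep every minority row plus a random subset of majority rows.

    ratio is the number of majority rows kept per minority row
    (hypothetical parameter, chosen for this sketch).
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    minority_idx = [i for i, y in enumerate(labels) if y != majority_label]

    # Never ask for more majority rows than actually exist.
    n_keep = min(len(majority_idx), int(ratio * len(minority_idx)))
    kept = minority_idx + rng.sample(majority_idx, n_keep)
    rng.shuffle(kept)  # avoid having all minority rows first

    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data with the class sizes from the question: 400 positives, 140000 negatives.
labels = [1] * 400 + [0] * 140000
rows = [{"id": i} for i in range(len(labels))]

X_small, y_small = undersample_majority(rows, labels, majority_label=0, ratio=5.0)
print(len(y_small), sum(y_small))  # total rows kept, minority rows kept
```

With `ratio=5.0` the sketch keeps all 400 minority records and 2,000 randomly chosen majority records. Any evaluation should still be done on data with the original class balance, since the resampled set no longer reflects the real class frequencies.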




@w0rdz this really does sound like an unbalanced dataset, and it could be difficult to come up with a perfect solution. It may well depend on the strength of the ‘signals’ within the data and how they correlate with what you want to predict (unfortunately, in a lot of cases the signals are not that strong; otherwise some simple rule might work).

You could check out the [unbalanced] section of this machine learning collection with some links and discussions.

Another idea would be to explore Kaggle cases about fraud detection, which are typically highly imbalanced. SMOTE is one method to try to address such a problem, but at least some discussions in this forum found it to be insufficient at times.


What about applying LabelEncoder or something like that to make them numeric?
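A rough sketch of what that encoding does, written in plain Python so it runs without extra packages. The helper names `fit_label_encoding` and `one_hot` and the `colors` column are made up for illustration; the sorted-categories-to-integers mapping mirrors how scikit-learn's LabelEncoder assigns codes. One caveat worth keeping in mind: integer codes impose an arbitrary order on the categories, and SMOTE interpolates between samples, so it can produce values such as 0.5 that correspond to no category. One-hot columns avoid the artificial order but are still interpolated. The imbalanced-learn package also offers SMOTENC, which is meant for data mixing categorical and continuous features, and may be a better fit here.

```python
def fit_label_encoding(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def one_hot(values, mapping):
    """Expand categories into 0/1 indicator columns, one column per category."""
    width = len(mapping)
    encoded = []
    for value in values:
        row = [0] * width
        row[mapping[value]] = 1
        encoded.append(row)
    return encoded


# Hypothetical categorical column, for illustration only.
colors = ["red", "green", "blue", "green"]

mapping = fit_label_encoding(colors)       # {'blue': 0, 'green': 1, 'red': 2}
codes = [mapping[c] for c in colors]       # [2, 1, 0, 1]
indicators = one_hot(colors, mapping)      # one 0/1 column per category

print(mapping)
print(codes)
print(indicators)
```

A random forest can usually split on either representation, so the encoding mainly matters for the oversampling step.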

