I have a question concerning treatment of unbalanced data.
I am trying to train a model that detects corporate credit issues with a Random Forest in KNIME. After cleaning the data I have enough records with a couple of features, but with a proportion of 1 issue : 250 not-issue.
Without any modification I get high (99%) accuracy, but everything is categorized as “not-issue” (no detection of “issue” in the validation dataset) - no surprise, I suppose.
I have a few features which clearly have different characteristics in the “issue” and “not-issue” groups - for example turnover on accounts: looking at the average change, there is a decrease in the “issue” group and a slight increase in the “not-issue” pool.
Is it in line with good practice to get rid of “not-issue” records using certain triggers - for example, delete records with a decrease in turnover from the “not-issue” group, since that is expected (more common) in the “issue” case? After such filtering I have more balanced data = 1 issue : 40 not-issue. The outcome is also significantly better = 80% of “issues” classified correctly as “issues” on the validation dataset.
Do you know any other good practices for treating unbalanced datasets? And if my approach is OK - how do I find the best trigger level for filtering the “not-issue” group?
This is a classic rare-event problem. Instead of deleting data from your “not-issue” group - which is almost certainly going to cause problems down the line - I would recommend some type of oversampling strategy for your “issue” group.
Here’s a thread from last month that describes the SMOTE node in KNIME, and how it could be implemented. Of course there are other ways to deal with this type of problem, but SMOTE might be a good place to start.
Here’s a more general article about unbalanced datasets in machine learning that is focused on implementation in R. It touches on SMOTE as well.
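To make the idea concrete, here is a minimal sketch of what SMOTE does under the hood - not the KNIME node or the reference implementation, just the core interpolation step, written in plain NumPy with illustrative names:

```python
import numpy as np

def smote_sketch(X_minority, n_synthetic, k=3, rng=None):
    """Create synthetic minority samples by interpolating between a
    minority point and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_minority)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)
        # distances from point i to every minority point
        d = np.linalg.norm(X_minority - X_minority[i], axis=1)
        # one of the k nearest neighbours, excluding the point itself
        j = rng.choice(np.argsort(d)[1:k + 1])
        # place the new point somewhere on the segment between the two
        gap = rng.random()
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# five hypothetical “issue” rows with two features
X_issue = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1], [1.1, 2.2], [1.3, 1.9]])
X_new = smote_sketch(X_issue, n_synthetic=10, rng=42)
print(X_new.shape)  # (10, 2)
```

Because each synthetic row is a blend of two real minority rows, it stays inside the minority region rather than just duplicating existing points - that is the advantage over plain random oversampling.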
Just to be aware: SMOTE should only be applied to the training set, and the synthetic rows must never end up in the test set.
Also, SMOTE doesn’t scale well, especially with many features, so be sure to check performance early on with the full dataset.
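The ordering matters: split first, then oversample only the training part. A small sketch of that workflow (using plain random duplication of minority rows as a stand-in for SMOTE - the point here is the order of operations, and all names are illustrative):

```python
import numpy as np

def split_then_oversample(X, y, test_frac=0.3, rng=None):
    """Hold out a test set FIRST, then balance the remaining training
    rows by duplicating minority-class (label 1) examples."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    X_tr, y_tr = X[train], y[train]
    # oversample the minority class in the training set only;
    # the test set keeps the original, untouched class ratio
    minority = np.flatnonzero(y_tr == 1)
    majority = np.flatnonzero(y_tr == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([np.arange(len(y_tr)), extra])
    return X_tr[keep], y_tr[keep], X[test], y[test]
```

Doing it the other way round (oversample, then split) leaks copies of the same minority rows into both sets and makes the validation numbers look better than they really are.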
You could alternatively use the Tree Ensemble Learner, which is more or less a random forest with more options. There you can configure the Data Sampling of rows - i.e. the fraction of rows drawn per tree - and set the sampling to stratified, to ensure each tree gets to see some rows of the minority class.
Sadly, the learner nodes do not support sample weights. That would be a useful feature for such cases, especially for gradient boosting. The only option is to use the Python nodes with xgboost, but that is pretty cumbersome.
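If you do go the Python route, xgboost can also compensate for the imbalance without any resampling via its `scale_pos_weight` parameter, commonly set to the negative-to-positive ratio. A sketch of computing it (the actual training call is commented out since it assumes xgboost is installed; variable names are illustrative):

```python
# Assumes labels are 0 (“not-issue”) and 1 (“issue”).
def scale_pos_weight(labels):
    """Negative-to-positive ratio, the usual starting value for
    xgboost's scale_pos_weight on imbalanced binary problems."""
    pos = sum(1 for label in labels if label == 1)
    neg = len(labels) - pos
    return neg / pos

labels = [0] * 250 + [1]
w = scale_pos_weight(labels)
print(w)  # 250.0

# With xgboost available (not run here):
# import xgboost as xgb
# model = xgb.XGBClassifier(scale_pos_weight=w)
# model.fit(X_train, y_train)
```

This effectively up-weights every “issue” row in the loss instead of duplicating rows, which sidesteps the leakage concerns around oversampling entirely.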