Unbalanced data - good practice

ScottF · April 16, 2018, 2:10pm

This is a classic rare-event problem. Instead of deleting data from your “not-issue” group, which throws away real observations and is almost certainly going to cause problems down the line, I would recommend some type of oversampling strategy for your “issue” group.

Here’s a thread from last month that describes the SMOTE node in KNIME, and how it could be implemented. Of course there are other ways to deal with this type of problem, but SMOTE might be a good place to start.
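
For anyone curious what SMOTE does under the hood, here is a rough sketch of the core idea in plain Python: pick a minority-class ("issue") row, pick one of its nearest minority-class neighbours, and create a new row somewhere on the line between them. This is only an illustration, not the KNIME node's implementation. The function name `smote_sketch`, the parameters `k` and `seed`, and the toy `issue` rows are all made up for the example.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority rows by interpolating between neighbours.

    Hypothetical helper for illustration; assumes numeric features only.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other rows
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority row as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        # Choose one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Place the new row at a random position between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy "issue" rows with two numeric features (made-up values).
issue = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_points = smote_sketch(issue, n_synthetic=6, k=2)
```

Because every synthetic row lies between two existing "issue" rows, the new rows stay inside the region the minority class already occupies rather than being exact duplicates. In practice you would apply this only to the training partition, and use a maintained implementation (such as the KNIME node) rather than a hand-rolled one.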

Here’s a more general article about unbalanced datasets in machine learning that is focused on implementation in R. It touches on SMOTE as well.

https://shiring.github.io/machine_learning/2017/04/02/unbalanced