For unbalanced data, you might want to look at this article and at the hints KNIME team members gave in previous threads, especially concerning SMOTE.
I then added another balancing attempt using R and the ROSE algorithm, although I am a little wary about using it. Instead of fully balancing your dataset, you might consider bringing the minority group up to about 10% and then looking at AUC and other metrics, not just the Scorer, which would count everything above 0.5 as a success.
Another thing you could try is to use some of the H2O nodes, which offer balancing settings:
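To make the "bring the minority to about 10% and look at AUC" idea concrete, here is a minimal pure-Python sketch. It is not what any KNIME node or ROSE does internally; the function names `undersample_to_share` and `auc`, the toy data, and the choice of random undersampling of the majority class are all assumptions made for illustration.

```python
import random

def undersample_to_share(rows, labels, minority_label, target_share=0.10, seed=42):
    """Keep all minority rows and randomly drop majority rows until the
    minority class makes up roughly `target_share` of the data."""
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Number of majority rows that leaves the minority at target_share.
    n_major = round(len(minority) * (1 - target_share) / target_share)
    n_major = min(n_major, len(majority))
    rng = random.Random(seed)
    keep = sorted(minority + rng.sample(majority, n_major))
    return [rows[i] for i in keep], [labels[i] for i in keep]

def auc(scores, labels, positive_label=1):
    """Rank-based AUC: the probability that a random positive row gets a
    higher score than a random negative row (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == positive_label]
    neg = [s for s, y in zip(scores, labels) if y != positive_label]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): 20 minority rows among 1000, i.e. a 2% share.
labels = [1] * 20 + [0] * 980
rows = list(range(1000))
new_rows, new_labels = undersample_to_share(rows, labels, minority_label=1)
share = sum(new_labels) / len(new_labels)
print(len(new_labels), share)

# AUC does not depend on a 0.5 cutoff, only on how the scores are ranked.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

Only the training data should be resampled this way; the metrics are best computed on a test set that keeps the original class distribution.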
Hi @malik ,
there are basically two ways.
The first way is to balance the data before converting it to an H2O data frame. You can easily find more information about this in the forum. See, e.g., Unbalanced data - good practice and SMOTE for further details.
The second way is the “Balance classes” option shown in your screenshot. By checking this option, H2O will automatically balance the classes. With the setting “Define max relative number of rows after balancing”, you can control how much bal…
You might also see what H2O AutoML does with your data and whether it comes up with some solutions. It also allows for balancing, although I have never tried it:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/balance_classes.html
Imbalanced Data : How to handle Imbalanced Classification Problems
SMOTE Hints from KNIME Team members
Hi samer_aamar,
Would it be possible to give more details on the problem you are trying to solve? What is the classification problem that you are working on? Which kind of data are available in the dataset?
Most machine learning algorithms do not work very well with unbalanced datasets. That is why it is better to identify a strategy for handling them.
Moreover, when you want to evaluate the performance of the models in these cases, you may want to use the following metrics:
Preci…
Hi @montecarlo ,
I’ll just add a few KNIME-specific comments here as well; perhaps they will help you start your search!
Some common options for dealing with unbalanced data like this, as the article @HansS linked suggests, include over-sampling your minority class, under-sampling your majority class, or adjusting the classification threshold of your model.
1) You may try oversampling your minority class with the SMOTE node; this generates new artificial data points instead of just re-sampling.
h…
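The idea of generating artificial minority points rather than duplicating rows can be sketched as follows. This is a simplified SMOTE-style interpolation written for illustration only, not the implementation used by the KNIME SMOTE node; the function name `smote_like`, the parameter `k`, and the toy points are assumptions.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each placed on the line segment
    between a random minority point and one of its k nearest minority
    neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest neighbours of `base` among the other minority points.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class points with two numeric features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
for point in new_points:
    print(point)
```

Because every synthetic point lies between two existing minority points, the new rows stay inside the region the minority class already occupies, which is the main difference from plain re-sampling with duplicates.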
Hello zizoo,
the issue in this case is probably the metric.
With unbalanced data, accuracy often suggests a better generalization performance than the model actually has. I would therefore recommend monitoring precision and recall of the minority class, as these metrics usually give you a better idea of what your model is actually doing.
Concerning the SMOTE node, I’d recommend oversampling only the minority class if your dataset is unbalanced, because otherwise it won’t remedy the imbalance in your data.
Please n…
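A small worked example shows why accuracy can mislead here. It is a sketch on made-up labels, and the helper name `minority_metrics` is hypothetical: a model that never predicts the minority class still reaches 95% accuracy, while its precision and recall for the minority class are zero.

```python
def minority_metrics(y_true, y_pred, minority_label=1):
    """Return (accuracy, precision, recall), where precision and recall
    are computed for the minority class only."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == minority_label and p == minority_label for t, p in pairs)
    fp = sum(t != minority_label and p == minority_label for t, p in pairs)
    fn = sum(t == minority_label and p != minority_label for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up test set: 95 majority rows (0) followed by 5 minority rows (1).
y_true = [0] * 95 + [1] * 5

# Model A always predicts the majority class.
model_a = [0] * 100
acc_a, prec_a, rec_a = minority_metrics(y_true, model_a)

# Model B finds 4 of the 5 minority rows but raises 10 false alarms.
model_b = [1] * 10 + [0] * 85 + [1] * 4 + [0] * 1
acc_b, prec_b, rec_b = minority_metrics(y_true, model_b)

print("A:", acc_a, prec_a, rec_a)
print("B:", acc_b, prec_b, rec_b)
```

Model B has the lower accuracy, yet it is the only one of the two that detects any minority rows, which is exactly what precision and recall make visible.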
Try ROSE algorithm