I have a very unbalanced dataset (binary class: only 1.5% of the samples are positive, all the rest are negative).
To build a model, I constructed the following two flows.
I have a gut feeling that Flow 2 is wrongly designed, but I am not sure why,
so I cannot explain why the results are so different.
I would appreciate help understanding what I did wrong here.
The results I got are as follows (averaged over the iterations):
Flow 1: (AttributeSelectedClassifier + SMOTE):
Loop 20 -> Split 80% -> SMOTE on training data -> Train using AttributeSelectedClassifier -> End Loop -> print AVG of collected scores
Cohen's Kappa: 0.044
Flow 2: (AttributeSelectedClassifier + Equalizer):
Loop 20 -> Equalize Sample -> Split 80% -> Train using AttributeSelectedClassifier -> End Loop -> print AVG of collected scores
Cohen's Kappa: 0.508
* the configuration of the AttributeSelectedClassifier in both flows is the same:
* in Flow 1, I run SMOTE on the minority class (the positives) so that it ends up the same size as the negatives
* I tried to change the AttributeSelectedClassifier to RandomForest and got similar results
* The equalizer in Flow 2 basically under-samples the big class (the negatives) by randomly selecting about as many negative instances as there are positives.
Would it be possible to give more details on the problem you are trying to solve? What is the classification problem that you are working on? Which kind of data are available in the dataset?
Most machine learning algorithms do not work very well with unbalanced datasets. That is why it is better to pick an explicit strategy for handling the imbalance.
Moreover, when you want to evaluate the performance of the models in these cases, you may want to use the following metrics:
- Precision (positive predictive value): how many selected instances are relevant.
- Recall/Sensitivity: how many relevant instances are selected.
- AUC: the area under the ROC curve, which plots the true-positive rate against the false-positive rate.
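The performance of machine learning algorithms is typically evaluated using predictive accuracy. With strongly unbalanced classes this is generally not appropriate: with only 1.5% positives, a model that predicts "negative" for every row already reaches 98.5% accuracy.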
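To illustrate the point about accuracy, here is a minimal plain-Python sketch (not a KNIME workflow). The 1000-row toy labels and the "always negative" classifier are assumptions made up for the illustration; only the 1.5% positive rate comes from the question.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def metrics(y_true, y_pred):
    """Accuracy, precision, recall, specificity and Cohen's kappa."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "kappa": kappa,
    }


# Toy data (assumption): 1000 rows, 15 positives, model always says "negative".
y_true = [1] * 15 + [0] * 985
y_pred = [0] * 1000
m = metrics(y_true, y_pred)
print(m)
```

Accuracy comes out at 0.985 although not a single positive is found: recall and Cohen's kappa are both 0.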
In general, when you use techniques such as SMOTE, I would suggest first partitioning your dataset and then applying SMOTE only to the training set.
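A rough sketch of that order of operations in plain Python, in case it helps. This is not the KNIME SMOTE node: `stratified_split` and `smote_like` are hypothetical helpers written for this example, the two-feature Gaussian data is made up, and the interpolation is only a simplified version of the SMOTE idea.

```python
import random


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(round(train_fraction * len(idx)))
        train += idx[:cut]
        test += idx[cut:]
    return train, test


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (simplified SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [r for r in minority if r is not base]
        others.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy data (assumption): 1000 rows, 1.5% positives, two numeric features.
data_rng = random.Random(42)
labels = [1] * 15 + [0] * 985
rows = [[data_rng.gauss(2.0 * y, 1.0), data_rng.gauss(2.0 * y, 1.0)] for y in labels]

# 1) Partition first ...
train_idx, test_idx = stratified_split(labels)
train_rows = [rows[i] for i in train_idx]
train_labels = [labels[i] for i in train_idx]
test_labels = [labels[i] for i in test_idx]

# 2) ... then oversample inside the training partition only.
minority = [r for r, y in zip(train_rows, train_labels) if y == 1]
n_new = train_labels.count(0) - len(minority)
synthetic = smote_like(minority, n_new)
balanced_rows = train_rows + synthetic
balanced_labels = train_labels + [1] * n_new

print(balanced_labels.count(0), balanced_labels.count(1))
print(len(test_labels), test_labels.count(1))
```

The training partition ends up balanced, while the test partition keeps the original class ratio, so the scores are measured on data that looks like what the model will see later.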
I would also suggest having a look at the following paper: https://www.jair.org/media/953/live-953-2037-jair.pdf.
Hope this is helpful,
Equal size sampling in my opinion isn't very useful, as you lose a lot of data. SMOTE doesn't scale well if you have many features, and it simply doesn't work at all if you have ordinal features (integers which are either counts or "numerized categories").
In KNIME's Random Forest / Tree Ensemble nodes, under Ensemble Configuration, you can set the Data Sampling mode to Stratified. This ensures that each tree gets to see at least some rows of the minority class.
What KNIME lacks entirely is support for class weights. What you could do instead is simply duplicate rows of the minority class (oversampling). Not ideal, but it might work (duplicate with the Concatenate node, as there is no oversampling node). The duplication must of course happen on the training set only; otherwise you skew your cross-validation results.
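Outside of KNIME, the same idea looks roughly like this in plain Python. It is only a sketch: `oversample_by_duplication` is a hypothetical helper, and the rows are just made-up integer IDs so that it is easy to check that nothing from the test partition leaks into training.

```python
import random


def oversample_by_duplication(rows, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    missing = (len(rows) - len(minority)) - len(minority)
    if not minority or missing <= 0:
        return list(rows), list(labels)
    extra = [rng.choice(minority) for _ in range(missing)]
    return list(rows) + extra, list(labels) + [minority_label] * len(extra)


# Toy data (assumption): row IDs 0..14 are positives, 15..999 are negatives.
rng = random.Random(1)
positives = list(range(15))
negatives = list(range(15, 1000))
rng.shuffle(positives)
rng.shuffle(negatives)

# Split 80/20 per class first ...
n_pos = len(positives) * 4 // 5
n_neg = len(negatives) * 4 // 5
train_rows = positives[:n_pos] + negatives[:n_neg]
train_labels = [1] * n_pos + [0] * n_neg
test_rows = positives[n_pos:] + negatives[n_neg:]
test_labels = [1] * (len(positives) - n_pos) + [0] * (len(negatives) - n_neg)

# ... and duplicate minority rows in the training partition only.
balanced_rows, balanced_labels = oversample_by_duplication(train_rows, train_labels)

print(balanced_labels.count(0), balanced_labels.count(1))
print(set(test_rows).isdisjoint(balanced_rows))
```

If the duplication were done before the split, copies of the same positive row could land in both partitions, which is exactly the skew mentioned above.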
Another option is to simply adjust the classification threshold, or to ditch the hard classification and work with the confidence output of the model. This depends on the use case; it works best when you need to prioritize work: rank the rows by the confidence for the positive class and handle the highest ones first.
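A small plain-Python sketch of both variants. The row IDs, the confidence values and the 0.2 threshold are invented for the example; in practice the scores would be the positive-class confidence column of your predictor, and the threshold would be tuned on validation data.

```python
def classify(scores, threshold=0.5):
    """Turn positive-class confidences into 0/1 predictions."""
    return [1 if s >= threshold else 0 for s in scores]


def rank_by_confidence(ids, scores):
    """Return the ids ordered from highest to lowest positive-class confidence."""
    return [i for i, _ in sorted(zip(ids, scores), key=lambda p: p[1], reverse=True)]


# Made-up confidences for the positive class.
ids = ["a", "b", "c", "d", "e"]
scores = [0.08, 0.35, 0.62, 0.21, 0.02]

default_pred = classify(scores)                 # default 0.5 cut-off
lowered_pred = classify(scores, threshold=0.2)  # more positives are flagged
ranking = rank_by_confidence(ids, scores)       # work through this list top-down

print(default_pred)
print(lowered_pred)
print(ranking)
```

Lowering the threshold trades precision for recall, while the ranking avoids picking a threshold at all.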