I’ve created a model for imbalanced data using the SMOTE node and trained it with the XGBoost Tree Ensemble learner. When I partition the data (cross-validation) to check how the model predicts, I get fairly good results, up to 99.998% accuracy. However, when I create a separate workflow and predict there using the XGBoost Predictor, I can’t get any proper classification, even on the dataset I used for training, which makes no sense. Has anyone ever had a similar issue? Any help is much appreciated.
I use the Model Writer and Model Reader nodes; the data manipulation nodes are identical in both workflows.
For SMOTE, the number of nearest neighbors was set to 5 and the Oversample Minority Classes box was checked. No static seed was set, since I want to see performance on different subsets.
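For readers unfamiliar with what the SMOTE node does with those settings, here is a minimal numpy sketch of the underlying idea (synthetic minority samples interpolated between a point and one of its k nearest minority neighbors, k=5 as configured). This is an illustration of the algorithm, not the KNIME node's exact implementation; all names here are my own.

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    between each base sample and one of its k nearest minority
    neighbours (the core idea behind SMOTE, k=5 as in the node)."""
    rng = np.random.default_rng(rng)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # never pick yourself
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours
    base = rng.integers(0, len(X_min), n_new)  # pick base points
    nbr = nn[base, rng.integers(0, k, n_new)]  # pick one neighbour each
    gap = rng.random((n_new, 1))               # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

# tiny minority class of 10 points in 2-D, doubled to 30 total
rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 2))
synth = smote_sample(X_min, n_new=20, k=5, rng=1)
print(synth.shape)  # (20, 2)
```

Because every synthetic point is a convex combination of two real minority points, the new samples always lie inside the minority class's bounding box, which is also why SMOTE can struggle when minority points are scattered among the majority.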
To assess model performance I looked at the confusion matrix, in particular the error rate/accuracy and Cohen’s kappa (though the latter isn’t that useful in this case).
As for my workflows, I can’t post them; the data is sensitive. I can say that the ratio of the positive class to the negative class is approximately 1/30,000 (binary classification), and the whole dataset is roughly 1 million observations.
The workflows below are simplified examples of the original ones:
Regarding unbalanced data, you might want to consider this article and the hints from KNIME team members in previous threads, especially concerning SMOTE.
Then I added another balancing attempt with R and the ROSE algorithm, although I am a little wary about using it. You might also consider not fully balancing your dataset but instead bringing the minority group up to 10% or so, and then looking at AUC and other metrics rather than just the Scorer, which counts everything above 0.5 as a success.
Another option is to use the H2O nodes, which offer some balancing settings:
I don’t see the issue in your example; only 2 instances are wrongly classified.
Accuracy is probably the worst metric to choose for an imbalanced data set; it’s basically unusable in that scenario. Choose a different metric.
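To make that concrete with the thread's own numbers (roughly 1 positive per 30,000 negatives on ~1 million rows, counts below are my approximation): a model that never predicts the positive class already scores near-perfect accuracy, while a class-aware metric like balanced accuracy (the mean of per-class recalls) shows it is worthless.

```python
# With ~1 positive per 30,000 negatives, a model that predicts
# "negative" for everything is already ~99.997% accurate:
n_pos, n_neg = 33, 999_967             # roughly 1:30000 on 1M rows
acc_all_negative = n_neg / (n_pos + n_neg)
print(f"{acc_all_negative:.5f}")       # 0.99997 -- looks great, finds nothing

# Balanced accuracy exposes the failure immediately:
recall_pos = 0 / n_pos                 # the model catches zero positives
recall_neg = 1.0                       # and trivially all negatives
balanced_acc = (recall_pos + recall_neg) / 2
print(balanced_acc)                    # 0.5 -- no better than chance
```

Precision/recall on the positive class, F1, or PR-AUC would tell a similarly honest story here.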
You’re using XGBoost. Try class weights instead of SMOTE (which I would avoid like the plague).
One question remains: how unbalanced is your data? If you are doing anomaly detection with very low anomaly rates, standard ML probably won’t cut it. If it is more like a 10/90 imbalance, class weights should be able to handle it.
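For the weights approach in XGBoost, the usual starting point is the `scale_pos_weight` parameter, with the common rule of thumb `n_neg / n_pos` from the XGBoost docs. A sketch of computing it for the thread's rough class counts (my approximation) and passing it in a parameter dict:

```python
# Rule of thumb from the XGBoost docs: scale_pos_weight = n_neg / n_pos.
n_pos, n_neg = 33, 999_967          # roughly the thread's 1:30000 on 1M rows
spw = n_neg / n_pos
print(round(spw))                   # ~30302

# Passed to the booster instead of resampling with SMOTE, e.g.:
params = {
    "objective": "binary:logistic",
    "eval_metric": "aucpr",         # PR-AUC suits rare positives
    "scale_pos_weight": spw,
}
```

With an imbalance this extreme, treat `scale_pos_weight` as a hyperparameter to tune (values far below the rule-of-thumb ratio often work better), and evaluate on a metric like PR-AUC rather than accuracy.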
And as an additional comment: if you normalize on all your data, calculate features from it, and only then split into train/test, you are leaking information into the training set. The test set should not in any way be used for anything involved in training.
Normalization should only be fitted on the training data, although this alone doesn’t leak much. The real problem in your case is using a mean as a feature when that mean was calculated over all the data, not just the training set. That is a major issue regardless of whether and how it impacts your results.
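The leak-free pattern described above can be sketched in a few lines of numpy: fit the normalization statistics on the training split only, then apply those same statistics to the test split. The data here is synthetic, just to make the pattern runnable.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # stand-in dataset
X_train, X_test = X[:800], X[800:]                  # split FIRST

# Fit normalization statistics on the TRAINING split only...
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)

# ...then apply the same statistics to both splits. The test rows
# never contribute to mu/sigma, so nothing leaks into training.
X_train_n = (X_train - mu) / sigma
X_test_n = (X_test - mu) / sigma

print(X_train_n.mean(axis=0).round(6))  # ~0 on train by construction
print(X_test_n.mean(axis=0).round(3))   # near 0, but not exactly
```

The same rule applies to any derived feature (like the mean-based feature mentioned above): compute it from training rows only, then reuse those statistics for the test rows.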