Understanding SMOTE

sjmartin · February 10, 2021, 4:53pm

Hi all,

I am using the SMOTE node to balance my training classes for a classification problem.

I understand that the node selects a real record from a class and then identifies its x nearest neighbours to synthesise new, similar data records.

Question 1 - I wanted to understand whether KNIME selects a new record each time from the minority class before it identifies the nearest neighbours and creates the new synthetic record, or are all synthesised records based off a single record picked by KNIME?

Question 2 - I can see that the # Nearest Neighbours is set to 5 by default. Is there any good practice around setting this / is 5 generally a good choice for this parameter?

Thanks for any help!

kienerj · February 11, 2021, 5:21am

My personal opinion is that you should avoid SMOTE in favor of an algorithm that can deal with class or instance weights.

Having said that, the key thing about SMOTE is that you can only use the artificial data for training and never for the prediction / test / validation set, i.e. the node must be applied after splitting into train/test, and only to the training partition.
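To make the ordering concrete, here is a minimal sketch in plain Python. The toy records, the labels "a" and "b", and the `oversample` helper are all made up for illustration; the helper just duplicates minority rows at random as a stand-in for SMOTE, and nothing here reflects how the KNIME node is implemented.

```python
import random

rng = random.Random(42)

# Hypothetical toy data: (features, label) pairs; class "b" is the minority.
records = [([float(i)], "a") for i in range(20)] + [([100.0 + i], "b") for i in range(5)]

# 1) Split into train/test FIRST, per class, so both parts contain each class.
train, test = [], []
for label in ("a", "b"):
    rows = [r for r in records if r[1] == label]
    rng.shuffle(rows)
    cut = len(rows) * 4 // 5  # 80% of each class goes to training
    train += rows[:cut]
    test += rows[cut:]


def oversample(rows, label, target, rng):
    """Stand-in for SMOTE: pad class `label` up to `target` rows by random duplication."""
    pool = [r for r in rows if r[1] == label]
    extra = [rng.choice(pool) for _ in range(max(0, target - len(pool)))]
    return rows + extra


# 2) Balance ONLY the training partition; the test set stays untouched.
majority_count = sum(1 for _, y in train if y == "a")
train_balanced = oversample(train, "b", majority_count, rng)
```

The test partition keeps its original class proportions, so any score computed on it reflects real data only.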

SMOTE also has the issue that it only works for numeric features and not for discrete or categorical ones.

For the actual theory, refer to the original publication.
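As a rough illustration of the idea in the original publication (not of KNIME's internals, which this sketch does not claim to reproduce): each synthetic record starts from a real minority record, one of its k nearest minority neighbours is picked at random, and the new point is placed somewhere on the line between the two. The function name `smote`, its parameters and the toy data below are assumptions made for the example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows from a list of numeric minority-class rows."""
    rng = random.Random(seed)
    synthetic = []
    for i in range(n_new):
        # Cycle through the minority rows so each one serves as a base in turn.
        base_idx = i % len(minority)
        base = minority[base_idx]
        # k nearest minority neighbours of the base (Euclidean), excluding itself.
        others = [row for j, row in enumerate(minority) if j != base_idx]
        neighbours = sorted(others, key=lambda row: math.dist(base, row))[:k]
        neighbour = rng.choice(neighbours)
        # Interpolate at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_rows = smote(minority, n_new=6, k=2)
```

Because every synthetic row is an interpolation between two real rows, this only makes sense for numeric features. A larger k draws neighbours from further away, which spreads the synthetic points more widely.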

sjmartin · February 11, 2021, 9:34am

Thanks @kienerj for your insights.

Noted on only using SMOTE for the training data; I had picked this up from other resources as well.

The problem I’m working on is a multi-class text classification problem. I have created a numeric feature space to represent the term frequencies, and I am using a random forest classifier to learn the relationship between the classes and the term frequencies.

From my reading, SMOTE seems to be the go-to tool for oversampling minority classes to overcome class imbalance, which can hurt the classification performance of a random forest. I’d welcome any insights on whether this is the wrong way to go, and whether KNIME has a better node for balancing the classes.

Thanks all again for your guidance!

kienerj · February 11, 2021, 4:54pm

It is best to try out what works for you. Another option would be XGBoost with class weights.
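For what class weights could look like in practice, here is a small sketch using the common "balanced" heuristic, weight = n_samples / (n_classes * class_count). The function name `balanced_class_weights` and the class labels are invented for the example, and how the weights get passed to a particular learner or KNIME node is left out, since that depends on the tool.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Hypothetical imbalanced label column for a three-class text problem.
labels = ["sports"] * 60 + ["politics"] * 30 + ["science"] * 10
weights = balanced_class_weights(labels)

# Per-row weights, the form that learners with instance-weight support usually take.
sample_weights = [weights[y] for y in labels]
```

With these weights every class contributes the same total weight to training, so no synthetic records are needed.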

sjmartin · February 11, 2021, 10:49pm

Thanks for the tip, I’ll look into this!
