I am using the SMOTE node to balance my training classes for a classification problem.
I understand that the node selects a real record from the class and then identifies its x nearest neighbours in order to synthesise new, similar data records.
Question 1 - Does KNIME select a new record from the minority class each time before it identifies the nearest neighbours and creates the synthetic record, or are all synthesised records based on a single record picked by KNIME?
Question 2 - I can see that the # Nearest Neighbours is set to 5 by default. Is there any good practice around setting this / is 5 generally a good choice for this parameter?
My personal opinion is that you should avoid SMOTE in favor of an algorithm that can deal with class or instance weights.
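To make the class weight idea concrete, here is a minimal sketch in plain Python. It uses the common "balanced" heuristic (total rows divided by number of classes times class count); the function name and the toy labels are made up for illustration, and whether a given learner accepts such weights depends on the learner.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy example (hypothetical data): 90 rows of class "a", 10 rows of class "b".
labels = ["a"] * 90 + ["b"] * 10
weights = balanced_class_weights(labels)
print(weights)  # the rare class "b" gets the larger weight
```

With weights like these the learner penalises mistakes on the rare class more heavily, so no artificial rows are needed.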
Having said that, the key thing about SMOTE is that you can only use the artificial data for training and never for the prediction / test / validation set, i.e. the node must be applied after splitting into train/test and only to the training partition.
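The ordering can be sketched like this, outside of KNIME, in plain Python. All names are hypothetical, and `duplicate_minority` is only a simple stand-in for the oversampling step (it copies existing rows rather than interpolating like SMOTE); the point is just that the test partition is set aside before any oversampling happens.

```python
import random
from collections import Counter

def duplicate_minority(rows, labels, rng):
    """Stand-in oversampler: copy random rows until all classes are equal in size."""
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

def split_then_oversample(rows, labels, oversample, test_fraction=0.3, seed=0):
    """Split first, then oversample the training partition only."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    test_rows = [rows[i] for i in test_idx]
    test_labels = [labels[i] for i in test_idx]
    train_rows = [rows[i] for i in train_idx]
    train_labels = [labels[i] for i in train_idx]
    # Only the training data is touched; the test data stays as observed.
    train_rows, train_labels = oversample(train_rows, train_labels, rng)
    return (train_rows, train_labels), (test_rows, test_labels)

# Toy data (hypothetical): 100 unique rows, 10 of them in the minority class.
rows = [(float(i), float(i % 7)) for i in range(100)]
labels = ["minor" if i < 10 else "major" for i in range(100)]
(train_rows, train_labels), (test_rows, test_labels) = split_then_oversample(
    rows, labels, duplicate_minority
)
```

In a real workflow a stratified split would be the safer choice, so that every class is guaranteed to appear in both partitions.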
SMOTE also has the issue that it only works for numeric features and not for discrete or categorical ones.
For the actual theory, refer to the original publication.
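As a rough illustration of the idea in that publication, here is a minimal sketch in plain Python. It is not the KNIME node's implementation, which may differ in its details; the function name and parameters are made up, and the sketch uses one random interpolation factor per synthetic row. Each minority record takes its turn as the base record, one of its k nearest minority neighbours is picked at random, and the new row lies somewhere on the line between the two.

```python
import math
import random

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic rows by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for i in range(n_synthetic):
        base_idx = i % len(minority)  # cycle through all minority rows in turn
        base = minority[base_idx]
        others = [p for j, p in enumerate(minority) if j != base_idx]
        # k nearest minority neighbours of the base row (Euclidean distance)
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position between base (0) and neighbour (1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class (hypothetical): the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sketch(minority, n_synthetic=8, k=2)
```

A smaller k keeps synthetic rows close to existing local clusters, while a larger k spreads them across more of the minority region, which is the trade-off behind the number of nearest neighbours setting.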
Noted on only using SMOTE for the training data; I had picked this up from other resources as well.
The problem I'm working on is a multi-class text classification problem. I have created a numeric feature space to represent the term frequencies, and I am using a random forest classifier to learn the relationship between the classes and the term frequencies.
From my reading, SMOTE seems to be the go-to tool for oversampling minority classes to overcome class imbalance, which can hurt the classification performance of a random forest. I'd welcome any insights on whether this is the wrong way to go and whether KNIME has a better node for balancing the classes.