SMOTE Algorithm

The reference paper for the SMOTE algorithm states that:
“the minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbors”

In the pseudocode section they say:
“SMOTE(T,N,k)
Number of minority class samples T;
Amount of SMOTE N%;
Number of nearest neighbors k”
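To make the role of N and k concrete, here is a minimal Python sketch of the interpolation step described in the quote. It is not the node's implementation: the function name `smote`, the `seed` argument, and the restriction of N to multiples of 100 are assumptions made for illustration only.

```python
import math
import random

def smote(minority, n_percent, k, seed=0):
    """Return synthetic points for a list of minority samples (tuples of floats).

    Assumption of this sketch: n_percent is a multiple of 100, so every
    minority sample produces n_percent // 100 synthetic samples.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    per_sample = n_percent // 100
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority-class neighbors of x, excluding x itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(per_sample):
            nn = rng.choice(neighbors)
            gap = rng.random()  # position along the segment from x to nn
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic

# Three minority samples, N = 200%, k = 2 gives 3 * 2 = 6 synthetic samples.
new_points = smote([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], n_percent=200, k=2)
```

Note that the number of synthetic samples depends only on T and N here, not on the size of the majority class.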

That said, my question refers to what happens when the option from the SMOTE node is checked. From what I understood from the original algorithm, the user should choose the number N (how much one wants to oversample the minority class). Therefore, if one selects N, only the minority class should be oversampled accordingly.

I did a toy example with 10 samples (with 1 sample as the minority class) and 2 features. When it is checked, I got a dataset with 18 samples, with the additional 8 (synthetic) samples derived from the original minority class. Now the data set is balanced, but its size has almost doubled. This can be extremely slow if the data set is already big.
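The arithmetic behind the toy example can be sketched in a few lines of Python. Both helper names are hypothetical, and the second one assumes the N% rule from the paper's pseudocode rather than anything the node currently offers.

```python
def size_after_balancing(n_majority, n_minority):
    # Balancing adds enough synthetic minority samples to match the majority.
    synthetic = n_majority - n_minority
    return n_majority + n_minority + synthetic

def size_after_smote_n(n_majority, n_minority, n_percent):
    # With an explicit N, only n_minority * N / 100 synthetic samples are added.
    synthetic = n_minority * n_percent // 100
    return n_majority + n_minority + synthetic

# Toy example: 9 majority samples and 1 minority sample.
balanced_size = size_after_balancing(9, 1)     # 9 + 1 + 8 = 18
with_n_200 = size_after_smote_n(9, 1, 200)     # 9 + 1 + 2 = 12
```

With balancing, the result is always twice the majority count, which is why a heavily imbalanced data set nearly doubles.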

Finally, the question: do you expect to have a future version with N as an input for the node, so that it follows the logic of the original algorithm? That would avoid doubling a data set that is already too big.

Thanks,

Dear @gugadrum,

the SMOTE node is not under active development but we are looking into the option that you mentioned.
In the meantime, you could simulate the described behavior by subsampling the majority class before you apply the SMOTE node.
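As an illustration of that workaround outside of any particular tool, the subsampling step could look roughly like this in plain Python. The function name `subsample_majority`, its parameters, and the dictionary row layout are assumptions for this sketch, not part of the SMOTE node.

```python
import random

def subsample_majority(rows, label_key, majority_label, keep, seed=0):
    """Keep all non-majority rows and a random subset of `keep` majority rows."""
    rng = random.Random(seed)
    majority = [r for r in rows if r[label_key] == majority_label]
    rest = [r for r in rows if r[label_key] != majority_label]
    # Sample without replacement; never ask for more rows than exist.
    kept = rng.sample(majority, min(keep, len(majority)))
    return rest + kept

# 9 majority rows and 1 minority row, reduced to 3 majority rows.
rows = [{"label": "maj", "id": i} for i in range(9)] + [{"label": "min", "id": 9}]
reduced = subsample_majority(rows, "label", "maj", keep=3)
```

Balancing the reduced table afterwards adds far fewer synthetic rows, since the minority class only has to catch up with the smaller majority count.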

With regards,

Adrian
