SMOTE - more efficient way for oversampling

zarniak · August 28, 2019, 11:43am

Hi, I have a minor question. I am using the SMOTE node to balance an imbalanced data set:

The target class has two outcomes: 0 and 1.
0 occurs in ca. 99.9% of cases, 1 in 0.01%.

The data set has ca. 2M rows.

I am using SMOTE in two alternative ways:
#1 - add the SMOTE node and configure it to “oversample minority class”
#2 - add a Row Splitter that splits on the target class, oversample the “1” rows with SMOTE by a factor of 1000, and then concatenate the result with the rows that were split off before

Solution #1 is much more time-consuming but gives very similar results. Is solution #2 a proper way to use SMOTE?
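
For reference, the fixed factor that would roughly balance the two classes follows from the class counts alone. The sketch below is plain arithmetic, not anything taken from the node itself: the function name `balancing_factor` is hypothetical, and "factor" is assumed to mean the number of synthetic rows added per existing minority row. Since the class shares quoted above do not sum to 100%, it evaluates both a 0.1% and a 0.01% minority share.

```python
def balancing_factor(n_rows, minority_share):
    """Synthetic rows to add per minority row so both classes end up equal in size.

    Assumption: 'factor' counts additional rows per existing minority row.
    """
    n_minority = round(n_rows * minority_share)
    n_majority = n_rows - n_minority
    # Extra minority rows needed to match the majority count
    needed = n_majority - n_minority
    return needed / n_minority


for share in (0.001, 0.0001):
    factor = balancing_factor(2_000_000, share)
    print(f"minority share {share:.2%}: add about {factor:.0f} rows per minority row")
```

Under these assumptions a factor of 1000 is close to balanced for a 0.1% minority share, while a 0.01% share would need a factor near 10000.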

mlauber71 · August 31, 2019, 10:33am

You could try your luck with R’s ROSE library.

kn_example_rose_balanced.knwf (905.8 KB)

lisovyi · September 1, 2019, 12:22pm

Hi @zarniak,

thanks for reporting this!

Both options are correct. So you are fine to use the faster option (#2, i.e. filter the minority class, oversample it with a fixed factor and then add it back to the majority class).
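
As an illustration of why the two options give similar results: standard SMOTE builds each synthetic row by interpolating between a minority-class row and one of its nearest minority-class neighbours, so the majority rows never enter the computation. The following is a minimal pure-Python sketch of option #2 under that assumption. It is not the node's actual code, and the names `smote_minority` and `oversample_split_concat` are hypothetical.

```python
import math
import random


def smote_minority(minority, factor, k=5, seed=0):
    """Create `factor` synthetic rows per minority row by SMOTE-style interpolation.

    Needs at least two minority rows; neighbours are searched among minority rows only.
    """
    rng = random.Random(seed)
    synthetic = []
    for i, row in enumerate(minority):
        others = [m for j, m in enumerate(minority) if j != i]
        # k nearest minority neighbours by Euclidean distance
        neighbours = sorted(others, key=lambda m: math.dist(row, m))[:k]
        for _ in range(factor):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position on the segment between row and neighbour
            synthetic.append([a + gap * (b - a) for a, b in zip(row, nb)])
    return synthetic


def oversample_split_concat(X, y, minority_label=1, factor=3, k=5, seed=0):
    """Option #2: split by class, oversample the minority only, concatenate back."""
    majority = [(x, lab) for x, lab in zip(X, y) if lab != minority_label]
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    synthetic = smote_minority(minority, factor, k, seed)
    rows = majority + [(x, minority_label) for x in minority + synthetic]
    return [x for x, _ in rows], [lab for _, lab in rows]


# Toy data: six majority rows (label 0) and three minority rows (label 1)
X = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [0.5, 2.0], [1.5, 1.5], [2.5, 2.0],
     [10.0, 10.0], [10.5, 9.5], [11.0, 10.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

X_new, y_new = oversample_split_concat(X, y, factor=3, k=2)
print(y_new.count(0), y_new.count(1))
```

The majority rows pass through untouched, which is why splitting them off before oversampling should not change what SMOTE produces for the minority class.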

We identified the source of this performance difference and will improve the code to eliminate it in the future.

Best regards,
Mischa
