Sampling Strategies Comparison

This is a companion discussion topic for the original entry at

First and foremost, thank you for the informative blog entry on sampling strategies and for the workflow published on the KNIME Hub. However, I find the conclusion reached in the blog entry, that “by far the best performing model is the one where class imbalance was trained on undersampled training data”, to be somewhat incomplete. It holds only when the same class probability cutoff of 0.5 is used for classification with every model. A model trained on imbalanced data tends to assign low probabilities to the minority class, so a fixed 0.5 cutoff puts it at a disadvantage. Setting the initial threshold method in the Binary Classification Inspector to Max Youden’s Index allows for a fairer comparison of what the models can achieve after using the different sampling strategies.
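
To illustrate the point outside of KNIME, here is a minimal Python sketch of picking a cutoff by maximizing Youden's index (J = sensitivity + specificity - 1). The labels and scores are made-up toy values, not results from the blog workflow, and the function names `youden_j` and `best_youden_threshold` are my own; this is not how the Binary Classification Inspector is implemented, just the same idea in a few lines.

```python
def youden_j(y_true, scores, threshold):
    """Return sensitivity + specificity - 1 for the given cutoff.

    Assumes y_true contains at least one positive (1) and one negative (0).
    """
    tp = fn = tn = fp = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def best_youden_threshold(y_true, scores):
    """Return the observed score that maximizes Youden's index."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: youden_j(y_true, scores, t))


# Toy, imbalanced example (hypothetical numbers): 8 negatives, 2 positives.
# The positives get scores below 0.5, as often happens without resampling.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.40, 0.35, 0.45]

j_at_default = youden_j(y_true, scores, 0.5)
best_cutoff = best_youden_threshold(y_true, scores)
j_at_best = youden_j(y_true, scores, best_cutoff)

print(f"J at cutoff 0.5: {j_at_default:.3f}")
print(f"Best cutoff: {best_cutoff}, J: {j_at_best:.3f}")
```

With the fixed 0.5 cutoff this toy model never predicts the minority class, while the Youden-optimal cutoff recovers both positives at the cost of one false positive. Comparing each sampling strategy at its own optimal cutoff is the comparison I am suggesting.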

Blog entry: Too Much Data or Not Enough? Solve with Statistical Sampling | KNIME