tackle imbalanced data set (please urgent)

UgurErcan · October 28, 2021, 10:26am

There are two methods in bootstrap sampling: bootstrap samples and holdout samples. I’ve tried both in the same model and they give similar results. What is the difference between them? Under what conditions should I choose one over the other?

Thank you.

ScottF · November 2, 2021, 4:11pm

I assume you are referring here to bootstrap sampling and cross validation? Holdout sampling is just the method of partitioning your data into training and test (holdout) sets for the purposes of evaluating model performance.

Bootstrap sampling (usually with replacement) is a method for creating artificial resamples of an existing dataset by drawing rows at random, often for the purposes of downstream testing. Cross validation divides your dataset into “folds” so that every row is used for testing exactly once: each fold takes a turn as the test set while the model is trained on the remaining folds. It is often used to check how stable your model's performance is.
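If it helps to see the difference in code, here is a rough sketch in plain Python that works on row indices only. The function names `bootstrap_split` and `k_fold_splits` are made up for illustration and are not KNIME nodes or library calls. I am also assuming that the rows a bootstrap draw never picks (the "out-of-bag" rows) play the role of the holdout sample.

```python
import random

def bootstrap_split(n_rows, seed=0):
    """Draw n_rows indices with replacement; rows never drawn are out-of-bag."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]  # duplicates allowed
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_rows) if i not in drawn]
    return in_bag, out_of_bag

def k_fold_splits(n_rows, k=5, seed=0):
    """Shuffle once, then let each of the k folds be the test set in turn."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds covering all rows
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

in_bag, out_of_bag = bootstrap_split(100)
splits = k_fold_splits(100, k=5)
```

The bootstrap training set has the same size as the original data but contains repeated rows, and the rows it leaves out change with every draw. In the k-fold version the test sets never overlap and together cover the whole dataset.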

But in your subject line you are asking about imbalance. One simple way to address this is via stratified sampling in your partitioning step, but that will often not be enough. In that case you may want to do oversampling to boost your minority class (using something like SMOTE, though some folks don’t like it much). There are lots of threads about this problem on the forum that you can search for.
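As a minimal sketch of those two ideas, again in plain Python on row indices: a stratified split that keeps the class proportions in both partitions, followed by simple random oversampling of the training rows. Note that this is plain duplication of minority rows, not SMOTE (which synthesizes new points). The function names and the 90/10 toy labels are made up for illustration, and the sketch assumes every class has at least two rows.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Split row indices so each class keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_fraction))
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return train, test

def random_oversample(train_idx, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    target = max(len(idxs) for idxs in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        balanced.extend(rng.choices(idxs, k=target - len(idxs)))
    return balanced

labels = [0] * 90 + [1] * 10           # toy 90/10 imbalance
train, test = stratified_split(labels)
balanced_train = random_oversample(train, labels)
```

Only the training partition is oversampled here. The test partition keeps the original class ratio, so the evaluation still reflects the imbalance you would see in real data.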

Are you working on a problem for a class? If you have a KNIME workflow you are working on, you could upload that if you need some hints.
