Problem with unbalanced data (examples attached)

tw349 · August 19, 2020, 8:50pm

I am having a real problem with unbalanced data. The models end with good scores (i.e. “accuracy”), but essentially all the prediction models do is “put all chips on black”, and the cases that are actually white are simply misclassified.
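To make the symptom concrete, here is a minimal sketch with made-up labels (95 "black" vs. 5 "white"; these numbers are an assumption, not the data from the workflow). It shows how a model that always predicts the majority class can score high accuracy while finding none of the rare cases:

```python
# Hypothetical labels: 1 = rare class ("white"), 0 = majority class ("black").
y_true = [0] * 95 + [1] * 5

# A "model" that always bets on the majority class.
y_pred = [0] * len(y_true)

# Count true positives, false negatives and overall correct predictions.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
# prints: accuracy=0.95 recall=0.00
```

Accuracy looks fine here only because the majority class dominates the count; recall on the rare class exposes the problem.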

I tried three different predictor models, plus SMOTE and Row Sampling, but had no luck.

Anyone have any ideas or suggestions for very unbalanced data?

Thank you in advance!

I will attach the workflow and database in the next posts.

tw349 · August 19, 2020, 9:01pm

See workflow.

https://drive.google.com/file/d/1wrUzVHJZRY_SoM9xTnfblOzmxoVWKGSm/view?usp=sharing

See database:

tw349 · August 19, 2020, 9:08pm

I put the workflow on the KNIME public hub at:

You can also go to the public hub and search for: UnbalancedData1

mlauber71 · August 19, 2020, 9:58pm

You could try these things.

Read this article about imbalanced data

Use the H2O AutoML approach with the KNIME wrapper

and choose AUCPR as the sort metric
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/sort_metric.html
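As a rough illustration of why AUCPR is a useful sort metric for rare classes, the sketch below computes average precision (one common step-wise estimate of the area under the precision-recall curve) in plain Python. The function name, labels and scores are made up for this example, and H2O's own AUCPR computation may differ in detail:

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    # Rank examples from the highest to the lowest score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank of this positive
    return total / n_pos


# Hypothetical data: 2 positives among 10 rows.
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
good_scores = [0.95, 0.40, 0.90, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05]
poor_scores = [0.10, 0.95, 0.05, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]

ap_good = average_precision(y_true, good_scores)
ap_poor = average_precision(y_true, poor_scores)
print(f"good ranking: {ap_good:.3f}, poor ranking: {ap_poor:.3f}")
# prints: good ranking: 1.000, poor ranking: 0.156
```

Unlike accuracy, this number only rises when the rare positives are ranked ahead of the negatives.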

You could also tell the algorithm to balance the classes. Be careful with that and balance only your training data, not the validation data, so that the validation metrics still reflect the real class distribution.
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/balance_classes.html
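The "only balance the training data" point can be sketched in plain Python as follows. This is not how H2O's `balance_classes` is implemented; it is a simple random-oversampling illustration, and the helper name `split_then_oversample` and the toy rows are assumptions made for this example:

```python
import random


def split_then_oversample(rows, label_of, train_fraction=0.7, seed=42):
    """Split first, then oversample the minority class in the training part only."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, valid = shuffled[:cut], shuffled[cut:]

    pos = [r for r in train if label_of(r) == 1]
    neg = [r for r in train if label_of(r) == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        raise ValueError("training split contains only one class")

    # Duplicate random minority rows until both classes have the same size.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced_train = train + extra
    rng.shuffle(balanced_train)
    return balanced_train, valid  # the validation split is left untouched


# Hypothetical rows of (row_id, label) with 10% positives.
rows = [(i, 1 if i % 10 == 0 else 0) for i in range(200)]
balanced_train, valid = split_then_oversample(rows, label_of=lambda r: r[1])

train_pos = sum(r[1] for r in balanced_train)
print(len(balanced_train), train_pos, len(valid), sum(r[1] for r in valid))
```

Because the split happens before the oversampling, no duplicated row can leak into the validation set, and the validation set keeps its original imbalance.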

See if H2O comes up with a good cutoff point.
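For intuition about what a "good cutoff point" means, here is a small plain-Python sketch that scans candidate thresholds and keeps the one with the best F1 score. H2O reports its own thresholds; the function name and the scores below are invented for illustration only:

```python
def best_f1_threshold(y_true, scores):
    """Return (threshold, f1) for the score cutoff that maximises F1."""
    best_threshold, best_f1 = None, 0.0
    for threshold in sorted(set(scores)):
        tp = fp = fn = 0
        for t, s in zip(y_true, scores):
            pred = 1 if s >= threshold else 0
            if pred == 1 and t == 1:
                tp += 1
            elif pred == 1 and t == 0:
                fp += 1
            elif pred == 0 and t == 1:
                fn += 1
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Hypothetical validation labels and predicted probabilities of the rare class.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
scores = [0.05, 0.10, 0.12, 0.15, 0.20, 0.22, 0.30, 0.35, 0.40, 0.45]

threshold, f1 = best_f1_threshold(y_true, scores)
print(f"threshold={threshold:.2f} f1={f1:.2f}")
# prints: threshold=0.35 f1=0.80
```

Note that with the default cutoff of 0.5 none of these rows would be predicted positive, which is exactly the "all chips on black" behaviour. The cutoff should be chosen on validation data, not on the training data.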

In addition, you could try the R package vtreat, tell it which class is the positive one, and see whether that helps in combination with the other measures.

I will see if I can put together an ‘unbalanced’ version of my H2O.ai automl wrapper.

beginner · August 20, 2020, 6:02am

Some comments:

The data needs some cleaning and preparation (different spellings for the same thing)
Why are you converting the number fields to string?
There is no explanation of the features, so it is hard to tell why you chose some and not others. The manual feature selection is unclear
SMOTE is crap, don’t use it
Try xgboost with scale_pos_weight
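For the `scale_pos_weight` suggestion, a common starting value is the ratio of negative to positive training rows. The sketch below only computes that ratio and builds a parameter dictionary; the label counts are made up, and the helper name is hypothetical:

```python
def scale_pos_weight(labels):
    """Ratio of negative to positive rows, a common starting value."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("no positive rows in the training labels")
    return n_neg / n_pos


# Hypothetical training labels: 950 negatives, 50 positives.
y_train = [0] * 950 + [1] * 50
weight = scale_pos_weight(y_train)

# Parameters that could then be handed to XGBoost, for example
# xgboost.XGBClassifier(scale_pos_weight=weight), if xgboost is installed.
params = {
    "objective": "binary:logistic",
    "eval_metric": "aucpr",
    "scale_pos_weight": weight,
}
print(params["scale_pos_weight"])
# prints: 19.0
```

Treat the ratio as a starting point to tune rather than a fixed rule, and compute it from the training split only.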

Some data simply doesn’t carry a good signal, which makes a lot of sense here: fire has a very large random component. Still, depending on what the model’s goal is, even a poor model can be of some help (reduced risk).
