Problem with unbalanced data (examples attached)

I am having a real problem with unbalanced data. The models end up with good scores (e.g. “accuracy”), but essentially all they do is “put all chips on black”, and the cases that are actually white are simply misclassified.
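A minimal sketch of the symptom, using made-up labels rather than the attached data: with 95 "black" (0) and 5 "white" (1) rows, a model that always predicts black reaches 95% accuracy while finding none of the white cases. The class ratio and the helper name `confusion_counts` are assumptions for illustration only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


# Hypothetical data: 5 minority ("white") rows, 95 majority ("black") rows.
y_true = [1] * 5 + [0] * 95
# A model that "puts all chips on black".
y_pred = [0] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

This is why accuracy alone is a poor score here; recall and precision on the minority class show what is actually happening.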

I tried three different predictor models, plus SMOTE and Row Sampling, but no luck.

Anyone have any ideas or suggestions for very unbalanced data?

Thank you in advance!

I will attach workflow and database in next posts.

See workflow.

https://drive.google.com/file/d/1wrUzVHJZRY_SoM9xTnfblOzmxoVWKGSm/view?usp=sharing

See database:

I put the workflow on the Knime public hub at:

You can also go to public hub and search for: UnbalancedData1


You could try these things.

Read this article about imbalanced data

Use the H2O AutoML approach with the KNIME wrapper and choose AUCPR as the sort metric:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/sort_metric.html
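To give a feel for what a precision-recall based metric rewards, here is a rough sketch that computes average precision, a common summary of the precision-recall curve. It is not H2O's own AUCPR calculation, which may differ in detail; the function name and the toy scores are assumptions.

```python
def average_precision(y_true, scores):
    """Average of the precision values at the rank of each positive row."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank rows from highest to lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` rows
    return total / n_pos


# Hypothetical labels and scores (no tied scores).
y_true = [1, 0, 0, 1, 0]
ap_mixed = average_precision(y_true, [0.9, 0.8, 0.7, 0.6, 0.5])
ap_perfect = average_precision(y_true, [0.9, 0.1, 0.2, 0.8, 0.3])

print(ap_mixed, ap_perfect)
```

Unlike accuracy, this kind of score only improves when the rare class is ranked ahead of the common one.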

You could also tell the algorithm to balance the classes. Be careful with that: balance only your training data, not the validation data, so the validation scores still reflect the real class distribution.
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/balance_classes.html
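The "only balance the training data" point can be sketched like this: split first, then oversample the minority class in the training part only. This is plain random oversampling written for illustration, not what H2O's `balance_classes` does internally; `split_then_oversample`, the split fraction and the toy rows are all assumptions.

```python
import random


def split_then_oversample(rows, label_of, validation_fraction=0.25, seed=0):
    """Split first, then balance only the training part by random oversampling."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)

    n_val = int(len(shuffled) * validation_fraction)
    validation = shuffled[:n_val]  # left untouched: keeps the real class ratio
    train = shuffled[n_val:]

    minority = [r for r in train if label_of(r) == 1]
    majority = [r for r in train if label_of(r) == 0]
    if not minority:
        raise ValueError("no minority rows in the training split")

    # Duplicate random minority rows until both classes have the same size.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced_train = train + extra
    rng.shuffle(balanced_train)
    return balanced_train, validation


# Hypothetical rows: (row_id, label), 10% positives.
rows = [(i, 1 if i % 10 == 0 else 0) for i in range(200)]
balanced_train, validation = split_then_oversample(rows, label_of=lambda r: r[1])

n_pos = sum(1 for r in balanced_train if r[1] == 1)
n_neg = sum(1 for r in balanced_train if r[1] == 0)
print(n_pos, n_neg, len(validation))
```

Because the split happens before the duplication, no copied row can leak into the validation set.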

See if H2O comes up with a good cutoff point.
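As a rough idea of what a "good cutoff point" can mean, the sketch below scans candidate thresholds and keeps the one with the best F1 score. Choosing F1 as the criterion is an assumption here (another criterion may fit the use case better), and the scores are made up; in practice the threshold should be picked on validation data.

```python
def best_f1_threshold(y_true, scores):
    """Return (threshold, f1) maximising F1 over the observed score values."""
    best_threshold, best_f1 = None, 0.0
    for threshold in sorted(set(scores)):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        if tp == 0:
            continue  # F1 is zero or undefined without any true positive
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Hypothetical predictions: every score is below 0.5, so the default
# cutoff would label all rows as the majority class.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.12, 0.15, 0.20, 0.22, 0.25, 0.30, 0.35, 0.40]

threshold, f1 = best_f1_threshold(y_true, scores)
print(threshold, round(f1, 3))
```

With rare positives the predicted probabilities are often low across the board, so a cutoff well below 0.5 can turn an "all black" model into a usable one.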

In addition, you could try R vtreat, tell it which class is the positive one, and see whether this helps in combination with the other measures.

I will see if I can put together an ‘unbalanced’ version of my H2O.ai automl wrapper.


Some comments:

  1. The data needs some cleaning and preparation (different spellings for the same thing).
  2. Why are you converting the number fields to string?
  3. Is there an explanation of the features? It is hard to tell why you chose some and not others; the manual feature selection is unclear.
  4. SMOTE is crap, don’t use it.
  5. Try XGBoost with scale_pos_weight.
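For point 5, a commonly used starting value for `scale_pos_weight` is the ratio of negative to positive rows in the training data. The sketch below only computes that ratio and places it in a parameter dictionary; the label counts and the helper name are made up, and the other two parameters are just one plausible choice for a binary problem.

```python
from collections import Counter


def suggested_scale_pos_weight(labels, positive=1):
    """Ratio of negative to positive rows, a usual starting point for tuning."""
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive rows in the training labels")
    return n_neg / n_pos


# Hypothetical training labels: 20 positives, 980 negatives.
train_labels = [1] * 20 + [0] * 980

# Parameter dictionary to pass on to XGBoost; values are illustrative.
params = {
    "objective": "binary:logistic",
    "eval_metric": "aucpr",
    "scale_pos_weight": suggested_scale_pos_weight(train_labels),
}
print(params)
```

Treat the ratio as a first guess and tune it like any other hyperparameter, computing it from the training split only.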

Some data simply doesn’t carry a good signal, which makes a lot of sense here: fire has a very high random aspect to it. Still, depending on what the model’s goal is, a poor model can be of some help (reduced risk).

