H2O.ai AutoML - a powerful automated machine learning framework wrapped with KNIME
It features various models like Random Forest or XGBoost along with Deep Learning. It has wrappers for R and Python and can also be used from KNIME. The results are written to a folder, and the models are stored in MOJO format so they can be used in KNIME (as well as on a Big Data cluster via Sparkling Water).
One major parameter to set is the running time AutoML has to test various models and to do some hyperparameter optimization. The best model of each round is stored, and some graphics are produced so you can inspect the results.
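For orientation, here is a minimal sketch of what this looks like in H2O's Python API (the file path and column names are placeholder assumptions; the KNIME wrapper sets these parameters through the node dialog):

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# placeholder path and target column for the census income data
train = h2o.import_file("census_income.csv")
y = "income"
x = [c for c in train.columns if c != y]
train[y] = train[y].asfactor()  # treat the target as a binary class

# max_runtime_secs is the running-time parameter discussed above
aml = H2OAutoML(max_runtime_secs=150, seed=42)
aml.train(x=x, y=y, training_frame=train)

# store the best model in MOJO format for re-use in KNIME / Sparkling Water
aml.leader.download_mojo(path="./model")
```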
Results are interpreted through various statistics, and the model characteristics are stored in an Excel and a TXT file as well as in PNG graphics you can easily re-use in presentations and to give your winning models a visual inspection.
Also, you can use the Meta node “Model Quality Classification - Graphics” to evaluate other binary classification models:
ROC Curve and Gini coefficient
A classic ROC (receiver operating characteristic) curve with statistics like the Gini coefficient, which measures the ‘inequality’ and is what we want to maximize in this case.
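As a quick sanity check outside the workflow, the Gini coefficient follows directly from the AUC. A sketch with scikit-learn, where y_true and y_score are placeholders for your test labels and model scores:

```python
from sklearn.metrics import roc_auc_score

# y_true: 0/1 labels, y_score: predicted probabilities (placeholders)
auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1  # Gini = 2*AUC - 1; higher means stronger separation
print(f"AUC = {auc:.3f}, Gini = {gini:.3f}")
```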
TOP Decile Lift
A classic lift curve with statistics, illustrating how the TOP 10% of your scores are doing compared to the rest. You get the cumulative lift, which ends at 1.0 (the green line, corresponding to the average % of targets in your population), and the lift for each 10% step. This graphic and its statistics are useful if you want to put emphasis on the top group.
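If you want to reproduce such a lift table yourself, a rough pandas sketch (y_true and y_score are placeholders again) could look like this:

```python
import pandas as pd

df = pd.DataFrame({"target": y_true, "score": y_score})
base_rate = df["target"].mean()  # average % of targets in the population

# decile 1 = TOP 10% of scores
df["decile"] = pd.qcut(df["score"].rank(method="first", ascending=False),
                       10, labels=range(1, 11))
lift_per_decile = df.groupby("decile")["target"].mean() / base_rate

# cumulative lift ends at 1.0 once the whole population is included
ordered = df.sort_values("score", ascending=False)
cumulative_lift = ordered["target"].expanding().mean() / base_rate
print(lift_per_decile)
```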
Kolmogorov-Smirnov Goodness-of-Fit Test
Two curves illustrating the Kolmogorov-Smirnov Goodness-of-Fit Test, an indication of how well the two groups have been separated. The higher the statistic, the better. Also, inspect the curves visually.
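The statistic itself is just the maximum distance between the cumulative score distributions of the two groups. A quick check with SciPy (y_true and y_score as NumPy placeholder arrays):

```python
from scipy.stats import ks_2samp

# compare score distributions of targets vs. non-targets (placeholders)
ks_stat, p_value = ks_2samp(y_score[y_true == 1], y_score[y_true == 0])
print(f"KS = {ks_stat:.3f}")  # the higher, the better the separation
```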
Find the best cut-off point for your model
Gives you an idea where the best cutoff might be by consulting two measures:
- >0.39 score if you follow Cohen’s Kappa
- >0.31 if you follow the best F1 score
There is always a price to pay. The blue curve gives you the % of non-targets, with regard to all cases, that you would have to carry with you if you choose this specific cutoff.
If you choose >0.39, you will capture 74% of all your targets. You will have to ‘carry’ 7% of all your cases that are non-targets, which overall make up 67% of your population.
If you choose a cutoff of >0.68, you get nearly 50% of your targets with only about 2% of the population as non-targets. Whether this is good or bad for your business case is for you to decide. For more details see the Excel file.
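To illustrate where such suggestions come from (the 0.39 and 0.31 above are this workflow's results, not general constants), a simple threshold sweep with scikit-learn would look like this:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score

# y_true: 0/1 labels, y_score: predicted probabilities (placeholders)
thresholds = np.linspace(0.01, 0.99, 99)
kappa = [cohen_kappa_score(y_true, (y_score >= t).astype(int)) for t in thresholds]
f1 = [f1_score(y_true, (y_score >= t).astype(int)) for t in thresholds]

print("best cutoff by Cohen's Kappa:", thresholds[np.argmax(kappa)])
print("best cutoff by F1 score:", thresholds[np.argmax(f1)])
```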
The accompanying Excel file also holds some interesting information
The Leaderboard from the set of models run
It gives you an idea:
- Which types of models were considered?
- The spread of the AUC can be quite wide. Since all the models only trained for 2.5 minutes, further training time might well result in better models.
- In between there are some other models besides GBM; if they appeared more often, you might investigate them further.
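In the Python API the same leaderboard is available directly on the AutoML object (aml as in the sketch at the top):

```python
# convert the H2O leaderboard to pandas for easier inspection
lb = aml.leaderboard.as_data_frame()
print(lb.head(10))  # model_id, auc, logloss, ... per candidate model
```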
If you are into tweaking your models further, the model summary also gives you the parameters used.
Further information, including details about the cross-validations performed, is stored in the printout of the whole model with all its parameters.
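If you work in Python, the leading model's parameters and cross-validation summary can be inspected like this (note that aml.leader may be a stacked ensemble):

```python
leader = aml.leader
print(leader.actual_params)  # dict of the parameters the model actually used
print(leader.cross_validation_metrics_summary())  # per-fold metrics and std
```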
Variable Importance is very important
Then there is the variable importance list. You should study that list carefully. If one variable captures all the importance, you might have a leak. The variables should also make sense.
If you have a very long list and the variables further down stop making sense, you could cut them off (besides all the data preparation magic you could do with vtreat, featuretools, tsfresh, label encoding and so on). Also, H2O does some modifications of its own.
You could use that list to shrink your predictor (x) variables and re-run the model. The list of variables is also stored in the overall output.
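In Python, the same list can be pulled for models that expose variable importance (GBM, DRF, XGBoost, Deep Learning); for a stacked-ensemble leader you would pick an individual model from the leaderboard instead:

```python
# variable importance of the leading model as a pandas frame
vi = aml.leader.varimp(use_pandas=True)
print(vi.head(20))  # variable, relative/scaled/percentage importance
```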
Fun fact in this case: your relationship and marital status are more important in determining whether you will earn more than $50,000 than your education …
Get an overview of how your model is doing in bins and numbers
I like this sort of table since it gives you an idea of what a cutoff at a certain score (“submission”) would mean.
All numbers are taken from the test/validation group (30% of your population in this case); you might have to think about your overall population to get the exact proportions.
If you choose a cutoff at 0.8, you would get 92% precision and capture 43% of all your desired targets. In marketing/cross-selling that would be an excellent result. In credit scoring, you might not want to live with 8% of people not paying back their loans. So again, the cutoff and the value of your model very much depend on your business question.
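A rough way to rebuild such a bin table yourself, again with placeholder arrays:

```python
import pandas as pd

df = pd.DataFrame({"target": y_true, "score": y_score})
total_targets = df["target"].sum()

for cutoff in [0.5, 0.6, 0.7, 0.8, 0.9]:
    selected = df[df["score"] >= cutoff]
    precision = selected["target"].mean()                # % of true targets above the cutoff
    captured = selected["target"].sum() / total_targets  # % of all targets you keep
    print(f"cutoff {cutoff:.1f}: precision {precision:.0%}, captured {captured:.0%}")
```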
A word about cross-validation
Another aspect of your model's quality and stability can be judged by looking at cross-validation. Although H2O, for example, does a lot of that by default in order to avoid overfitting, you might want to do some checks of your own.
The basic idea is: if your model is really catching a general trend and has good rules, they should work on all (random) sub-populations, and you would expect the model to be quite stable.
Several tests are run. In the end, we look at a combined standard deviation: 0 would represent a perfect match between all subgroups (sub-sampling and leave-one-out techniques). So if you have to choose between several excellent models, you might want to consider the one with the least deviation.
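The workflow has its own way of combining these tests; to sketch the basic idea with scikit-learn (model, X, y are placeholders for your estimator and data):

```python
from sklearn.model_selection import cross_val_score

# AUC on 10 random folds; a std close to 0 means stable rules
auc_scores = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
print(f"mean AUC = {auc_scores.mean():.3f}, std = {auc_scores.std():.3f}")
```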
Jupyter notebook
Enclosed in the workflow, in the subfolder
/script/kn_automl_h2o_classification_python.ipynb
there is a walkthrough of AutoML in a Jupyter notebook, so you can explore the functions further or work without the KNIME wrapper.