This thread is for posting solutions to “Just KNIME It!” Challenge 24, the second part of our four-week series on data classification! How can we automate the choice of a classifier, including its hyperparameter values?
I stayed very traditional and simple in my approach for part 2: welcome, XGBoost!
Moreover, I really like the fact that feature importances are already calculated in the XGBoost Tree Ensemble Learner node, so I can compare the features used in this model with those used in the models I tested in part 1.
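For anyone who wants to reproduce that comparison outside KNIME, here is a minimal Python sketch of the same idea using the xgboost package's scikit-learn API (the DataFrame `df` and its `target` column are placeholders, not part of my workflow):

```python
# Minimal sketch: train an XGBoost classifier and inspect feature importances,
# mirroring what the XGBoost Tree Ensemble Learner node reports out of the box.
# Assumes a pandas DataFrame `df` with a "target" column (placeholder names).
import pandas as pd
from xgboost import XGBClassifier

X = df.drop(columns=["target"])
y = df["target"]

model = XGBClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# One importance score per column, ready to compare against the part 1 models
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```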
Hi everyone,
Here is my solution.
Using the AutoML component made the workflow folder ~60 MB in size.
However, just using one learner, like decision trees, keeps the folder size at ~1 MB.
Automation causes a 60x growth in folder size.
Hi everyone! Just finished the challenge and decided to do things a little differently from what you guys were doing. I went for a random forest learner and, to feed it, used the SMOTE node to oversample the input data. With this methodology, I reached an accuracy of 95.502% with a Cohen's kappa of ~0.8.
You guys can check it out here!
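For reference, roughly the same pipeline can be sketched in Python with imbalanced-learn and scikit-learn (`X` and `y` are placeholders for the challenge data; note that the sketch applies SMOTE only to the training split so the test set stays untouched):

```python
# Sketch of the same methodology: oversample the training data with SMOTE,
# then fit a random forest and score accuracy and Cohen's kappa.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Oversample only the training split; the test set keeps its real class balance
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_res, y_res)
pred = rf.predict(X_test)

print("accuracy:", accuracy_score(y_test, pred))
print("Cohen's kappa:", cohen_kappa_score(y_test, pred))
```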
Here is my entry for this challenge, using the Ensemble Learner and Predictor nodes, which achieved 95% accuracy after tweaking the split criterion and the number of models.
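As a rough Python analogue of that tweak (the Ensemble Learner node has no exact scikit-learn equivalent, so this sketch uses a random forest and assumes pre-split `X_train`/`X_test` placeholders, as in the sketch above):

```python
# Sketch: sweep the split criterion and the number of models (trees)
# and report accuracy for each combination, as in the node configuration.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

for criterion in ("gini", "entropy"):
    for n_models in (50, 100, 200):
        clf = RandomForestClassifier(
            n_estimators=n_models, criterion=criterion, random_state=42
        ).fit(X_train, y_train)
        acc = accuracy_score(y_test, clf.predict(X_test))
        print(f"{criterion:8s} n={n_models:3d} accuracy={acc:.3%}")
```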
In the previous part, I had already tried XGBoost for performance tuning (even if it wasn't really the target of the last challenge).
Since tree methods look quite relevant for this dataset, boosted trees are very effective at optimizing accuracy/performance, and XGBoost is one of the best-performing gradient boosted tree models, XGBoost was the most interesting model for me here.
It combines good performance, explainability, and a relatively small footprint (compared to using AutoML, for example), so it is easy to use and deploy, and it provides fast computations.
Another idea I had in mind was to choose several tree models and fine-tune them with the Hyperparameter Optimization component; I might give it a try out of curiosity.
To compensate for the time saved, I played briefly with feature engineering and normalization (potentially a topic for future challenges?). I managed to get an accuracy of 95.80%. Unfortunately, I forgot to save and only have a screenshot to show for it.
By optimizing two hyperparameters of an XGBoost Tree model, without touching the raw data (no transformation, no SMOTE, etc., as mentioned in the challenge), I can get an accuracy very close to 96% (95.952%, to be exact).
I guess that’s worth a cigar ?
A new feature should come out soon that lets you require components not to save data at each node output inside the component. This should improve the issue you are facing. I also heard that the representation of workflows might move away from the current folder system at some point.
Until then, try writing the output of the AutoML component via a Workflow Writer node, then reset the workflow and save it. It should become lighter if size is an issue. In your case (60 MB), it should be all right for most systems.
The optimization was done on two hyperparameters, "Maximum Depth" and "Minimum Child Weight", with a Bayesian optimization (TPE) strategy.
I got the following optimized values for the hyperparameters: Maximum Depth = 10 and Minimum Child Weight = 4.802. As I didn't fix a random seed (shame on me), I'm not sure these exact results will be reproducible, but with this framework and the use of XGBoost you should be able to boost your accuracy results.
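As a rough Python equivalent of the parameter optimization loop, here is a sketch using the hyperopt library, which implements TPE (the search ranges are my own guesses rather than the component's settings, and `X_train`/`y_train` are placeholders):

```python
# Sketch: TPE (Bayesian) search over Maximum Depth and Minimum Child Weight,
# maximizing cross-validated accuracy on the training data.
from hyperopt import fmin, hp, tpe
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

space = {
    "max_depth": hp.quniform("max_depth", 2, 12, 1),          # integer-valued
    "min_child_weight": hp.uniform("min_child_weight", 1, 10),
}

def objective(params):
    model = XGBClassifier(
        max_depth=int(params["max_depth"]),
        min_child_weight=params["min_child_weight"],
        random_state=42,  # fix the seed so each trial is reproducible
    )
    # hyperopt minimizes, so return the negated mean CV accuracy
    return -cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()

best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
print(best)  # e.g. {'max_depth': 10.0, 'min_child_weight': 4.8}
```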
With a proper training-validation-test workflow (and still without data transformation, SMOTE, or anything else), again with XGBoost, fine-tuning the same two hyperparameters, and with (hopefully) a fixed random seed on all nodes, I get an accuracy of 95.802%, exactly what you had previously.
Definitely out of scope for this challenge in terms of the number of nodes, but still interesting for looking at model optimization.
I prefer this optimization version to the simple one with fine-tuning on the test set (@Rubendg), as the final accuracy obtained the simple way won't be a fair metric/assessment: the test set, which is supposed to be independent from a model's training and validation in order to properly assess its performance on unseen data, is used to fine-tune the model, so the assessment is not reliable, independent, or generalizable.
With cross-validation on the training set, the test set remains independent and is only used at the end for model assessment, so the performance/accuracy metric is reliable.
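In Python terms, the fair protocol looks roughly like this (a sketch with scikit-learn's GridSearchCV standing in for the optimization loop; `X` and `y` are placeholders, and the grid values are illustrative):

```python
# Sketch: hold out the test set first, tune via cross-validation on the
# training set only, then score the test set exactly once.
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

search = GridSearchCV(
    XGBClassifier(random_state=42),
    param_grid={"max_depth": [4, 6, 8, 10], "min_child_weight": [1, 3, 5]},
    cv=5,
    scoring="accuracy",
).fit(X_train, y_train)  # the test set plays no part in the tuning

# Single, final assessment on truly unseen data
print("test accuracy:", search.score(X_test, y_test))
```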
Still, you can remove all the model optimization nodes and keep just the 5 main nodes needed (with the optimized hyperparameters properly configured in the XGBoost training node) to enter the challenge with a high accuracy.
Oh wow! Really interesting to see how so many of you are already exploring optimization techniques and data engineering for this problem! We’ll be exploring this more in the coming weeks.
For now, here’s our solution for last week’s challenge. We used our AutoML verified component and the chosen model was Gradient Boosted Trees, in a technical tie with XGBoost Trees and H2O’s GBM. Note that it’s not possible to simply state that Gradient Boosted Trees generalizes as the best model for this problem: we’d need more data to draw a proper statistical analysis here, comparing overlaps across confidence intervals for different solutions, for example.