Solutions to "Just KNIME It!" Challenge 25

alinebessa · July 13, 2022, 1:20pm

This thread is for posting solutions to “Just KNIME It!” Challenge 25, the third part of our four-week series on data classification! How can we preprocess our training data to build a classifier that is a bit more balanced for both classes?

Here is the challenge: Just KNIME It! | KNIME

Feel free to link your solution from KNIME Hub as well!

And as always, if you have an idea for a challenge we’d love to hear it! Tell us all about it here.

si_daniel_a · July 13, 2022, 6:24pm

This is my solution for Challenge 25

I used the AutoML Component node from Challenge 24 and added 3 nodes in order to process and resample the given data:
X-Partitioner acts as the loop start node, where you also define how many validation folds (iterations) the model runs
SMOTE oversamples the minority class to get good classification performance for Churn = 1
X-Aggregator acts as the loop end node

This is the result after validating 20 times

My Cohen’s kappa is now above 80% while still maintaining overall accuracy of above 95%

Note: In my first run I used 10 validation iterations, which gave me a Cohen’s kappa of 0.799 and an accuracy of 88% for Churn = 1, so I presume that a higher number of validation iterations gives a better Cohen’s kappa and a better accuracy for Churn = 1.
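
For readers comparing the two metrics quoted in this thread, here is a minimal Python sketch of how accuracy and Cohen’s kappa are derived from a binary confusion matrix. The function name and the counts are hypothetical (they are not taken from any workflow posted here). They only illustrate why a model can reach high accuracy on imbalanced data and still have a low kappa.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Return (accuracy, kappa) for a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n  # plain accuracy
    # Agreement expected by chance, computed from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return observed, (observed - expected) / (1 - expected)


# Hypothetical test set: 1000 rows, 150 of them churners.
# A model that always predicts "no churn" is 85% accurate but has kappa 0.
print(cohens_kappa(tp=0, fp=0, fn=150, tn=850))

# A model that finds 120 of the 150 churners: accuracy 0.95, kappa about 0.798.
print(cohens_kappa(tp=120, fp=20, fn=30, tn=830))
```

Kappa subtracts the agreement a classifier would get by chance from the class proportions alone, which is why it reacts much more strongly than accuracy to how well the minority class (Churn = 1) is predicted.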

Rubendg · July 14, 2022, 11:03am

Here is a solution for the “Just KNIME It!” Challenge 25.
I used the Parameter Optimization node to tune two parameters of the model: “minChildSize” and “nrModels”. To test the overall accuracy for each parameter combination, I used the X-Partitioner for a k-fold (k=5) cross validation. The SMOTE node is used to create as many “Churn” rows as “Non-Churn” rows. The best parameters are then used to build the final model, which has an accuracy of 95% and a Cohen’s kappa of 0.783.
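
To make the balancing step concrete, here is a minimal sketch of the idea behind SMOTE in plain Python: each synthetic minority row is an interpolation between an existing minority row and one of its nearest minority neighbours. This is an illustration under assumptions, not the KNIME node’s implementation. The function name `smote_like`, the toy rows and the choice of k=3 are all hypothetical.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority points by squared distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy data: 20 "Non-Churn" rows and 4 "Churn" rows with two numeric features.
majority = [(float(i), float(i % 7)) for i in range(20)]
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.0)]

new_rows = smote_like(minority, len(majority) - len(minority))
balanced_minority = minority + new_rows
print(len(majority), len(balanced_minority))
```

Because the new rows are interpolated rather than duplicated, they stay inside the region spanned by the existing minority rows. In a cross-validation loop, this step would be applied to the training folds only.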

MEPivnenko · July 14, 2022, 11:53am

Changing random stuff until your program works is “hacky” and “bad Coding practice”, but if you do it fast enough it’s “Machine Learning” and pays 4x your current salary.

KnimeIT_25

Victor_G · July 14, 2022, 2:51pm

Hello KNIMErs,

My last submission from the previous challenge works pretty well for this part of the challenge, so here is the link to my previous contribution with the workflow and details:

Without any change to the data, by using and optimizing XGBoost, I’m able to achieve a Cohen’s Kappa of 0,825 (and an accuracy of 95,802%), a better score and a better prediction of the minority class than last challenge’s solution with the AutoML component.

I tried to add and test SMOTE, bootstrap oversampling and undersampling, but I obtained lower scores.

I will try some other data transformation options, perhaps with the AutoML component, and also try to figure out a statistical test or a fair comparison technique to highlight the benefits of data transformations.

lelloba · July 15, 2022, 7:08am

Hello,

here is my take of the challenge:

I got 95,4% and 0,806 for Cohen’s Kappa.

Have a nice day,
RB

rfeigel · July 17, 2022, 3:11pm

Here’s my solution. I added a SMOTE node to the workflow from Challenge 24, as well as some statistical nodes to compare the results from Challenges 24 and 25.

knime://My-KNIME-Hub/Users/rfeigel/Public/REF%20Challenge%2025.knwf

arddashti · July 18, 2022, 11:37am

Hello KNIMErs, Here is my solution for Challenge 25

#knime #justknimeit-25 #dataanalysis #datascience

alinebessa · July 18, 2022, 3:21pm

Interesting! This means that the optimization you tried was not implemented in the AutoML component. Good to know, and telling of how model engineering is “infinite”.

alinebessa · July 18, 2022, 3:25pm

Your workflow does a really good job at highlighting how the solutions are sensitive to parameter changes (here, a change in the seed you use to resample the data). Note that there’s a chance that these performances, even if a bit different, are statistically equivalent. Still fun to see how they vary here.

AnilKS · July 19, 2022, 2:42am

My take on Just Knime it - Challenge 25
Churn Prediction Part 3
I played around with various sampling techniques to raise the kappa along with the accuracy.

Victor_G · July 19, 2022, 8:05am

Hi @alinebessa,

In the AutoML component you can change the standard settings: which hyperparameters are optimized, the step size, and the optimization strategy, so that they are more relevant for a specific algorithm.
I found that, for XGBoost Tree Ensemble, the default settings for parameter optimization were :

Hyperparameters to optimize : max_depth (from 5 to 10, with a step size of 5… ?!) and eta (from 0,2 to 0,3 with a step size of 0,1),
Optimization strategy : Random Search

From my side, I have optimized hyperparameters of XGBoost Tree Ensemble with these settings :

Hyperparameters to optimize : Max depth (from 2 to 10, with a step size of 1) and min child weight (from 1 to 10 with a step size of 0,1)
Optimization strategy : Bayesian search
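
To put the two search spaces side by side, here is a small Python sketch that counts the grid points implied by the ranges above. It assumes both bounds are inclusive, which may not match how the optimization nodes treat them, and the helper `grid` is hypothetical.

```python
def grid(start, stop, step):
    """Inclusive range of values; the count is rounded to avoid float drift."""
    n = round((stop - start) / step) + 1
    return [round(start + i * step, 10) for i in range(n)]


# Default settings: max_depth 5..10 step 5, eta 0,2..0,3 step 0,1
default_grid = len(grid(5, 10, 5)) * len(grid(0.2, 0.3, 0.1))

# Custom settings: max depth 2..10 step 1, min child weight 1..10 step 0,1
custom_grid = len(grid(2, 10, 1)) * len(grid(1, 10, 0.1))

print(default_grid, custom_grid)  # prints: 4 819
```

Under that assumption the default space has only 4 combinations, while the custom one has 819. A search strategy such as Bayesian search samples this space rather than enumerating it.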

I think AutoML is a very good component for getting a quick first look and delivering a good first prototype, and/or for helping decide which algorithms are most relevant to the dataset. But once the algorithm is chosen, there is still room for optimization.

As mentioned, even with data transformation and resampling strategies, I wasn’t able to outperform these first results. Even with normalization and data transformation, the results were just above 0,8 for Cohen’s kappa, but lower than the 0,825 I had with optimized XGBoost, so I’m very curious to know how you achieve better results.

And I’m quite skeptical about some of the solutions here that mix the training and test sets. This is not good coding/ML practice, and it gives a biased, overestimated performance, since the test set is seen during training. It is a good example of data leakage, which ML practitioners should avoid. I can only recommend the excellent playlist “Making Friends with Machine Learning” by Cassie Kozyrkov, which explains the basics of Machine Learning and the algorithms:
MFML 069 - Model validation done right - YouTube
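
The ordering that avoids this kind of leakage can be sketched in a few lines of Python: hold out the test rows first, then resample the training rows only. This is a minimal illustration, not any of the posted workflows. It uses plain bootstrap oversampling (random duplication) rather than SMOTE, and the function names and the toy 15/85 class split are hypothetical.

```python
import random


def stratified_split(rows, test_fraction=0.2, seed=0):
    """Split (row_id, label) pairs per class so both sets keep the class ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (0, 1):
        group = [r for r in rows if r[1] == label]
        rng.shuffle(group)
        cut = int(len(group) * test_fraction)
        test += group[:cut]
        train += group[cut:]
    return train, test


def oversample_minority(train, seed=0):
    """Duplicate random minority rows until both classes have the same size."""
    rng = random.Random(seed)
    pos = [r for r in train if r[1] == 1]
    neg = [r for r in train if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train + extra


# Toy data: 100 rows, the first 15 are churners (label 1).
rows = [(i, 1 if i < 15 else 0) for i in range(100)]

train, test = stratified_split(rows)        # 1. hold out the test set first
train_balanced = oversample_minority(train)  # 2. resample the training set only

# No test row, or copy of a test row, ever reaches the training data.
assert {r[0] for r in test}.isdisjoint({r[0] for r in train_balanced})
```

If the resampling ran before the split, copies (or near-copies, with SMOTE) of the same minority rows could land on both sides, and the test score would be too optimistic.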

alinebessa · July 19, 2022, 1:22pm

What a thorough, thoughtful and generous comment, @Victor_G! Will definitely look into the playlist soon.

I’ll be posting a solution to this challenge in a few minutes and will address some of the questions you raised in your comment then.

Thanks for contributing to this community!

alinebessa · July 19, 2022, 1:54pm

Hi everybody,

As always on Tuesdays, here’s our solution to last week’s rather tough challenge!

As stated in the challenge, the idea was to optimize the model over the training data and then assess its performance over the test data, so we did not use cross validation over a combination of both datasets here. We also did not experiment further with model optimization (we’re still using the AutoML component here, but we did not try anything beyond it or play around with the component itself). Rather, we focused on data engineering. Since the AutoML component already does a fair amount of feature engineering, we worked on altering the data distribution by using sampling techniques. Our goal was to be didactic with these challenges, so we refrained from playing with many aspects of ML engineering at the same time.

As a result, we got a Cohen’s Kappa of 81.1% and an accuracy of 95.4% – both better than last week’s – but we haven’t evaluated the statistical significance of this change.
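
One simple way to check whether such a change is statistically meaningful is a paired bootstrap over the test rows. The sketch below is only one possible approach, not something that was run for this solution. The function names, the toy labels and predictions, and the choice of a 95% percentile interval are all assumptions.

```python
import random


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def paired_bootstrap_diff(y_true, pred_a, pred_b, metric, n_boot=2000, seed=0):
    """95% percentile interval for metric(B) - metric(A).

    The same resampled test rows are scored for both models, so the
    comparison is paired."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        a = metric(yt, [pred_a[i] for i in idx])
        b = metric(yt, [pred_b[i] for i in idx])
        diffs.append(b - a)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]


# Toy example: 100 test rows, 15 churners; model B fixes 5 of model A's misses.
y_true = [1] * 15 + [0] * 85
pred_a = [1] * 5 + [0] * 95
pred_b = [1] * 10 + [0] * 90

low, high = paired_bootstrap_diff(y_true, pred_a, pred_b, accuracy)
print(low, high)
```

If the interval excludes 0, the difference is unlikely to be due to the particular test rows alone. Any metric with the same signature, such as a Cohen’s kappa function, could be passed instead of `accuracy`.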

Stay tuned for this week’s challenge!

rfeigel · July 19, 2022, 6:42pm

The KNIME team solution may be analytically elegant, but it’s not practical. It takes hours to run.

alinebessa · July 19, 2022, 7:13pm

How much memory did you allocate to your KNIME Analytics Platform? It runs in about 15 minutes on my machine. Thanks for your feedback!

rfeigel · July 20, 2022, 1:36am

8 GB memory allocated

alinebessa · July 20, 2022, 1:59am

Hey, that’s exactly how much I have here. Do you mind telling me what version of the platform you’re using?

rfeigel · July 20, 2022, 2:32am

Mea culpa. I freed up some memory and it ran in 15 minutes.

rfeigel · July 20, 2022, 2:34am

Knime Platform 4.6.0