Decision tree learner seems to be overfitting

In addition to the (great) advice already given:
The Decision Tree Learner node lets you apply pruning, which can help against overfitting.
You might also give ensemble models (random forest, …) a try, which might help here.
@aworker is right.
It always helps to provide some sample data if possible.



@mlauber71 I tried turning off the average split point setting in the Decision Tree Learner configuration. However, that did not help either, and the model was still overfitting. Also, the other models supported by PMML, such as SVM, clustering, etc., are not suitable for my evaluation. I currently have a dataset of 5000 ECG values, some of which are classified as normal (1) and others as abnormal (0). From this dataset, I need to be able to predict the abnormal values correctly. Either I train the model using a 70-30 train-test split, or I train it on only the normal values and then test it on the whole dataset to predict the abnormal values. For that, a random forest, isolation forest or decision tree seems quite suitable, to my knowledge. If there are other suitable models, please do let me know, as I may have missed them.
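
For reference, the 70-30 split described above could be sketched in plain Python as follows. This is only an illustration, not what any KNIME node does internally. The function name `stratified_split`, the seed and the 60/40 class mix of the placeholder labels are assumptions, since the thread does not state the real class balance.

```python
import random
from collections import Counter


def stratified_split(labels, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx) with similar class proportions in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut each class at the same fraction.
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Placeholder labels only: 5000 rows with an ASSUMED 60/40 class mix
# (1 = normal, 0 = abnormal). The real balance is not given in the thread.
labels = [1] * 3000 + [0] * 2000

train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", len(train_idx), dict(train_counts))
print("test: ", len(test_idx), dict(test_counts))
```

Splitting per class keeps both normal and abnormal rows in the training part and in the test part, which matters when one class is much rarer than the other.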

Also @mlauber71, you said the MOJO format would also be an option, but the trainer node's output in MOJO format does not support PMML. So I am a little confused about how to move forward from here. I hope you can clarify this, thank you!

I have posted my dataset below, sorry for the inconvenience!

@Daniel_Weikert I did try pruning but that too does not seem to help :confused:

And if I am not wrong, only some ensemble models can be used to create a PMML model, right? And I think those models would not be suitable for my dataset and the goal I want to achieve.

Thank you for your replies everyone, I appreciate it.

Hi @akhtarameen

Thanks for the additional information.

I’m afraid I cannot see the data in your post. I believe it is missing :thinking:
It would really help to understand why the model is over-fitting.



Oh apologies. I completely forgot to attach the file! You can access the file using the following URL:

Thanks @akhtarameen for the data file. I’ll try to see what the problem is, based on the data and workflow.



Thanks @akhtarameen for the data file.

I tried your data with your workflow and these are the statistics I get on the test data:

It looks quite good to me.

What makes you think the DT is over-fitting? Could you please elaborate?

To me everything looks good :wink:




Thanks for testing it out. And oh wow, that looks good! My confusion matrix only has values in the first column; the second column is all 0. So I am assuming the predictor node is overfitting and assigning every data point to just one class. Can I see your workflow and your Decision Tree Learner configuration settings, if you don’t mind? That would really help me solve my issue.
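
A confusion matrix with one all-zero column means the model never predicts that class, which is not necessarily the same thing as overfitting. A small check for this symptom might look like the sketch below; the helper names and the labels are made up for illustration and are not taken from the workflow in this thread.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


def never_predicted(matrix, classes):
    """Classes whose column sums to zero, i.e. the model never outputs them."""
    column_sums = [sum(row[j] for row in matrix) for j in range(len(classes))]
    return [c for c, total in zip(classes, column_sums) if total == 0]


# Made-up labels for illustration (1 = normal, 0 = abnormal).
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1] * len(actual)  # a model that collapsed to a single class

classes = [1, 0]
matrix = confusion_matrix(actual, predicted, classes)
print(matrix)
print("never predicted:", never_predicted(matrix, classes))
```

If a class shows up in `never_predicted`, it is worth comparing accuracy on the training data with accuracy on the test data before concluding that overfitting is the cause.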


Please find it below

20220107 Decision tree learner seems to be overfitting.knwf (3.3 MB)




Alright, so I still have no idea why my workflow never worked. But yours seems to work just fine! Thank you again for your help!


You are welcome @akhtarameen :blush:

@akhtarameen the decision tree by @aworker does a wonderful job. I wanted to see what else is possible, so I took it up a notch. and XGBoost are able to do an even slightly better job, with more correct classifications (on the test set).

Admittedly, this would add a further level of complication, since it would use an ensemble of models to predict the outcome. If your data changes over time, very complicated models might also be less robust than a tree model. But in my experience a lot of systems (notably KNIME) are good at handling MOJO files, so this might be an option if explainability is not the most important thing.

Using this approach and preparing the values with vtreat first brought the number of misclassified rows in the test file down to 5, although that might be an accident. Or it might indicate that even more precision is achievable if you invest in data preparation.

Looking at variable importance, Column104, Column117 and Column105 are very important, but not in a way that would make them look like a leak. You might have to be careful about Column104, though, which carries something like 40% of all explanatory power, so to speak.
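
To make the "share of explanatory power" idea concrete, here is a minimal sketch that turns raw importance scores into shares and flags any feature above a threshold. The feature names, the scores and the 0.3 threshold are hypothetical placeholders, not the values behind the numbers quoted above.

```python
def importance_shares(raw_importance):
    """Normalize raw importance scores so that they sum to 1."""
    total = sum(raw_importance.values())
    return {name: value / total for name, value in raw_importance.items()}


def dominant_features(shares, threshold=0.3):
    """Names of features whose share is at least `threshold`, largest first."""
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    return [name for name, share in ranked if share >= threshold]


# HYPOTHETICAL scores for illustration only.
raw_importance = {
    "feature_a": 8.0,
    "feature_b": 3.0,
    "feature_c": 2.0,
    "feature_d": 4.0,
    "feature_e": 3.0,
}

shares = importance_shares(raw_importance)
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {share:.0%}")
print("dominant:", dominant_features(shares))
```

A feature that dominates like this is not automatically a leak, but it is a good candidate for a closer look at how the column was produced.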

Also, the AutoML leaderboard is full of GBM models, so this might indicate that (boosted) tree models are a good fit, with XGBoost taking the lead :slight_smile:


Oh nice. I did run the AutoML node as well but got somewhat different results. Maybe I will try it again and see. The important thing for me was to convert the trained model into PMML in order to test my model in Python. That is why I was limited to certain learner nodes in KNIME.

Furthermore, does anyone have experience importing a PMML model into Python and predicting from there? I tried doing so, but my confusion matrix in Python again gives me results similar to this:

Does anyone know how I can fix this? I feel like it is an issue with the way I am importing my PMML model which is as follows:

I tried another method of importing the PMML model which is as follows:

But importing the model using the second method gives me this error:

It would be really helpful if anyone can maybe help me fix this error or help me with the weird confusion matrix values I am getting!


Just for my understanding:
Why do you use KNIME and then export the model, instead of using Python directly, if you want to load it in Python anyway?

Thanks for your response! I want to compare two things: the accuracy of the model trained in KNIME, exported as PMML and then evaluated in Python, versus the accuracy I obtain when I train and test the model directly in Python.
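
Such a comparison could be sketched as below, assuming the predictions of both models have already been collected for the same test rows. All values are hypothetical. The cast to `int` is an assumption, in case one source returns labels as strings such as "1" while the other returns integers.

```python
def to_int_labels(labels):
    """Cast labels to int so that "1" and 1 compare as equal."""
    return [int(label) for label in labels]


def accuracy(actual, predicted):
    """Fraction of rows where the prediction matches the reference label."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


# Hypothetical values for illustration only (1 = normal, 0 = abnormal).
actual = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
pmml_predictions = to_int_labels(["1", "0", "1", "1", "0", "1", "1", "1", "1", "0"])
python_predictions = to_int_labels([1, 0, 1, 0, 0, 1, 0, 1, 1, 0])

print("PMML model accuracy:  ", accuracy(actual, pmml_predictions))
print("Python model accuracy:", accuracy(actual, python_predictions))
# Agreement between the two models, regardless of the true labels.
print("agreement:            ", accuracy(pmml_predictions, python_predictions))
```

The comparison is only meaningful if both models are scored on exactly the same test rows; otherwise differences in the split show up as differences in accuracy.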

@akhtarameen you can import MOJO models into KNIME, R, Python and Spark (and back). That is why I like to use this format - for example I can develop a model on an R machine and then deploy it on a big data cluster.

This collection actually contains a Jupyter notebook that basically does the same thing. So you might want to try that. There you can also see how to use relative paths in a notebook.

The different results might be due to different splits into test and training data, the data preparation with vtreat and a longer runtime for model building.


Hi @mlauber71, do you (by any chance) also have a detailed workflow using the PMML nodes (model appender, ensemble, …)? I have not found an example on the hub and was curious about that.
Thanks and best

@Daniel_Weikert well, in fact I had a collection that tried to pack some transformations into a PMML ‘collection’, then the model, and then apply just one compiled transformation and one for the model. But the whole concept seems to be buggy and crashed all the time. Maybe I will put that on the hub, just to play around with.

I think I once collected some Spark transformations into a PMML, compiled that and executed it successfully on a Big Data cluster.

But it seems these PMML things are not really well maintained …

OK, and since we are hijacking this thread, one additional piece of information for @akhtarameen: there is in fact a node that converts gradient boosted trees to PMML. So there might be another chance to get a more advanced model into PMML production (not sure how stable that would be …).


Thanks for sharing your experience. It seems I need to think twice before I decide to dive deeper into this rabbit hole full of bugs.


@Daniel_Weikert with PMML it is complicated. There are a lot of nodes that support PMML in principle, but not all of them warn you about an incompatibility, and sometimes the problem only occurs after you have saved the transformation/model and applied it to a new dataset, which is what you might have to do if you want to put it into production.

This is why I sometimes try to store data preparation steps in SQL code that has been written by KNIME. Maybe not the most elegant way but quite stable. And if you use PMML I would try only to use it step by step and store the individual transformations in a step.

On very large data tables on a big data system and if you have extensive operations like replacing strings with numbers (globally), then you might benefit from compiling transformations for Spark.

OK, since we are still hijacking @akhtarameen's thread :slight_smile: - another model ensemble which might be converted to PMML:

