Overfitting in Decision Tree

Hi All, @AlexanderFillbrunn, @mlauber71,

I have a decision tree, but it is overfitting. I have tried using the Parameter Optimization Loop Start and End nodes, but this is not helping at all. Is there any way to deal with this?

I would appreciate your help.

Thanks in advance!!!

Hi,
which parameters do you adjust in the loop?
Kind regards
Alexander

Hello @AlexanderFillbrunn,

In the Parameter Optimization Loop End node I have tried maximizing Cohen's Kappa.

Regards,
Chetan

Hi,
but which hyperparameters of the decision tree do you actually change in each iteration?
Kind regards
Alexander

Hi,

I have tried the minimum number of records per leaf, but I can only set that up to 15, not higher.

Also, if I try to maximize Cohen's Kappa, the above setting seems to have no effect.

Please let me know if I am doing anything incorrectly.

Regards,
Chetan

Hi,
can you share your workflow and data? Without any of that there is nothing really we can do to help you.
Kind regards
Alexander

I am sharing the workflow. Please note that I cannot share the data, as it is confidential.

Hi @ChetanP

Take a look at this workflow from the KNIME hub. It shows how to do the parameter optimization. Maybe you can use this as an example for your workflow.

gr. Hans

@HansS @AlexanderFillbrunn,

I have shared the workflow screenshot above. @HansS, I would like to optimize the parameters of the decision tree. I have tried optimizing the minimum number of records. Can you help me optimize other parameters, if there are any?

Hi,
What accuracy do you achieve with your current model? What is it on the training and test sets? Can you produce a line plot of accuracy vs. different values of min leaf size? If we don't know the data there is little we can say here, but of course we understand that confidentiality is of utmost importance. Have you enabled error pruning? The Decision Tree Learner does not offer any other options that help with overfitting, but min leaf size should be sufficient.
Kind regards
Alexander

@AlexanderFillbrunn,

I am getting 83% accuracy on training and 82% on testing. Accuracy seems fine.

The problem is with specificity and sensitivity: both are very low. Also, if I look at Cohen's Kappa, it is 0.121 on training and 0.048 on testing, so the model is clearly overfitting.

You mentioned producing a line plot of accuracy vs. different values of leaf size; how can I do this? Can I do it for Cohen's Kappa as well? Please share the workflow.
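(As context for why accuracy can look fine while Cohen's Kappa is near zero: on imbalanced data, a model that mostly predicts the majority class gets high accuracy "for free", and Kappa corrects for exactly that chance agreement. A minimal, purely illustrative Python sketch with scikit-learn; the class counts below are invented:)

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical imbalanced ground truth: 90 majority-class, 10 minority-class samples.
y_true = [0] * 90 + [1] * 10
# A "lazy" model that almost always predicts the majority class:
# 88 correct zeros, 2 false positives, 8 missed ones, 2 true positives.
y_pred = [0] * 88 + [1] * 2 + [0] * 8 + [1] * 2

acc = accuracy_score(y_true, y_pred)       # high, driven by the majority class
kappa = cohen_kappa_score(y_true, y_pred)  # much lower: little skill beyond chance
print(f"accuracy={acc:.2f}  kappa={kappa:.2f}")
```

The same effect explains 83% accuracy alongside a Kappa of 0.121: the majority class dominates the accuracy figure.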

Hi,
I don't have a workflow for the line plot, but it should be easy to build. You already have the parameter optimization loop, so you can simply collect the hyperparameter value and the Kappa at the loop end node, then attach a Line Plot node. Your data seems very heavily imbalanced, is that true? Maybe some over- or undersampling would help? Your Cohen's Kappa is low even on the training set. Have you tried other classifiers?
Kind regards
Alexander
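(For readers who want the same "collect metric per hyperparameter value, then plot" idea outside KNIME: a sketch in Python with scikit-learn, mirroring the parameter loop. The data set is synthetic and the imbalance weights and leaf-size grid are illustrative assumptions, not values from the thread:)

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import cohen_kappa_score

# Synthetic, imbalanced toy data standing in for the confidential data set.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# One row per loop iteration: (min leaf size, train kappa, test kappa).
# This table is what you would feed into a line plot.
results = []
for leaf in [1, 2, 5, 10, 15, 25, 50, 100]:
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0).fit(X_tr, y_tr)
    results.append((leaf,
                    cohen_kappa_score(y_tr, tree.predict(X_tr)),
                    cohen_kappa_score(y_te, tree.predict(X_te))))

for leaf, k_tr, k_te in results:
    print(f"min_samples_leaf={leaf:3d}  train kappa={k_tr:.3f}  test kappa={k_te:.3f}")
```

A widening gap between the train and test curves as the leaf size shrinks is the overfitting signature discussed above.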

@AlexanderFillbrunn,

Sharing a screenshot of the Parameter Optimization Loop End; when I select that, I am unable to execute the node. Yes, you are right, my data is highly imbalanced. Here, the number of records is the parameter that I have set.
[Screenshot: Capture1]

I would like to try a random forest, but I have no idea how to run and evaluate it.

Hi,
that is the wrong configuration. The maximize option just determines whether you maximize or minimize. Giving it the value of the Cohen’s kappa flow variable won’t work. You actually don’t have to use the “Flow Variables” tab at all. In the “Options” tab, select “Cohen’s kappa” from the ComboBox at the top and then just check “maximize” at the bottom.
Kind regards
Alexander

Hello,

Okay, I got that. But if I select that, then I won't be able to tune the parameter "Number of records". How can I do this? When I select maximum Cohen's Kappa and run the decision tree again, it overfits: many terminal nodes contain only 1, 2, or 3 records. Is there any way I can optimize Cohen's Kappa while still enforcing a minimum number of records?

Hi,
have you configured your Decision Tree Learner to actually use the value of the flow variable as the setting for the minimum number of records per leaf? You have to do that in the "Flow Variables" tab of the learner node.
Kind regards
Alexander

@AlexanderFillbrunn Yes, I have. When I select maximum Cohen's Kappa, that setting (minimum number of records) becomes redundant. I have set the number of records using the Parameter Optimization Loop Start.


Why does it become redundant? In the loop you adjust "Number of Records" in every iteration, measure Cohen's Kappa, and in the end you choose the setting that maximized it. But from the numbers you posted earlier, it seems like no amount of hyperparameter optimization will help you here. What does your data look like? How many data points, features, and classes do you have? How skewed is the class distribution exactly? What is the problem with using a random forest? You just have to insert the Random Forest Learner node in place of the Decision Tree Learner.
Kind regards
Alexander
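(For anyone wanting to try the random forest idea outside KNIME first: a minimal sketch with scikit-learn, evaluated with Cohen's Kappa as in this thread. The data is synthetic and every parameter value here is an illustrative assumption; `class_weight="balanced"` is one common way to counteract class skew:)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score, classification_report

# Synthetic imbalanced multi-class data standing in for the real data set.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=6,
                           n_classes=3, weights=[0.8, 0.15, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" up-weights rare classes during training.
rf = RandomForestClassifier(n_estimators=300, min_samples_leaf=5,
                            class_weight="balanced", random_state=0)
rf.fit(X_tr, y_tr)

kappa = cohen_kappa_score(y_te, rf.predict(X_te))
print("test kappa:", round(kappa, 3))
# classification_report also shows per-class precision/recall, which is
# where imbalance problems become visible.
print(classification_report(y_te, rf.predict(X_te)))
```

Evaluation works exactly as with the decision tree: predict on held-out data and compare training vs. test Kappa.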

@AlexanderFillbrunn,

Thanks for the clarification. I have only 11K records, with 300+ features and 6 classes.

My data is highly skewed. Let me try a random forest and see.

Thank you once again!!!

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.