Choose the best 'Min number records per node' for Decision Tree Learner

Hi all, I am new to this forum. I have this problem: I want to classify about 92,000 examples with 15 attributes, and I chose the Decision Tree. I would like to know whether there is a method, such as cross-validation, to choose the best value of 'Min number records per node'.


Hi Domenico,

you can do this optimization with our Parameter Optimization Loop Start and Parameter Optimization Loop End nodes.

You can install them via our KNIME Labs Extensions.

Cheers, Iris
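
For readers who want to see the idea behind such an optimization loop, here is a minimal Python sketch of a brute-force search. It is not the KNIME implementation: `brute_force_search`, `toy_score` and the start/stop/step values are hypothetical, and `toy_score` only stands in for "train a tree with this setting and score it on data not used for training".

```python
def brute_force_search(start, stop, step, score_fn):
    """Try every value from start to stop (inclusive) and return the best one.

    score_fn is a hypothetical callable: it should train a tree with the given
    minimum-records setting and return its accuracy on held-out data.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    results = []
    value = start
    while value <= stop:
        results.append((value, score_fn(value)))
        value += step
    # Highest score wins; ties go to the larger value, i.e. the smaller tree.
    best_value, best_score = max(results, key=lambda r: (r[1], r[0]))
    return best_value, best_score, results


def toy_score(min_records):
    """Stand-in for "train and score a tree"; the curve is made up."""
    return 0.90 - abs(min_records - 20) / 1000


best_value, best_score, results = brute_force_search(2, 50, 2, toy_score)
print(best_value, best_score)
```

In a real workflow the scoring step would be the learner, predictor and scorer sitting between the loop start and loop end.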

Hi Iris,

thanks a lot for the answer. I see two packages matching what you described: KNIME Optimization extension and KNIME Decision Tree Ensembles. Which of them is it?

It is the first one :-)


(But the second one contains a random forest, which is also pretty nice)

Thanks again. A question: I created this workflow and trained the Decision Tree on 80% of the data set (training set) and tested it on the remaining data (test set). I have some doubts that I hope you can help me clear up: is using the Partitioning node this way also called hold-out? And in which cases can I use cross-validation?


In the second image there is the confusion matrix from the Scorer, including the column for the reject option. Is it fair to calculate the error without considering the 'rejected' column?


Hold-out is one type of cross-validation. You can read more here

We also have dedicated Cross Validation nodes. However, cross-validation is a model validation technique; it is not used for parameter optimization.

Hm, your second question mainly depends on whether this value (rejected) is of any interest to you. But basically the Scorer takes all values into account for its accuracy.
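
To make the difference concrete, here is a small Python sketch that computes the accuracy both ways. The counts are hypothetical (they are not taken from the screenshot), and it assumes the 'rejected' column holds rows for which no class was predicted.

```python
# Hypothetical counts: outer keys are actual classes, inner keys are predictions.
confusion = {
    "yes": {"yes": 40, "no": 5, "rejected": 5},
    "no": {"yes": 10, "no": 35, "rejected": 5},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(row[actual] for actual, row in confusion.items())
rejected = sum(row.get("rejected", 0) for row in confusion.values())

# Rejected rows count as not correct here, so they lower this figure.
accuracy_all = correct / total
# Only rows that actually received a class prediction are considered here.
accuracy_accepted = correct / (total - rejected)
# Share of rows the model declined to classify.
rejection_rate = rejected / total

print(f"{accuracy_all:.3f} {accuracy_accepted:.3f} {rejection_rate:.3f}")
# prints: 0.750 0.833 0.100
```

If the error is reported without the rejected rows, it is probably fairest to state the rejection rate next to it, since the two numbers only make sense together.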

Thanks.

If I don't want to use the Parameter Optimization nodes, is it correct to use cross-validation, increasing 'Min number records per node' step by step and then drawing conclusions from the results?

No, this is not methodologically correct. If you do cross-validation, you leave one subset out of training, which is then used for testing. If you also change a parameter, you don't know whether the improved quality is due to the parameter or just a random effect of this specific training/test set combination.
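
One common way around this problem (a general practice, not something specific to KNIME or prescribed in this thread) is to keep a third partition that is never looked at while the parameter is being changed. A minimal Python sketch, where `three_way_split`, the 60/20/20 fractions and the seed are all assumptions for illustration:

```python
import random


def three_way_split(n_rows, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle row indices and cut them into train, validation and test parts."""
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must leave room for a test part")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train_frac)
    n_val = round(n_rows * val_frac)
    train = indices[:n_train]
    validation = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # untouched until the parameter is fixed
    return train, validation, test


train, validation, test = three_way_split(92000)
print(len(train), len(validation), len(test))
```

The parameter would be chosen by comparing scores on the validation part only, and the chosen setting is then scored once on the test part, so the final number is not influenced by the search.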

It seems that Parameter Optimization is the only way. So, where can I find an example of its implementation with the Decision Tree?