understanding of Decision Tree learner/prediction

It would be nice if someone could explain the Decision Tree learner and predictor to me.

Example: let's say I have a customer table with 200,000 rows and the following columns (possible values in brackets):

id, age (0-150), gender (male, female), profession (self-employed, student, unemployed, other, employee), monthly income (0-100000)

Let's say I want to use Decision Tree prediction to predict which group will buy something (a computer, a car, or whatever). Before I can do that, the Decision Tree learner needs training data.

First question: how big should the training data be? (10%?)

Then I add a new column to the training data, fill in yes/no (buys the item or not), give that file to the learner, and after that apply the model to the whole data set.


Well, the training set can be any size, but in general the more data you train on, the more accurate the model is likely to be.

So you feed the learner node the input variables (i.e. age, gender, etc.) along with the column that records whether the item was bought. But you want to hold some of this data back (about 10%), so don't put all of it into the learner. You can then feed the held-back rows into the predictor node as the test set and compare the predicted result (predicted buy yes/no) with the actual result (actually bought yes/no). You can quantify this comparison using the Scorer node.
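For anyone who wants to see the hold-out idea outside of KNIME, here is a minimal Python sketch. It is not what the KNIME nodes do internally; the function names `holdout_split` and `score`, the toy rows, and the stand-in `predict` rule are all made up for illustration.

```python
import random
from collections import Counter

def holdout_split(rows, test_fraction=0.1, seed=42):
    """Shuffle a copy of rows and hold back test_fraction of them as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

def score(actual, predicted):
    """Return (accuracy, confusion counts keyed by (actual, predicted))."""
    confusion = Counter(zip(actual, predicted))
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    return correct / len(actual), confusion

# Hypothetical labelled rows: (age, buys)
rows = [(age, "yes" if age >= 30 else "no") for age in range(18, 68)]
train, test = holdout_split(rows)

# Stand-in for a trained model (a real learner would be fitted on `train`).
def predict(age):
    return "yes" if age >= 35 else "no"

actual = [buys for _, buys in test]
predicted = [predict(age) for age, _ in test]
accuracy, confusion = score(actual, predicted)
print(accuracy, dict(confusion))
```

The confusion counts play the role of the Scorer node's table: each (actual, predicted) pair shows how often the model was right or wrong for that class.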

If you are happy with the results from the test set, then you can feed in some new data to the predictor node and get a predicted result. And based on your test data outcome, you can decide how reliable the predicted data is likely to be.

There are many other decision tree learners and predictors in the Weka section. Use the Weka predictor node in combination with one of the Classification Learner nodes.

There are also some advanced features in the Meta section: Feature Elimination and X-Validation. Feature Elimination helps work out which factors (e.g. age or gender) have the most influence on the outcome. X-Validation helps assess the accuracy of the model by repeatedly holding out a different part of the known data as the test set.
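To make the cross-validation idea concrete, here is a small Python sketch of how the known data can be cut into k folds so that every row is used for testing exactly once. The function name `k_fold_indices` is made up, and this only illustrates the partitioning, not the KNIME X-Validation nodes themselves.

```python
def k_fold_indices(n_rows, k=10):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_rows))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train_idx, test_idx in k_fold_indices(10, k=5):
    print("train:", train_idx, "test:", test_idx)
```

In practice you would shuffle the rows first, train a model on each training part, score it on the matching test part, and average the k scores.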

But to get started, the above is good to be going on with.

Hope this helps,


I'm trying to do something similar to orsol; I usually use 30%-40% of the data for training. But I've noticed that the decision tree (learner and predictor) doesn't let me choose which variables to use in the model, so my model is using the client_ID as an input.

I filtered out the id column to see the prediction, but then I can't "recover" the id...

Is there a way/node (before the decision tree learner) where I can choose the incoming data for the model?

On the other hand, once I have an efficient model, is there a way of seeing the actual algorithm? In an SQL version, perhaps?

Thank you!!! 


The learner automatically uses all the columns in the table, so you are right that you need to filter out the Client ID.

You can recover it afterwards using the Joiner node: join by RowID the output of the Decision Tree nodes with the data from before the Column Filter.
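If it helps to picture the RowID join, here is a rough Python sketch with the tables held as dictionaries keyed by RowID. The helper names `drop_columns` and `join_by_row_id`, the sample rows, and the stand-in prediction rule are all hypothetical; in KNIME the Column Filter and Joiner nodes do this for you.

```python
# Original table keyed by RowID, with the id column still present (made-up values).
original = {
    "Row0": {"id": 1001, "age": 34, "gender": "female"},
    "Row1": {"id": 1002, "age": 21, "gender": "male"},
    "Row2": {"id": 1003, "age": 58, "gender": "male"},
}

def drop_columns(table, columns):
    """Return a copy of the table without the given columns (like a column filter)."""
    return {row_id: {k: v for k, v in row.items() if k not in columns}
            for row_id, row in table.items()}

def join_by_row_id(left, right):
    """Inner join of two RowID-keyed tables; right's values win on name clashes."""
    return {row_id: {**left[row_id], **right[row_id]}
            for row_id in left if row_id in right}

filtered = drop_columns(original, {"id"})

# Stand-in for the predictor output: the filtered rows plus a prediction column.
predicted = {row_id: {**row, "prediction": "yes" if row["age"] >= 30 else "no"}
             for row_id, row in filtered.items()}

joined = join_by_row_id(original, predicted)
print(joined["Row0"])
```

The join works because filtering columns does not change the RowIDs, so each prediction can be matched back to the row that still carries the id.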

Hope this helps


You can also see the algorithm used in the view output of the learner node.


Thanks Simon, you've been very helpful.

I've come across a slightly bigger problem with the decision tree. When I use the predictor on new data (I use historical data for training), 99% of the predictions are "?" (missing). I've checked the new file and it has the same data inputs. I can't understand what I'm doing wrong. Any ideas?




Hi EF,

Without seeing the data, we can only guess. Can you outline which kinds of variables your data currently contains?

Our current decision tree learner predicts a missing value (i.e. "don't know") if the attribute's value is unknown at the last node reached.

This is based on the current hard-coded notruechildstrategy.

With the next release there will be an option for changing this behavior.
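As a rough illustration of that behavior (a toy Python sketch, not KNIME's actual code; the tree layout and the `predict` function are invented for this example), a tree walk can return "missing" whenever the row's value matches none of the children of the node it has reached:

```python
# Toy tree: the root splits on the nominal column "profession".
tree = {
    "split_on": "profession",
    "children": {
        "student": {"prediction": "no"},
        "employee": {"prediction": "yes"},
    },
}

def predict(node, row):
    """Walk the tree; return None (missing) when no child matches the row's value."""
    while "prediction" not in node:
        value = row.get(node["split_on"])
        child = node["children"].get(value)
        if child is None:
            return None  # no true child: the value is missing or was never seen
        node = child
    return node["prediction"]

print(predict(tree, {"profession": "student"}))   # a known category
print(predict(tree, {"profession": "Student"}))   # different spelling, no matching child
```

Note that a value spelled differently from the training data counts as unknown here, just like a genuinely missing value.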

Best, Iris



Hi Iris,

I've solved the problem: it was predicting missing values because one of the input variables didn't have the same categories/values as in the training data.
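For anyone hitting the same issue, a quick check along these lines can find the mismatched column. This is only a Python sketch; the function name `unseen_values` and the sample rows are made up.

```python
def unseen_values(train_rows, new_rows, columns):
    """For each column, return the values present in new_rows but absent from train_rows."""
    report = {}
    for col in columns:
        seen = {row[col] for row in train_rows}
        unseen = {row[col] for row in new_rows} - seen
        if unseen:
            report[col] = unseen
    return report

# Hypothetical data: gender is coded differently in the new file.
train_rows = [
    {"gender": "male", "profession": "student"},
    {"gender": "female", "profession": "employee"},
]
new_rows = [
    {"gender": "M", "profession": "student"},
    {"gender": "F", "profession": "employee"},
]

print(unseen_values(train_rows, new_rows, ["gender", "profession"]))
```

Any column that shows up in the report contains categories the tree never saw during training, which is a likely source of "?" predictions.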

thank you!!! 

Hi again,

This time I have a question about the predictor node. It lets you append a column of probabilities, which is great, but I'd like to know how it calculates this distribution. And is there a way to see in the tree the probability "assigned" to each branch?

thank you!!!

'Variable selection in Decision tree learner'


Hi :)

My name is sunkyung.

I work in finance as a junior consultant in Korea.

What I am curious about is 'Variable selection in Decision tree learner'.

What method is used to choose the variable when splitting a node?

I don't know how to choose the variables I want.

Sorry, my English is not very good. Thank you for reading!


Hi sunkyung,

Automated "variable selection" is called "feature selection" in KNIME, so if that's what you're after, those are the nodes to check out. If you're more concerned about your manual filters (like never training on case IDs), look into the "enforce inclusion" and "enforce exclusion" settings: new columns will then either always be ignored or always be accepted. Other changes to the data in existing variables can be trickier, so you should re-assess your models frequently through re-training.

Hope this helps,

Hi EF,

A decision tree recursively partitions the training data. The leaves of the tree represent the final partitions, and the probabilities the predictor assigns are defined by the class distributions of those partitions.

Maybe a little example can help:

Let's assume we have two classes A and B, and a leaf partition that contains 10 training rows.
6 of those rows are of class A and 4 are of class B. Then the probability for class A in this leaf is 0.6 and for class B 0.4. The predictor node assigns those probabilities to all rows that end up in this specific leaf.
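The same arithmetic as a tiny Python sketch (the function name `leaf_probabilities` is made up; it simply computes the relative class frequencies described above):

```python
from collections import Counter

def leaf_probabilities(labels):
    """Return each class's share of the training rows that ended up in a leaf."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# The leaf from the example: 6 rows of class A and 4 rows of class B.
leaf = ["A"] * 6 + ["B"] * 4
probs = leaf_probabilities(leaf)
print(probs)
```

Every new row that is routed to this leaf would get these same two probabilities.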

If you open the view of the decision tree learner, you will see a tree with multiple nodes. Each node contains a table that shows the class distribution of the training rows in the partition it represents.

Hope this helps,