I read in a previous thread that the KNIME native Decision Tree Learner uses the C4.5 algorithm. I would like to find out in more detail whether it is true that a categorical variable will not be repeated in a tree, while a quantitative variable can be.

For example, if a record has the attribute cp=4, it should not be split again later on cp=1, 2, or 3, right?

Hi @ctienche

A Decision Tree does not repeat a categorical variable along the branches of a recursive split because the logical rules built down the tree are an AND combination of single-variable conditions, for instance:

`$Fare$ <= 7.4 AND $Fare$ <= 10.79 AND $Embarked$ = "S" AND $Fare$ <= 23.25 AND $Pclass$ > 2.5 AND $Sex$ = "female" => "YES"`

This is a real DT rule generated to classify the Titanic dataset.

Given that the logical rule can only be made of AND operators, the Decision Tree cannot use the same categorical variable twice with two different categorical values.

For instance, in the case of the Titanic dataset, no logical rule can be built using an AND combination of `$Embarked$ = "C" AND $Embarked$ = "Q"`

because the leaf would obviously be empty.
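To make the empty-leaf argument concrete, here is a minimal Python sketch. The sample records and the `matches` helper are made up for illustration and are not KNIME output; the sketch only applies an AND of equality conditions to a few rows.

```python
# Hypothetical sample of Titanic-like records (illustrative, not real data).
records = [
    {"Embarked": "S", "Sex": "female"},
    {"Embarked": "C", "Sex": "male"},
    {"Embarked": "Q", "Sex": "female"},
    {"Embarked": "C", "Sex": "female"},
]

def matches(record, conditions):
    """Return True if the record satisfies every (column, value) equality test."""
    return all(record[column] == value for column, value in conditions)

# Two different values of the same categorical variable, combined with AND.
contradictory = [("Embarked", "C"), ("Embarked", "Q")]
empty_leaf = [r for r in records if matches(r, contradictory)]

# A rule that uses each categorical variable only once can still match rows.
consistent = [("Embarked", "C"), ("Sex", "female")]
non_empty_leaf = [r for r in records if matches(r, consistent)]

print(len(empty_leaf), len(non_empty_leaf))
```

No record can have two values of `Embarked` at once, so the first rule selects nothing regardless of the data.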

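This does not depend on the DT algorithm version but on the nature of the variables and the way DT rules are built using AND operators. Note that it holds when each categorical condition tests for a single value, with one branch per value. If the learner instead makes binary splits on groups of values, the same categorical variable can appear again further down with a narrower group.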

This is not the case for numerical variables because they are ordered (their values can be sorted), so at one level of the DT a branch can have the condition `$Age$ > 18.5`

and later in a connected branch have the condition `$Age$ <= 64`

to end up with a leaf rule equal to `$Age$ > 18.5 AND $Age$ <= 64 => NO`

for survival.
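As a small illustration of why repeated numerical conditions remain satisfiable, the sketch below applies the two `Age` thresholds from the rule above to a handful of made-up ages. The `ages` list and the `in_leaf` helper are assumptions for the example only.

```python
# Hypothetical ages (illustrative values, not taken from the Titanic dataset).
ages = [4, 17, 18.5, 30, 64, 71]

def in_leaf(age):
    """AND of two conditions on the same numerical variable: an interval."""
    return age > 18.5 and age <= 64

# The two thresholds narrow the range instead of contradicting each other.
leaf = [age for age in ages if in_leaf(age)]

print(leaf)
```

Each additional threshold on the same numerical variable just tightens the interval, so the leaf can still contain records.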

Hope it helps.

Best

Ael
