KNIME Decision Tree Learner configuration

Hello, I just started working with KNIME. I have a data set that is fed into the Decision Tree Learner and its Decision Tree Predictor; the predictor then plugs into a Scorer. How do I interpret their outputs, and what do the configuration options of the tree learner mean? I literally do not know what the predictor is trying to tell me. Can anyone point me to a resource or offer an explanation? That would be very helpful.

Thank you

Good Morning

 

Have you looked at the node description? The Decision Tree Learner has a comprehensive explanation of the configuration options and a link to a paper describing most of the functionality (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.152&rep=rep1&type=pdf). 

 

I assume you are trying to do classification. The Scorer node reports a number of performance measures for the predictions that have been made. 

 

Try the following configuration:

 

1) Class column: the column containing the value you wish to predict

2) Quality measure: Gini index (see the short sketch after this list for what this measures)

3) Pruning method: MDL (reduces the size of the tree by pruning)

4) Min number of records per node: 5 (this was an arbitrary choice; I don't have your data)
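
For intuition on what the Gini index measures, here is a plain Python sketch of the standard Gini impurity formula (my own illustration, not KNIME's internal code):

    from collections import Counter

    def gini_impurity(labels):
        # Gini impurity = 1 - sum(p_i^2) over the class proportions.
        # 0.0 means the node is pure; higher means more mixed.
        counts = Counter(labels)
        total = len(labels)
        return 1.0 - sum((n / total) ** 2 for n in counts.values())

    print(gini_impurity(["yes"] * 10))              # 0.0 -> pure node
    print(gini_impurity(["yes"] * 5 + ["no"] * 5))  # 0.5 -> maximally mixed for 2 classes

The learner prefers splits that reduce this impurity the most.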

 

Leave anything not explicitly mentioned at its default, and then build the tree. 
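
If it helps to see roughly the same configuration outside KNIME, here is a scikit-learn analogue (an assumption on my part: KNIME's learner is its own implementation, and sklearn has no MDL pruning, so cost-complexity pruning stands in for it here):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # criterion="gini"    ~ Quality measure: Gini index
    # min_samples_leaf=5  ~ Min number of records per node: 5
    # ccp_alpha=0.01 prunes the tree (cost-complexity pruning, standing in
    # for KNIME's MDL pruning -- they are different methods)
    tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5, ccp_alpha=0.01)
    tree.fit(X_train, y_train)
    print(tree.score(X_test, y_test))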

 

The output of this node is the PMML model plus two views representing the tree. If you want to investigate the tree, have a look at the views. 

 

You then use the Predictor node on new data to make predictions. If you are doing validation, you can then use the Scorer node to compare the experimental value to the predicted value to get measures of accuracy, sensitivity, specificity, etc. 
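
In case it is unclear how the Scorer gets from a confusion matrix to those numbers, this is the arithmetic for a binary case (plain Python, my own sketch rather than the Scorer's code):

    def binary_scores(actual, predicted, positive="yes"):
        # Tally the four cells of the confusion matrix.
        tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
        tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
        fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
        fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
        return {
            "accuracy": (tp + tn) / len(actual),
            "sensitivity": tp / (tp + fn),  # true positive rate (recall)
            "specificity": tn / (tn + fp),  # true negative rate
        }

    actual    = ["yes", "yes", "no", "no", "yes", "no"]
    predicted = ["yes", "no",  "no", "yes", "yes", "no"]
    print(binary_scores(actual, predicted))
    # accuracy 4/6, sensitivity 2/3, specificity 2/3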

 

Cheers

 

Sam

Is there a way to find the most accurate model by looping the Decision Tree Learner/Predictor with various train/test partitions?

Try the Parameter Optimization Loop nodes in the optimization category.

I tried the Parameter Optimization Loop Start node and chose 3 steps (1 to 3 in increments of 1). Then at the Loop End I selected to maximize the accuracy. I also passed the random seed into the Partitioning node as a flow variable; however, at the end of both loops, the accuracy for all 3 iterations is the same. Attached is a picture of my workflow. Am I skipping a step? Does passing/selecting a random seed mean that the partition is fixed even if I try to partition the data differently?

Hi Tangerooo

 

I assume this relates to your issue in the other thread?

 

Is this the goal: build an optimised model on different partitions and investigate the model-building approach's stability to changing training data? 

 

What setting the seed does: "If either random or stratified sampling is selected, you may enter a fixed seed here in order to get reproducible results upon re-execution. If you do not specify a seed, a new random seed is taken for each execution.". If you select a static seed you will always get the same partition given the same input table. 
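
As a toy illustration of why a fixed seed always produces the same partition (Python's random module here stands in for the Partitioning node's sampler, which is an assumption about the mechanism, not KNIME's code):

    import random

    def partition(rows, fraction=0.7, seed=None):
        rng = random.Random(seed)  # a fixed seed makes the shuffle reproducible
        shuffled = rows[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * fraction)
        return shuffled[:cut], shuffled[cut:]

    rows = list(range(10))
    print(partition(rows, seed=42))  # same split on every call
    print(partition(rows, seed=42))  # identical to the line above
    print(partition(rows))           # no seed: differs from run to run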

 

Do you have a list of random numbers to use as the seed for the Partitioning node? I assume this is so you can fix the seed within the parameter optimization loop? Alternatively, if you did the parameter optimisation after the Partitioning node, you could just get the Partitioning node to generate the random seed and report which one it used. It turns out it doesn't fill in the value unless you specify one, though. 

 

What parameters are you trying to optimise using the optimisation loop? What is being incremented by 1? 

 

I've put together an example workflow of what I think you are trying to do. I've used the Iris dataset to learn and predict. I've set 5 different seeds pseudo-randomly, and it iterates over these values. For each iteration of the outer loop a new partition will be made. The partition of the data is the same for each inner-loop run of parameter optimisation. 

I've set one variable to be optimised: the minimum node size. It starts at 1, ends at 5, and increments by 1.

At the end a table is created containing the accuracy of the model for the 5 values of min node size for each seed. 
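
Written out in Python, the loop structure looks like this (scikit-learn stands in for the KNIME learner/predictor, and the five seed values are placeholders; only the ranges come from the workflow description):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    results = []
    for seed in [11, 23, 37, 58, 71]:      # outer loop: one partition per seed
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=0.7, random_state=seed)
        for min_node in range(1, 6):       # inner loop: min node size 1..5
            tree = DecisionTreeClassifier(min_samples_leaf=min_node)
            tree.fit(X_tr, y_tr)
            results.append((seed, min_node, tree.score(X_te, y_te)))

    for seed, min_node, acc in results:    # the table produced at the end
        print(seed, min_node, round(acc, 3))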

 

Workflow and screenshot attached.

 

Good luck

 

Sam

 

Yes, it's related to the other thread. I tried your workflow and also looked at your results from the Iris dataset. It seems like the accuracy results still end up being the same? I think what's happening is that if a random seed is chosen AND a partition percentage is chosen, then the partition is always the same regardless of any loop or rerun.

Ah, that was silly of me. I scanned the table top to bottom not right to left when checking. 

 

The problem is that the Iris dataset is too easy to learn from, and the models are stable under the changing partition and the change in parameter.

 

I've switched to the glass dataset, turned off pruning, and increased the number of steps in the optimisation loop. The performance of the model now differs.

 

I also double-checked that the Partitioning node is behaving properly, and it is indeed working fine. It produces 4 different partitions (1 for each seed). When you move to the next iteration of the outer loop you will get a new partition. You will have the same partition for every run of the parameter optimization loop. 

 

You can save a CSV file containing your partitioned data each iteration if you want to ensure you are getting a different split.
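
For example, inside the outer loop of the earlier sketch you could dump each split to disk and compare the files (pandas here, and the file naming is my invention):

    import pandas as pd

    def dump_split(X_tr, y_tr, seed):
        # One CSV per seed, so differing partitions are easy to spot.
        df = pd.DataFrame(X_tr)
        df["class"] = y_tr
        df.to_csv(f"train_seed_{seed}.csv", index=False)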

 

Regards

 

Sam

 

 

Hi Sam,

I want to do a parameter optimization with the decision tree algorithm, similar to the one in this thread. I use the parameter optimization loop nodes for parameters like number of models, minimum split node size, etc., which require integers. However, there are other options in the dialog regarding the algorithm, such as split criterion (a drop-down menu selection) and use mid point splits (a checkbox). Is it possible to loop over them in the optimization process?

I saw in some other threads that it is possible, but there was no detailed explanation. If it is not too much to ask, could you please post an example workflow optimizing the dialog options I mentioned above?

 

Regards

 

Bora

I am new to decision trees. I have a data set containing the columns category (coded as movies, automotive, toys, electronics and collectables), currency, ratings, opening date, and competitive (coded as 0 and 1).

In the Decision Tree Learner node, I want to select competitive as the class column, but because it is coded as 0 and 1, KNIME treats it as a continuous variable. My questions are:

1) How do I make the 0s and 1s a categorical variable?

2) Category contains many different values. Do I need to create dummy variables for each of the categorical predictors?

Thank you.

Joanie
 

Hi, 

I am brand new to KNIME. I built a decision tree (and I used the example workflow as well) but I can't see the result anywhere. I believe the model was built (the console says so) but I don't know where exactly I can see the image of the tree and the table with the statistics or score. 

 

Does anyone know?