Random Forest algorithm flow variable documentation

Are definitions of the flow variables in KNIME modeling algorithms available?
Here is what I want to do. I heard from someone at KNIME that some modeling algorithms accept the input training data set and automatically divide it internally in a 50:50 ratio into training and testing sets for iterative building of the model. I would like to know which algorithms do this, and how I can change the ratio. Is this done via a flow variable like the dataFraction flow variable in Random Forests? If so, what is the syntax?

Hi @dataminer_1 -

Not sure exactly which algorithms you might be referring to, but the typical practice is to explicitly split the data set using a Partitioning node, using whatever ratio you specify, and then branch the workflow so that your training set is fed to a learner node, and the test set to a predictor node.
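For readers who think in code, the partition-then-branch pattern can be sketched outside KNIME. This is only a plain-Python illustration of a random split with a user-chosen ratio, not what the Partitioning node does internally; the function name `partition` and its parameters are made up for the example.

```python
import random

def partition(rows, train_fraction, seed=0):
    """Shuffle a copy of rows and split it into (train, test).

    Hypothetical helper for illustration; it is not a KNIME API.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                    # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)    # seeded, so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))                      # stand-in for 100 table rows
train, test = partition(rows, train_fraction=0.8, seed=42)
print(len(train), len(test))                 # 80 20
```

The first piece would go to the learner and the second to the predictor, so the ratio is whatever you pass in rather than something fixed.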

There are several examples of this on the Hub, like the one below that uses a Decision Tree. Of course, you could substitute whichever classification algorithm you prefer by swapping out nodes.

Does that answer your question?

Scott:

I am aware of the usual method of partitioning, and I use it. IBM SPSS Modeler (which I have used since 1995) divides the input data set into 3 pieces: a training set, a testing set, and a validation set. Both the training and testing sets are submitted to the algorithm, and the validation set is used to evaluate the model. I asked Michael Berthold at a KNIME conference how to implement the formation of 3 data sets. He said that the KNIME machine learning algorithms (like Random Forest) divide the input data automatically in a 50:50 ratio to generate the internal training and testing sets used for evaluating the error after each iteration of the algorithm. The bottom output of the Partitioning node is to be used as the validation data set. Modeler gives me the option to control the percentage division of the input data set to generate all 3 data sets. KNIME allows control of the split between the modeling and validation data sets only; the training and testing set ratio is hard-wired, with no control of the ratio. The full control in Modeler is useful; I have used it several times in the past. I wondered if KNIME could provide that option. Your reply makes me wonder whether my understanding of Michael’s answer is correct. If it is not, is there some other way to control the ratio? Maybe I missed the presence of a flow variable that can do it. If there is no way to do it, you might consider adding that feature in the future.

This is an important question, because I use KNIME for my classes in Data Science at the University of California at Irvine. I need to make sure that I am training my students properly.

Bob Nisbet PhD
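The three-way division described in the post above (training, testing, validation, each with a chosen share) can be sketched as two chained random splits. In KNIME terms this would presumably correspond to feeding one output of a Partitioning node into a second Partitioning node, but that mapping is an assumption and is not confirmed anywhere in this thread. The helper name `partition` and the fractions are hypothetical.

```python
import random

def partition(rows, first_fraction, seed=0):
    """Randomly split rows into (first, second); first gets first_fraction.

    Hypothetical helper for illustration; it is not a KNIME API.
    """
    if not 0.0 < first_fraction < 1.0:
        raise ValueError("first_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(1000))                     # stand-in for 1000 table rows

# Split 1: hold out 20% of everything as the validation set.
modeling, validation = partition(rows, first_fraction=0.8, seed=1)

# Split 2: divide the remaining modeling data 75:25 into training and testing.
# Relative to the full table that is 60% training and 20% testing.
train, test = partition(modeling, first_fraction=0.75, seed=2)

print(len(train), len(test), len(validation))   # 600 200 200
```

Note that the second fraction is relative to the modeling set, not the full table, so the overall shares are the product of the two ratios.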

Hi Bob,

Even for a Decision Tree? I would have to check which algorithm they are using. Off the top of my head, I wouldn’t know what they are using it for. I know that some publications use this during the pruning procedure, but in general, there’s no need for an additional split of the training data. When talking about Random Forest, it might be related to the out-of-bag error computation. But in that case, our implementation doesn’t use fixed ratios but shows a row only to the trees that haven’t seen it during training.
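To illustrate the out-of-bag idea in general terms, here is a rough plain-Python sketch. It is not KNIME's implementation: the "tree" is a stand-in that just predicts the majority label of its bootstrap sample, and all names (`out_of_bag_votes`, `oob_error`) are invented for the example. The point is only that each row is scored by the trees whose sample did not contain it, so no fixed split ratio is involved.

```python
import random
from collections import Counter

def out_of_bag_votes(labels, n_trees=25, seed=0):
    """Collect, for every row, the votes of trees that never saw that row."""
    rng = random.Random(seed)
    n = len(labels)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        # Bootstrap sample: n row indices drawn with replacement.
        bag = [rng.randrange(n) for _ in range(n)]
        in_bag = set(bag)
        # Stand-in "tree": predicts the majority label of its own bag.
        prediction = Counter(labels[i] for i in bag).most_common(1)[0][0]
        for row in range(n):
            if row not in in_bag:            # this tree did not train on the row
                votes[row][prediction] += 1
    return votes

def oob_error(labels, votes):
    """Fraction of rows misclassified by the majority of their OOB votes."""
    scored = [(label, v) for label, v in zip(labels, votes) if v]
    if not scored:
        return float("nan")                  # no row was ever out of bag
    wrong = sum(1 for label, v in scored if v.most_common(1)[0][0] != label)
    return wrong / len(scored)

labels = ["a"] * 30 + ["b"] * 10             # toy class labels
votes = out_of_bag_votes(labels, n_trees=25, seed=0)
error = oob_error(labels, votes)
print(error)
print(sorted({sum(v.values()) for v in votes}))  # OOB tree counts differ per row
```

The number of trees that score a given row varies from row to row, which is why there is no ratio to set for this kind of internal error estimate.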

Could you point us to the Modeler documentation? That might help us understand what they are doing or referring to.

Best,
Stefan
