Tree Ensemble Learner - different results 3.1.2 and 3.3.2

lilia1 · July 21, 2017, 1:41pm

Some time ago I created a workflow with an RF model using the Tree Ensemble Learner in KNIME version 3.1.2. I recently tried to re-create the model in version 3.3.2, but the Tree Ensemble Learner node has changed. I have tried a number of different options that should match the settings in the previous version, but could not create the same model using the same training dataset.

What settings should I use in the new version of the Tree Ensemble Learner node to reproduce the model I created previously? Thank you.

nemad · July 25, 2017, 12:02pm

Hello lilia1,

Yes, we changed quite a lot around the Tree Ensemble Learner in 3.2, but in principle all results should be approximately reproducible. Approximately, because random forests are, as their name states, based on randomness, and even with the same random seed the results might change slightly. That is because the newly introduced functionality can influence the order in which things happen. Further, I recall that there was a bug concerning nominal features which might also impact results.
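To illustrate why the same seed does not guarantee the same model, here is a minimal Python sketch. It is not the KNIME implementation; the function names, and the assumption that a learner draws a bootstrap sample of rows and a random subset of features from one shared random generator, are purely illustrative. It only shows that changing the order of the random draws changes what is drawn, even though the seed is identical.

```python
import random

def rows_then_features(seed, n_rows, n_features, k):
    """Hypothetical 'old' order: draw the bootstrap rows first, then the features."""
    rng = random.Random(seed)
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]  # sample rows with replacement
    features = rng.sample(range(n_features), k)            # feature subset without replacement
    return rows, features

def features_then_rows(seed, n_rows, n_features, k):
    """Hypothetical 'new' order: same draws, but the features are drawn first."""
    rng = random.Random(seed)
    features = rng.sample(range(n_features), k)
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    return rows, features

SEED = 42
old_run = rows_then_features(SEED, n_rows=1000, n_features=50, k=7)
old_rerun = rows_then_features(SEED, n_rows=1000, n_features=50, k=7)
new_run = features_then_rows(SEED, n_rows=1000, n_features=50, k=7)

print("same code, same seed  ->", old_run == old_rerun)
print("reordered, same seed  ->", old_run == new_run)
```

The same code with the same seed is repeatable, but as soon as the sequence of random draws is reordered, the training sample (and therefore the tree) is different.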

The bottom line is, if you are trying to exactly (in the sense of getting for every row exactly the same probabilities) reproduce your results from 3.1, you will have to use the deprecated nodes (the fact that we can't guarantee exact reproducibility is part of the reason the node was deprecated).

Otherwise I will need to know your exact configuration in 3.1 in order to tell you an equivalent configuration for 3.3 (or 3.4).

The most likely reason for the deviation is the use of binary nominal splits; you can switch back to multiway splits by turning off the respective checkbox in the tree options tab.

Another big change involves missing value handling but as the node failed on missing values before, it is not likely that this is the reason in your case.

If you have other questions, feel free to ask.

Cheers,

nemad

lilia1 · July 27, 2017, 12:32pm

Hi nemad,

Thank you for the reply. I have "switched" back to the deprecated node, and this generated the same model (as I was hoping). To reduce randomness I used the "enable static seed" option in the Equal Size Sampling node in version 3.1 (a screenshot below):

For the Tree Ensemble Learner I used the following settings (version 3.1):

I am using RDKit descriptors for the RF model, and all the values are numerical (double or integer), so "binary nominal splits" should not affect the model building process. I also do not have missing values.

I know it is possible to extract a PMML model, which is another possible option to "transfer" the old model into a new version of KNIME. But if there is an option to rebuild the model in the new version, that would be my preferred option.

nemad · July 28, 2017, 2:14pm

Hello lilia,

OK, now I better understand what you want to do.

I guess all you can do is try whether using the same static random seed yields the same model, but there is no guarantee for that (in the new implementation the order in which operations are performed might be different, and thus also the sampling). Since what you want to do is not affected by the changes, this might work (unfortunately this is just a gut feeling). Otherwise your best choice is to extract the decision tree from the random forest and use it as a stand-alone decision tree (it's a single-model ensemble after all) in the new KNIME version.

I am sorry for the inconvenience. If you used a larger forest, the differences between the old and new model would be much smaller. In any case, if your forest consists of only a single tree, it will terribly overfit your data, as decision trees are low-bias models. If you use the model only for understanding your data, this might be useful, although I am not sure whether the sampling in the random forest will be beneficial here.
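The point about larger forests can be sketched with a toy simulation in Python. This is only an illustration under stated assumptions, not a KNIME computation: each "tree" is modelled as a noisy probability estimate around a made-up true value (`true_p` and `noise` are hypothetical parameters), and a "forest" averages its trees. Two forests built from different random streams stand in for the old and the new implementation.

```python
import random
import statistics

def forest_prediction(seed, n_trees, true_p=0.7, noise=0.3):
    """Average of n_trees noisy per-tree probability estimates (toy model)."""
    rng = random.Random(seed)
    votes = [min(1.0, max(0.0, rng.gauss(true_p, noise))) for _ in range(n_trees)]
    return sum(votes) / n_trees

def mean_gap(n_trees, n_repeats=200):
    """Mean absolute difference between two independently seeded forests."""
    gaps = [
        abs(forest_prediction(2 * i, n_trees) - forest_prediction(2 * i + 1, n_trees))
        for i in range(n_repeats)
    ]
    return statistics.mean(gaps)

for n in (1, 10, 100):
    print(f"{n:>3} trees: mean gap between the two forests = {mean_gap(n):.3f}")
```

In this toy setting the gap between the two forests shrinks as the number of trees grows, because averaging more trees reduces the influence of any single random draw.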

In case you rely on the single-model random forest in order to build an ensemble that is learned with equal-size sampling, it might be worthwhile to have a look at the additional sampling possibilities in the latest Tree Ensemble Learner (which include random, stratified and equal-size sampling).
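For readers unfamiliar with the term, equal-size sampling can be sketched in a few lines of Python. This is a minimal sketch, assuming that equal-size sampling means downsampling every class to the size of the smallest class; the function name `equal_size_sample` and the example data are hypothetical and are not taken from KNIME.

```python
import random
from collections import defaultdict

def equal_size_sample(rows, label_of, seed):
    """Downsample each class to the size of the smallest class (assumed behaviour)."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    target = min(len(group) for group in by_class.values())
    sample = []
    for label in sorted(by_class):  # fixed class order keeps the random draws stable
        sample.extend(rng.sample(by_class[label], target))
    return sample

# Made-up, imbalanced example data: 90 inactive and 10 active compounds.
data = [(i, "inactive") for i in range(90)] + [(i, "active") for i in range(90, 100)]

balanced = equal_size_sample(data, label_of=lambda row: row[1], seed=1)
counts = {label: sum(1 for _, lab in balanced if lab == label) for label in ("active", "inactive")}
print(counts)
```

Training one tree per differently seeded balanced sample and averaging the predictions is, in effect, an ensemble learned with equal-size sampling.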

Cheers,

nemad

lilia1 · July 31, 2017, 11:42am

Hi nemad,

Thank you for the reply. I created 1 tree each time, but I put it in a loop of 100, so effectively it is a model containing 100 trees. I will follow your suggestion to extract the model and use it in the new version of KNIME. Thank you again for the explanation and advice.

Thanks,

Lilia