I have a regression problem and am comparing different types of regression trees (simple regression tree, random forest, gradient boosted tree and the tree ensemble learner). The data I use has been joined, filtered and preprocessed in many ways. For this reason I use a Domain Calculator and drop the possible values and min/max values of all attributes.
I found differences between models trained with and without the Domain Calculator (using cross-validation). Most models score considerably lower with the Domain Calculator; the gradient boosted tree in particular shows a serious decline in the scored statistics. One model does not seem to work with the Domain Calculator at all: the simple regression tree. What is the reason for this? Do some models use domain information to learn the model, while the simple regression tree learner does not?
The nodes you use are all related, as they share the same code for data handling and tree building. The only difference between them is that they build different kinds of ensembles (random forest and tree ensemble are actually pretty much the same, but the tree ensemble gives you more options and is therefore more complex).
To answer your question: all of the mentioned nodes handle domain information in the same way, which makes this behavior very odd.
Do the nodes give any kind of warning message like "x column(s) were ignored due to missing domain..."?
By default, the nodes drop columns without domain information, so if most of your columns lack domain information, most of them are not used for training, and it is not surprising that the resulting models score considerably lower.
You also mentioned that the simple regression tree could not be built. Can you provide the error message?
I hope we can figure this one out.
Thanks for your reply. After some further investigation I discovered that the differences are caused by the "Restrict number of possible values" option in the Domain Calculator nodes. The results are now the same.
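For anyone running into the same thing, here is a minimal Python sketch of how a limit on possible values could interact with a learner that skips columns without domain information. It is not KNIME's actual code: `compute_domain`, `usable_columns`, the example table and the limit of 60 are all hypothetical and only illustrate the idea from the reply above.

```python
def compute_domain(values, max_possible_values=None):
    """Return the sorted distinct values of a nominal column.

    Assumption for this sketch: if the number of distinct values exceeds
    the limit, no domain is stored at all (None).
    """
    distinct = sorted(set(values))
    if max_possible_values is not None and len(distinct) > max_possible_values:
        return None
    return distinct


def usable_columns(table, max_possible_values=None):
    """Keep only the columns a domain-dependent learner could still use."""
    return [
        name
        for name, values in table.items()
        if compute_domain(values, max_possible_values) is not None
    ]


# Hypothetical data: one low-cardinality and one high-cardinality column.
table = {
    "region": ["north", "south", "east", "west"] * 25,
    "product_id": [f"p{i}" for i in range(100)],
}

unrestricted = usable_columns(table)                       # no limit
restricted = usable_columns(table, max_possible_values=60)  # illustrative limit

print("without limit:", unrestricted)
print("with limit:   ", restricted)
```

In this sketch the high-cardinality column loses its domain once the limit is applied, so a learner that ignores such columns would train on fewer features, which would be consistent with the lower scores described above.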