Hi, I am running a tree ensemble / random forest for a regression problem. The target column I am trying to predict does not have a Gaussian distribution. I trained a few promising models using a random partition with a fixed random seed, but when I remove the seed to create new training, cross-validation, and test sets, the models don't perform as well.

I am wondering if anyone has advice on how to partition this type of dataset where the target column is skewed. I am trying a stratified partition: is there a way to stratify on a continuous variable? Ideally I would like a holistic partition of my data (a few points from each part of the distribution). Since the dataset isn't very large, I suspect that in some random partitions the training set ends up localized in one part of the distribution, so the model doesn't know how to deal with the very different data in the cross-validation or test set.

For reference, I got a good model with an r^2 of 0.8 on the cross-validation/test set. That model was built with a random partition and a fixed random seed, and the seed seems to have given a fair distribution of my target variable in the training set. When I run the same model 10 times with completely random partitions, the r^2 drops to 0.6.

Also, is there a specific way I should be measuring the effectiveness of my model? I was building multiple new models with random partitions so that each model is exposed to new training, cross-validation, and test sets, to see how the models handle this.
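Outside of any particular tool, one common way to stratify on a continuous target is to bin it into quantiles and then sample within each bin, so every region of the distribution is represented in both partitions. Here is a minimal NumPy sketch of that idea (the function name and parameters are my own, just for illustration):

```python
import numpy as np

def stratified_split_continuous(y, test_frac=0.2, n_bins=10, seed=0):
    """Split indices so each quantile bin of a continuous target y
    contributes proportionally to both the train and test sets."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    # Quantile edges: each bin holds roughly len(y) / n_bins points,
    # even when y is heavily skewed.
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, n_bins - 1)
    train_idx, test_idx = [], []
    for b in range(n_bins):
        idx = np.where(bins == b)[0]
        rng.shuffle(idx)
        n_test = max(1, int(round(test_frac * len(idx))))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)
```

With scikit-learn available, the same effect can be had by passing the bin labels (e.g. from `pandas.qcut`) to the `stratify` argument of `train_test_split`.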
To get something similar to stratified sampling for a continuous target variable, I would first sort the table by the target variable (Sorter node) and then use the Partitioning node with the "linear sampling" option. That way both the training and test sets will get a few points from each part of the distribution.
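The sort-then-linear-sampling approach amounts to systematic sampling: after sorting by the target, every k-th record goes to the test set, so both partitions span the full range of the distribution. A rough NumPy equivalent of what those two nodes do (hypothetical function name, just to show the mechanics):

```python
import numpy as np

def linear_partition(y, test_frac=0.2):
    """Sort by the target, then take every k-th sorted record for the
    test set, so both partitions cover the whole distribution."""
    order = np.argsort(y)          # row indices sorted by target value
    k = int(round(1 / test_frac))  # every k-th sorted record -> test
    test_idx = order[::k]
    train_idx = np.setdiff1d(order, test_idx)
    return train_idx, test_idx
```

Note that this split is deterministic for a given table; if you want some randomness while keeping the coverage, you can randomly pick one record per consecutive block of k sorted rows instead.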