Random sampling in cross-validation

Hi,


The X-Partitioner node supports creating folds by randomly sampling the rows; however, the dataset will change for each iteration. Would it be possible to specify a random number seed so that each run produces the same result? I would highly appreciate it if such a feature could be added, because it is very important to obtain consistent results across different runs of the same workflow.

Best regards,
Sebastian Pölsterl

Hi Sebastian,

You are, of course, right about the importance of reproducibility, but you can achieve that pretty easily with KNIME: simply choose "linear sampling" in the X-Partitioner node and it will use consecutive partitions as test data sets. If you want to make sure the data is shuffled randomly before you do that, you can use the "Shuffle" node, which does allow you to specify the random seed.
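
For anyone who wants to see the idea outside of KNIME, here is a minimal Python sketch of "shuffle with a fixed seed, then take consecutive partitions". It is only an illustration of the principle, not how the Shuffle or X-Partitioner nodes are implemented; the function name `shuffled_linear_folds` and its parameters are made up for this example.

```python
import random

def shuffled_linear_folds(n_rows, n_folds, seed):
    """Shuffle row indices with a fixed seed, then cut them into consecutive folds."""
    indices = list(range(n_rows))
    # A dedicated, seeded generator gives the same row order on every run.
    random.Random(seed).shuffle(indices)

    # "Linear sampling": fold i is the i-th consecutive block of the shuffled rows.
    # The first `extra` folds get one additional row when n_rows is not divisible.
    base, extra = divmod(n_rows, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds

if __name__ == "__main__":
    for fold in shuffled_linear_folds(n_rows=10, n_folds=3, seed=42):
        print(fold)
```

Calling the function twice with the same seed returns identical folds, which is the reproducibility property asked for; changing the seed gives a different but equally repeatable split.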

But I'll put this request onto our enhancements list; it would be convenient to be able to do this within the X-Validation nodes directly as well.

Cheers, Michael

Thanks a lot Michael.

I tried your suggestion of using linear sampling and shuffle, which works fine. However, when I select stratified sampling, I end up with different results on each run. Is this a known issue?

Best regards,
Sebastian

Hi Sebastian,

Stratified sampling also uses a random number generator, so yes, this is the same issue. If you are really running on data sets so small that this is an issue, you'll need to work around it by building your own cross-validation loop; otherwise you can simply stratify first (using the Equal Size Sampling node with a given seed) and then run cross validation on consecutive partitions. We'll add the option of a fixed seed to the X-val node set.
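
As a rough illustration of what a hand-built, reproducible stratified loop could look like, here is a small Python sketch. It is an assumption about one reasonable way to do it, not the algorithm the KNIME nodes use: rows are grouped by class, each group is shuffled with a seeded generator, and the rows are then dealt out to the folds in turn. The name `stratified_folds` and its parameters are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds, seed):
    """Assign row indices to folds so each fold has a similar class mix."""
    rng = random.Random(seed)  # fixed seed: same assignment on every run

    # Group row indices by class label (groups keep first-seen order).
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    offset = 0
    for rows in by_class.values():
        rng.shuffle(rows)
        # Deal this class's rows round-robin, continuing where the previous
        # class stopped so leftover rows do not all land in the first fold.
        for pos, idx in enumerate(rows):
            folds[(offset + pos) % n_folds].append(idx)
        offset += len(rows)
    return folds

if __name__ == "__main__":
    labels = ["a"] * 6 + ["b"] * 3
    for test_rows in stratified_folds(labels, n_folds=3, seed=7):
        train_rows = [i for i in range(len(labels)) if i not in test_rows]
        print("test:", sorted(test_rows), "train:", train_rows)
```

Each fold serves once as the test set while the remaining rows form the training set. Because the only source of randomness is the seeded generator, repeated runs on the same input give the same folds.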

Michael