Partition node in linear regression

Hi, I’m running a linear regression model with an 80/20 partition split for my training/test sets. I’ve noticed that whenever I reset the partition node, the r^2 value of my model changes drastically: sometimes I get an r^2 of 0.8, and at other times it is as low as 0.3. Is there a way to partition the data that eliminates this randomness? Alternatively, if the randomness can’t be eliminated, is there a node that averages the results over, say, 5 iterations of partitioning and running the model? My dataset is not too small, as it has 154 items.

Hi @Haseeb23, welcome to the forum.

What kind of sampling are you using? This is crucial information.

If you’re using random or stratified sampling, then you should use a random seed if you want the results to be reproducible.

The description of the node says this:

Use random seed
If either random or stratified sampling is selected, you may enter a fixed seed here in order to get reproducible results upon re-execution. If you do not specify a seed, a new random seed is taken for each execution.
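
As a rough illustration of what a fixed seed does, here is a minimal Python sketch. It is not KNIME's implementation: the `partition` helper and the seed value 42 are made up for the example, and it only shows that the same seed yields the same 80/20 split every time.

```python
import random


def partition(n_rows, train_fraction=0.8, seed=None):
    """Randomly split the row indices 0..n_rows-1 into (train, test).

    Hypothetical helper for illustration only. With seed=None a fresh
    random state is used on every call, so the split changes each time.
    """
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(round(n_rows * train_fraction))
    return indices[:cut], indices[cut:]


# Two executions with the same (arbitrary) seed give identical partitions.
train_a, test_a = partition(154, seed=42)
train_b, test_b = partition(154, seed=42)
same_split = (train_a == train_b) and (test_a == test_b)

print("identical split:", same_split)
print("train rows:", len(train_a), "test rows:", len(test_a))
```

Note that a fixed seed only makes one particular split repeatable. It does not tell you whether that split happens to be a favourable or an unfavourable one.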


Hi @elsamuel, I am using random sampling right now. I selected the random seed option, and the partition now stays the same on every execution. Thank you very much for your help!!

If your results change this much depending on which rows end up in the training set, you might not be able to get a stable model. Consider using cross-validation to check how stable your results really are. Also, 154 items is a very small dataset, BTW.
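
To make the cross-validation idea concrete, here is a small Python sketch of 5-fold cross-validation for a one-variable linear regression. Everything in it is an assumption for illustration: the data are synthetic, the helper names (`fit_line`, `r_squared`, `cross_validated_r2`) are invented, and a real workflow would use the original 154 rows and all of its predictors. In KNIME itself, I believe the usual way to build this is a loop with the X-Partitioner and X-Aggregator nodes rather than code.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(xs, ys, slope, intercept):
    """R^2 of the fitted line, evaluated on the given (held-out) rows."""
    my = statistics.fmean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def cross_validated_r2(xs, ys, k=5, seed=0):
    """Return one held-out R^2 per fold; every row is tested exactly once."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal-sized folds
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        scores.append(r_squared([xs[i] for i in fold],
                                [ys[i] for i in fold],
                                slope, intercept))
    return scores


# Synthetic stand-in for the real data: 154 rows, linear signal plus noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 10) for _ in range(154)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0, 3.0) for x in xs]

scores = cross_validated_r2(xs, ys, k=5, seed=0)
mean_r2 = statistics.fmean(scores)
spread = statistics.stdev(scores)

print("per-fold R^2:", [round(s, 3) for s in scores])
print("mean R^2:", round(mean_r2, 3), "std dev:", round(spread, 3))
```

The mean gives a single, less split-dependent estimate of performance, and the standard deviation across folds shows how sensitive the model is to which rows it was trained on. A large spread, like the 0.3 to 0.8 range described above, suggests the model is not stable on this amount of data.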
