Partition node in linear regression

Haseeb23 · July 7, 2020, 3:02am

Hi, I’m running a linear regression model with an 80/20 partition split for my training/test sets. I’ve noticed that whenever I reset the partition node, the r^2 value of my model changes drastically. For example, sometimes I get an r^2 of 0.8, and at other times it is as low as 0.3. Is there a way to partition the data that eliminates this randomness? Alternatively, if the randomness can’t be eliminated, is there a node that averages the result over, say, 5 iterations of partitioning and running the model? My dataset is not too small, as it has 154 items.

elsamuel · July 7, 2020, 7:15am

Hi @Haseeb23, welcome to the forum.

What kind of sampling are you using? This is crucial information.

If you’re using random or stratified sampling, then you should use a random seed if you want the results to be reproducible.

The description of the node says this:

Use random seed
If either random or stratified sampling is selected, you may enter a fixed seed here in order to get reproducible results upon re-execution. If you do not specify a seed, a new random seed is taken for each execution.
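To illustrate the idea outside KNIME, here is a minimal Python sketch of a seeded 80/20 split. It is not how the node is implemented; the function name `partition_indices` and the seed value 42 are made up for the example. The point is only that a fixed seed gives the same shuffle, and therefore the same partition, every time.

```python
import random

def partition_indices(n_rows, train_fraction=0.8, seed=None):
    """Shuffle row indices and split them into train and test lists.

    Hypothetical helper for illustration; with seed=None a fresh
    random state is used on every call, so the split changes.
    """
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = int(round(n_rows * train_fraction))
    return indices[:n_train], indices[n_train:]

# Two runs with the same (arbitrary) seed give identical partitions.
train_a, test_a = partition_indices(154, seed=42)
train_b, test_b = partition_indices(154, seed=42)
print("same partition:", train_a == train_b)
print("train rows:", len(train_a), "test rows:", len(test_a))
```

Note that a fixed seed only makes the result repeatable. It does not tell you whether that particular split is a lucky or an unlucky one.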

Haseeb23 · July 7, 2020, 2:46pm

Hi @elsamuel, I am using random sampling right now. I tried it with the random seed option selected, and the partition stayed the same on every iteration. Thank you very much for your help!!

mlauber71 · July 7, 2020, 4:51pm

If your results change this much depending on the selection you are training on, you might not be able to get a stable model. You might have to think about some cross-validation to check that your results are reasonably stable. And 154 items is a very small dataset, BTW.
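As a rough sketch of what cross-validation does here, the following Python example runs 5-fold cross-validation for a one-variable linear regression and reports the r^2 on each held-out fold along with the mean and spread. Everything in it is an assumption for illustration: the data is synthetic (154 rows generated from a made-up line plus noise), and the helper names `fit_line`, `r_squared` and `cross_validated_r2` are hypothetical. It is plain Python, not a KNIME workflow.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line on (xs, ys)."""
    mean_y = statistics.fmean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def cross_validated_r2(xs, ys, k=5, seed=0):
    """Return the held-out r^2 for each of k folds."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint test folds
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        scores.append(r_squared([xs[i] for i in held_out],
                                [ys[i] for i in held_out],
                                slope, intercept))
    return scores

# Synthetic stand-in data (assumption): 154 rows, noisy linear relation.
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 10) for _ in range(154)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0, 3) for x in xs]

scores = cross_validated_r2(xs, ys, k=5, seed=0)
print("per-fold r^2:", [round(s, 3) for s in scores])
print("mean:", round(statistics.fmean(scores), 3),
      "stdev:", round(statistics.stdev(scores), 3))
```

Every row is used for testing exactly once, so the mean is less dependent on one lucky or unlucky split, and a large spread across folds is itself a sign that the model is not stable on this data.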
