Y-randomisation

Subha_D · August 18, 2021, 11:28am

Hello,

I need to build a y-randomised model for my QSAR models. Should I use both the training and test data for that, or only the test data? I am a little confused, as I’ve seen people use just the test data for a y-randomised model with the Target Shuffling node.

Thank you

aworker · August 18, 2021, 11:40am

Hi @Subha_D

If your aim is to know whether your QSAR model (without y-randomization) is performing better than random, it is only the test set y values that you need to randomize (shuffle).

Hope this helps.

Best

Ael

Subha_D · August 18, 2021, 12:47pm

Hello @aworker ,

Thank you, it helps! Will I have to partition the test data again…?
This is how I’ve done it:

Is this correct?

Thank you,
Subha

aworker · August 18, 2021, 1:06pm

Hi @Subha_D

You are welcome. I’m afraid it doesn’t look correct to me. Perhaps I wasn’t clear enough: you are shuffling before partitioning, so the training set gets shuffled too. Could you upload and share your workflow here? Thanks in advance.

Best

Ael

aworker · August 18, 2021, 1:39pm

Hi @Subha_D

Below you’ll find a snapshot of what I meant:

If the Numeric Scorer results on your non-shuffled test data are close to those on your Y-shuffled test data, then your regression model is not doing better than random.
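
For readers working outside KNIME, the same comparison can be sketched in plain Python. This is only an illustration under stated assumptions, not the workflow from the snapshot: the synthetic one-descriptor data, the helper names `fit_line` and `r2`, and the 70/30 split are all made up for the example. The key point it shows is the order of operations: partition first, train on the untouched training set, then shuffle only the test-set y values before scoring.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit for a single descriptor; returns (slope, intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r2(y_true, y_pred):
    """Coefficient of determination, one of the usual regression scores."""
    my = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


rng = random.Random(0)

# Hypothetical data: one descriptor x and an activity y = 2x + 1 + noise.
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]

# 1) Partition FIRST, so the training set is never shuffled.
idx = list(range(len(xs)))
rng.shuffle(idx)
train, test = idx[:140], idx[140:]

# 2) Train on the untouched training data.
slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])

# 3) Predict on the test set.
y_test = [ys[i] for i in test]
y_pred = [slope * xs[i] + intercept for i in test]

# 4) Shuffle only the test-set y values and score both versions.
y_test_shuffled = y_test[:]
rng.shuffle(y_test_shuffled)

r2_real = r2(y_test, y_pred)
r2_shuffled = r2(y_test_shuffled, y_pred)
print(f"R2 on real test y:     {r2_real:.3f}")
print(f"R2 on shuffled test y: {r2_shuffled:.3f}")
```

If the two scores come out close to each other, the model is not doing better than random; a clear gap between them is what you hope to see.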

Hope this helps.

Best

Ael

Subha_D · August 18, 2021, 1:46pm

Hello @aworker ,

I was just trying to upload my workflow, but it is quite big. Yes, this helps a lot, thank you!
Will I need to run many iterations for this or is running this once enough?

Thank you,
Subha

aworker · August 18, 2021, 2:27pm

Hi @Subha_D

Don’t worry. The important thing is that you got the idea.

If your data set is big enough, you don’t need a K-fold cross-validation at first to know whether your model is performing well; a single train/test validation, as in my straightforward example, is good enough. However, a K-fold CV is always more rigorous and informative, because you can then calculate the mean and standard deviation of the prediction results.

20210818 Pikairos Y-randomization.knwf (355.7 KB)

I’m posting here both solutions (with and without K-Fold CV)
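
As a rough idea of what the K-fold variant computes, here is a minimal Python sketch. It is not a transcription of the attached workflow: the synthetic data, the helper names (`fit_line`, `r2`, `kfold_y_shuffle`) and the choice of k = 5 are assumptions for illustration only. Each fold is scored twice, once against the real test y values and once against shuffled ones, and the mean and standard deviation are then taken over the folds.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit for a single descriptor; returns (slope, intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r2(y_true, y_pred):
    """Coefficient of determination."""
    my = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def kfold_y_shuffle(xs, ys, k=5, seed=0):
    """Return per-fold R2 lists: (real test y, shuffled test y)."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint test folds
    real, shuffled = [], []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        # Train on the unshuffled training part of this fold.
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        y_test = [ys[i] for i in fold]
        y_pred = [slope * xs[i] + intercept for i in fold]
        # Shuffle only the held-out y values.
        y_shuf = y_test[:]
        rng.shuffle(y_shuf)
        real.append(r2(y_test, y_pred))
        shuffled.append(r2(y_shuf, y_pred))
    return real, shuffled


# Hypothetical data: one descriptor x and an activity y = 2x + 1 + noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0, 1.0) for x in xs]

real, shuffled = kfold_y_shuffle(xs, ys, k=5)
for name, scores in (("real", real), ("shuffled", shuffled)):
    print(f"{name:9s} mean R2 = {statistics.fmean(scores):.3f}, "
          f"std = {statistics.stdev(scores):.3f}")
```

Having a mean and a spread for both cases makes it easier to judge whether the gap between the real and the shuffled scores is larger than the fold-to-fold variation.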

Hope this helps.

Thanks for having validated the answer !

Good luck and best wishes,

Ael

Subha_D · August 18, 2021, 2:55pm

Hello @aworker ,

Thank you very much for your help!

SD
