Parameter Optimization, two X-Partitioners and potential Data Leakage

Question for the pros: How high is the risk of data leakage in the setting shown? Or is it sufficient to use different fixed seeds for the two X-Partitioner nodes?

Thanks in advance
Christian

Hi @Christian_Essen,

I think that using two cross-validation loops on the entire dataset risks data leakage. The second round of cross-validation is not a true test, since every row was already used in the first round to tune parameters.

I would suggest the following:

  • First, split your data once, for example, into 70% for training and 30% for testing.
  • Then, run cross-validation only on the 70% training set to find the best parameters.
  • Train the final model on that 70% and test it on the untouched 30%.
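Outside of KNIME, the same hold-out workflow can be sketched with scikit-learn (illustrative only; the dataset, model, and parameter grid here are placeholders, not from the original workflow):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 1. One fixed split: 70% for training, 30% held out for the final test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# 2. Cross-validation for parameter tuning on the training set ONLY
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, None]},
    cv=5)
search.fit(X_train, y_train)

# 3. The best model (refit on the full 70%) is evaluated once on the
#    untouched 30% - this score is free of tuning leakage
print("best params:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```

The key point is that the 30% test rows never enter the grid search, so the final score is an honest estimate.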

Alternatively, you can use nested cross-validation: an inner CV loop tunes the parameters, and an outer CV loop estimates the performance of the whole tuning procedure, so no test fold is ever seen during tuning.
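A minimal nested-CV sketch in scikit-learn, again with placeholder data and model (the grid search plays the role of the inner X-Partitioner, the outer folds the role of the second one):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: 3-fold CV grid search for parameter tuning
inner = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None]},
    cv=3)

# Outer loop: 5-fold CV over the *entire tuning procedure*;
# each outer test fold is never used by the inner grid search
scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Because tuning is repeated inside every outer fold, the reported mean score is an unbiased estimate of generalization, at the cost of running the grid search five times.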

Hope this helps.

Best,
Keerthan


@k10shetty1 : Thank you very much for your support. I’m now doing it with nested cross-validation.

Inside the Component:

