Parameter Optimization, two X-Partitioners and potential Data Leakage

Question for the pros: How high is the risk of data leakage in the setting shown? Or is it sufficient to use different fixed seeds for the two X-Partitioner nodes?

Thanks in advance
Christian

Hi @Christian_Essen,

I think that using two cross-validation loops on the entire dataset risks data leakage. The second round of cross-validation is not a true test, since every row was already used in the first round to tune parameters.

I would suggest the following:

  • First, split your data once, for example, into 70% for training and 30% for testing.
  • Then, run cross-validation only on the 70% training set to find the best parameters.
  • Train the final model on that 70% and test it on the untouched 30%.
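Outside of KNIME, the same hold-out workflow can be sketched with scikit-learn (illustrative only; the dataset, model, and parameter grid here are placeholders, not from the original workflow):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 1. One fixed split: 70% for training, 30% held out for the final test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# 2. Cross-validation for parameter tuning on the training set ONLY
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, None]},
    cv=5)
search.fit(X_train, y_train)

# 3. The best model (refit on the full 70%) is evaluated once on the
#    untouched 30% - this score is free of tuning leakage
print("best params:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```

The key point is that the 30% test rows never enter the grid search, so the final score is an honest estimate.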

Alternatively, you can use nested cross-validation: an inner CV loop tunes the parameters, and an outer CV loop estimates the performance of the whole tuning procedure, so no test fold is ever seen during tuning.
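A minimal nested-CV sketch in scikit-learn, again with placeholder data and model (the grid search plays the role of the inner X-Partitioner, the outer folds the role of the second one):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: 3-fold CV grid search for parameter tuning
inner = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None]},
    cv=3)

# Outer loop: 5-fold CV over the *entire tuning procedure*;
# each outer test fold is never used by the inner grid search
scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Because tuning is repeated inside every outer fold, the reported mean score is an unbiased estimate of generalization, at the cost of running the grid search five times.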

Hope this helps.

Best,
Keerthan


@k10shetty1 : Thank you very much for your support. I’m now doing it with nested cross-validation.

Inside the Component:

