Normalize data before or after train-test split

Hi,

I have a question about how to normalize the data before feeding it into a neural network: do I normalize the data before or after splitting it into the training and test set? In many discussions I found that it should be done after partitioning the data.

Thanks,
Nafeeza

I usually normalize my datasets before splitting the data into k-fold partitions. Once your data are normalized, you can feed them into the cross-validation node and swap the algorithm inside this node for a neural network algorithm. This way, you achieve a k-fold cross-validation of your algorithm on your normalized dataset.

If by test set you mean your holdout set that is never used until the end, for the final check/validation, then you absolutely have to normalize after the split, or else you leak information from the test set into the training set. That information is the range and distribution of your features.
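A minimal sketch of what that looks like in Python with scikit-learn (assuming a MinMaxScaler and a built-in example dataset as stand-ins for your own normalization and data): fit the scaler on the training data only, then reuse those training-set statistics to transform the test data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical example data; substitute your own features and labels.
X, y = load_breast_cancer(return_X_y=True)

# Split first, so the test set stays truly unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only...
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ...and apply the training-set min/max to the test data,
# so nothing about the test distribution influences training.
X_test_scaled = scaler.transform(X_test)
```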

Hi @Nafeeza86,

@beginner is absolutely right!
Although there are many examples where normalization is done prior to the split, this contradicts the purpose of the split itself, which is to assess how well your model generalizes to unseen data.
Note as well that your model isn’t just the neural network you train but also all of the pre- and postprocessing steps you apply to get from the raw inputs to the desired outputs.
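One way to make that concrete (a sketch in Python with scikit-learn, assuming an MLPClassifier stands in for your neural network): bundle the scaler and the network into a single pipeline, so cross-validation refits the scaler on each training fold only and the preprocessing is evaluated as part of the model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical example data; substitute your own dataset.
X, y = load_breast_cancer(return_X_y=True)

# The "model" is the whole pipeline: normalization + neural network.
model = make_pipeline(
    MinMaxScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)

# In each fold, the scaler is fit on the training portion only,
# so no information from the held-out fold leaks into training.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```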

Cheers,

Adrian
