How to use normalization in a data science solution

This is how to apply normalization correctly in a data science problem. The normalization model is built on the training set only and is then applied, unchanged, to the test set. The same normalization model is used to denormalize the numerical attributes back into their original ranges.
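As a rough illustration of the idea, here is a minimal Python sketch, assuming min-max scaling to the range [0, 1]. The names `MinMaxModel`, `fit_min_max`, `normalize`, and `denormalize` are hypothetical helpers made up for this example, and the data values are invented; the original entry may use a different normalization method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MinMaxModel:
    """Hypothetical container for the parameters learned from the training set."""
    minimum: float
    maximum: float


def fit_min_max(train_values: List[float]) -> MinMaxModel:
    # The model is built from the training values only.
    return MinMaxModel(min(train_values), max(train_values))


def normalize(values: List[float], model: MinMaxModel) -> List[float]:
    span = model.maximum - model.minimum
    if span == 0:
        # Constant training column: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - model.minimum) / span for v in values]


def denormalize(values: List[float], model: MinMaxModel) -> List[float]:
    # Inverse of normalize(), using the same stored training parameters.
    span = model.maximum - model.minimum
    return [v * span + model.minimum for v in values]


# Invented example data (assumption, not from the original entry).
train = [1.0, 2.0, 3.0, 5.0]
test = [2.0, 4.0]

model = fit_min_max(train)                        # built on the training set
train_scaled = normalize(train, model)            # applied to the training set
test_scaled = normalize(test, model)              # same model applied to the test set
test_restored = denormalize(test_scaled, model)   # back to the original range

print(train_scaled, test_scaled, test_restored)
```

The point of the design is that the test set never influences the stored minimum and maximum; it is only transformed with them.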

I have never seen this before, but I tried it on my own data. When I apply the training normalization to the test set, I get:

WARN Normalizer (Apply) 3:43 Normalized value is out of bounds. Original value: 4.0 Transformed value: 1.3333333333333333 Upper Bound: 1.0
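The numbers in that warning are consistent with min-max normalization to [0, 1] where the test value lies above the largest training value. A small Python sketch of the arithmetic follows. The training minimum and maximum are not shown in the log, so the values 0.0 and 3.0 below are only one assumed pair that reproduces the reported result.

```python
# Assumed training range; the log does not report these two values.
train_min, train_max = 0.0, 3.0

# Original value taken from the warning message.
test_value = 4.0

# Min-max scaling with the parameters learned from the training set.
scaled = (test_value - train_min) / (train_max - train_min)

# The upper bound reported in the warning is 1.0.
out_of_bounds = scaled > 1.0

print(scaled, out_of_bounds)
```

Under these assumptions, any test value larger than the training maximum is mapped above 1.0, which is the condition the warning reports.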

This leads me to believe that, if you want to normalize, it should be done on the entire data set (i.e. the training and test sets together) before splitting.