Is it advisable to always normalize values in a dataset?

I am working through an exercise and noticed there is no Normalize node in it. Some methods, like k-means, have a clear reason for normalization: you want all features on a comparable scale so the distances (and therefore the clusters) aren't dominated by a single attribute. Others, like decision trees, are insensitive to monotonic rescaling, so there is no obvious rationale for normalizing at all. Still, is it generally good practice to normalize and then denormalize?
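To make the normalize/denormalize round trip concrete, here is a minimal sketch in Python with scikit-learn (invented values and a library swap, not the attached KNIME workflows): it scales two differently scaled features, clusters the scaled data, and then inverts the scaling.

```python
# Hypothetical illustration of the Normalizer -> model -> Denormalizer pattern.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Two features on very different scales: income (EUR) and height (cm).
X = np.array([[30_000, 170],
              [45_000, 182],
              [52_000, 165],
              [80_000, 178]], dtype=float)

scaler = MinMaxScaler()                      # maps each column to [0, 1]
X_scaled = scaler.fit_transform(X)           # the "normalize" step

# k-means on the scaled data; on raw data the income column would dominate the distances.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

X_restored = scaler.inverse_transform(X_scaled)  # the "denormalize" step
print(labels)
print(np.allclose(X_restored, X))            # True: the round trip loses nothing
```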

EXERCISEDECIS1.knwf (14.8 KB)

EXERCISEnormalizeDenormalize.knwf (14.8 KB)

Here’s the Normalizer node.


Anyway, whether to normalize or not depends on what you are trying to do and on the requirements of the method you are using.


If you normalize your data, you make it easier for models to compare different types of data. E.g. you have income and height. Without normalization it might be more difficult to compare them (whether that comparison makes sense is another matter), but if you scale both to 0-100 you get an index and can compare them.
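A rough sketch of that 0-100 index (plain NumPy with invented values, so an illustration rather than the exercise data):

```python
import numpy as np

income = np.array([30_000.0, 45_000.0, 52_000.0, 80_000.0])   # EUR
height = np.array([165.0, 170.0, 178.0, 182.0])                # cm

def to_index(x):
    """Min-max scale a column to 0-100 so differently scaled features become comparable."""
    return (x - x.min()) / (x.max() - x.min()) * 100

print(to_index(income))  # [  0.  30.  44. 100.]
print(to_index(height))  # [  0.  ~29.4  ~76.5  100.]
```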

There are several tweaks and things to consider. If you have strong outliers (e.g. Bill Gates in your income sample), your index might not be good: Bill would get the 100 while we mortals would get 0.125 or something. In such a case it might be better to use something like logarithmic scaling to keep the ‘structure’ of your values, and the information about the difference between incomes of 50k and 250k, intact. Or you have to throw Bill out.

But keep in mind: the logarithm is not defined for zero or negative values, so it might be necessary to bring your data onto a positive scale first, e.g. convert zeros to 0.2 and missing values to 0.1 or something similar. In the end it depends on how your data is structured.
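Here is a hedged sketch of the outlier problem and the log workaround (NumPy only, invented income values; the +1 shift is just one common alternative to the 0.2/0.1 trick above):

```python
import numpy as np

# Ordinary incomes plus one extreme outlier ("Bill").
income = np.array([0.0, 30_000.0, 50_000.0, 250_000.0, 100_000_000_000.0])

def to_index(x):
    return (x - x.min()) / (x.max() - x.min()) * 100

# Plain min-max index: Bill gets 100, everyone else is squashed towards 0.
print(to_index(income).round(4))

# Log scaling: shift by +1 first because log(0) is undefined, then build the index.
log_index = to_index(np.log10(income + 1.0))
print(log_index.round(1))   # the 50k vs 250k difference stays visible (~43 vs ~49)
```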

Maybe test the different scaling options that KNIME has to offer, as @izaychik63 suggested.

