How to keep track of transformed data using the Category to Number node?

Hey all,

Thanks for your help on the previous question. I am currently working on an ML project that estimates the price of a car from data about other cars, their features and their prices. As part of data prep, I have been trying to convert a few string columns into integers that I can plug into a linear regression formula. For example, I have one column named “Color” that lists the colors Black, Red, Yellow, Gray, White, Orange, and Blue. When I use the Category to Number node, it translates each category into a number (int), but how do I keep track of which integer each color has been assigned (e.g. 0 = Black, 1 = Red, 2 = Yellow, etc.)? Thanks for your help in advance!

Hello @fischers97
I understand from your description that you are trying to factorize/‘dummyrize’ qualitative data for ML purposes… but I don’t know whether you are simplifying your question too much.

There is plenty of literature on this. In your example (car colors), the qualitative variable has no ordinal value. This matters because many ML algorithms, linear regression included, treat encoded numbers as quantities with magnitude and distance. So if there is no prior reason to expect price to rise or fall step by step with the color code, and you want the algorithm to test whether color is statistically significant, you will have to use dummy variable columns. That is, for each color create a column of TRUE (1) / FALSE (0): $color_black$, $color_red$, $color_yellow$

… being aware of the dummy variable trap: one of the dummy columns must be removed so that the same information is not encoded twice. In your example, black is already implied by ( red == 0 && yellow == 0 ), so keeping a black column as well would introduce multicollinearity; you won’t need it.
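To make the dummy-column idea concrete, here is a minimal sketch in plain Python (not a KNIME node). The sample rows, the `dummy_encode` helper and the choice of Black as the dropped baseline are all assumptions for illustration only.

```python
# Hypothetical sample rows; the color names come from the question above.
colors = ["Black", "Red", "Yellow", "Red", "Black"]

def dummy_encode(values, baseline):
    """Return one 0/1 column per category, skipping the baseline category."""
    categories = sorted(set(values) - {baseline})
    return {
        f"color_{cat.lower()}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# Black is the dropped baseline: a row with all zeros means "Black".
columns = dummy_encode(colors, baseline="Black")
for name, column in columns.items():
    print(name, column)
# color_red [0, 1, 0, 1, 0]
# color_yellow [0, 0, 1, 0, 0]
```

With the baseline dropped, each remaining regression coefficient reads as the price difference relative to a black car.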

In the example you give in parentheses, you are ‘factorizing’ the color property, which imposes a distance assumption on it. That is: red == 1 is treated as a higher value than black == 0 (?) …

If you do factorize a categorical value, you will need to keep the replacement dictionary and map the codes back to the original labels in the final results for visualization purposes.
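As a rough sketch of what keeping the dictionary could look like, again in plain Python rather than KNIME: the sample rows and the variable names `mapping` and `reverse` are hypothetical, and the codes are assigned in order of first appearance, which is an assumption and not necessarily how the Category to Number node numbers its categories.

```python
# Hypothetical sample rows; the color names come from the question above.
colors = ["Black", "Red", "Yellow", "Red", "Black"]

# Build the color -> integer dictionary in order of first appearance.
mapping = {}
for color in colors:
    if color not in mapping:
        mapping[color] = len(mapping)

# Apply the dictionary to get the integer column.
encoded = [mapping[color] for color in colors]

# Invert the dictionary so the codes can be translated back later.
reverse = {code: color for color, code in mapping.items()}
decoded = [reverse[code] for code in encoded]

print(mapping)  # {'Black': 0, 'Red': 1, 'Yellow': 2}
print(encoded)  # [0, 1, 2, 1, 0]
print(decoded)  # ['Black', 'Red', 'Yellow', 'Red', 'Black']
```

Storing `mapping` next to the model (for example as a small lookup table) is what lets you label the results with color names again at the end.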

Hope this helps.


Besides the useful tips from @gonhaddock:
your colors don’t have a hierarchy (as already mentioned), so I would rather use one-hot encoding (e.g. the One to Many node in KNIME).
There might be some experienced data scientists here who want to dive deeper.
