I am trying to do a linear regression for a sales price dataset with a lot of categorical variables (about half the dataset is categorical and the rest numeric). I was just wondering if it is okay to use the category to number node to transform all the nominal attributes to numeric to perform the linear regression? Is this better or worse than doing one-hot encoding and making dummy variables?
The reason you should not use category to number is that most if not all algorithms will then assume that 1 is further away from 10 than from 5. If your categories have no implicit distance between each other, this is misleading. A discrete (integer) column representing a category should only be used if the distances are correct: category 1 and 2 are exactly as far apart as 11 and 12, and 1 is exactly as far from 11 as 2 is from 12.
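To make that concrete, here is a small stdlib-only sketch (the payment-method labels are made up) showing that the numeric gaps produced by integer encoding depend only on the arbitrary order in which codes are assigned, not on anything real about the categories:

```python
# Hypothetical nominal feature: payment method on each sale record.
labels = ["card", "cash", "voucher", "wire"]

# Integer encoding: assign codes by (arbitrary) alphabetical position.
code = {lab: i for i, lab in enumerate(sorted(labels))}
# code == {'card': 0, 'cash': 1, 'voucher': 2, 'wire': 3}

# A linear model now "sees" cash as 1 unit from card but 2 units from wire,
# even though no such distances exist between payment methods.
assert abs(code["cash"] - code["card"]) == 1
assert abs(code["wire"] - code["cash"]) == 2

# Merely renaming one label ("wire" -> "bank_wire") reshuffles every distance:
code2 = {lab: i for i, lab in enumerate(sorted(["card", "cash", "voucher", "bank_wire"]))}
# 'bank_wire' now sorts first, so 'cash' moves from code 1 to code 2.
```

A linear regression fitted on such a column estimates one slope per unit of code, so these accidental distances go straight into the coefficients.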
If you have categories like non-smoker, casual smoker, regular smoker and chain smoker, this does not hold: the categories have an order, but there is no inherent distance between them.
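One-hot (dummy) encoding avoids these false distances by giving each level its own indicator column, so the model fits a separate coefficient per category. A minimal sketch using the smoker levels above:

```python
# Levels of the nominal variable from the example above.
levels = ["non-smoker", "casual smoker", "regular smoker", "chain smoker"]

def one_hot(value, levels):
    """Return a dummy-variable row: 1 in the column of the matching level, 0 elsewhere."""
    return [1 if value == lev else 0 for lev in levels]

row = one_hot("regular smoker", levels)
# row == [0, 0, 1, 0]
```

(For a regression with an intercept you would normally drop one of the columns to avoid perfect collinearity; this sketch keeps all four for clarity.)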
Thank you for your explanation @kienerj! I understand that integer encoding is not technically correct for nominal variables with no inherent order. However, when I used it on my dataset it gave me quite a good model despite the many nominal variables. I can't use one-hot encoding because it greatly increases the dimensionality of the dataset. In this case, is it okay to stick with integer encoding, or do I have to use count encoding or something else?
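For reference, the count (frequency) encoding mentioned here replaces each category with how often it occurs in the column, so the feature stays a single numeric column. A stdlib sketch with made-up neighborhood data:

```python
from collections import Counter

# Hypothetical nominal column from a sales-price dataset.
neighborhood = ["NAmes", "CollgCr", "NAmes", "OldTown", "NAmes", "CollgCr"]

# Count encoding: replace each category with its frequency in the column.
counts = Counter(neighborhood)
encoded = [counts[v] for v in neighborhood]
# NAmes appears 3x, CollgCr 2x, OldTown 1x:
# encoded == [3, 2, 3, 1, 3, 2]
```

Note the trade-offs: no dimensionality blow-up, but two categories with the same count become indistinguishable, and the encoding only helps a linear model if frequency actually correlates with the target.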
One way to treat such variables is to use a tool like vtreat to automatically re-code your data, either converting each category into a single encoded new variable or pooling some categories together.
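One of the things such tools do is pool rare levels into a catch-all category before encoding, which keeps the dimensionality down. A stdlib sketch of that idea (the threshold and the `_rare_` label are made up for illustration, not vtreat's actual API):

```python
from collections import Counter

# Hypothetical high-cardinality column with some rare levels.
values = ["A", "B", "A", "C", "A", "D", "B", "A"]
min_count = 2  # hypothetical threshold: levels seen fewer times get pooled

counts = Counter(values)
pooled = [v if counts[v] >= min_count else "_rare_" for v in values]
# "C" and "D" each occur once, so they collapse into "_rare_":
# pooled == ["A", "B", "A", "_rare_", "A", "_rare_", "B", "A"]
```

After pooling, one-hot encoding the column produces far fewer dummy columns, which may make it feasible again for your dataset.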
Another approach could be to use something like a dictionary vectorizer.
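"Dictionary vectorizer" usually refers to something like scikit-learn's `DictVectorizer`, which one-hot encodes string-valued features and passes numeric features through unchanged. Here is a minimal pure-Python sketch of that idea (feature names like `zone` and `area` are made up; this is not the sklearn implementation):

```python
def dict_vectorize(rows):
    """One-hot encode string-valued keys, pass numeric values through,
    mirroring what a dictionary vectorizer does."""
    # Output feature names: "key=value" for strings, plain "key" for numbers.
    feats = sorted({f"{k}={v}" if isinstance(v, str) else k
                    for row in rows for k, v in row.items()})
    index = {f: i for i, f in enumerate(feats)}
    out = []
    for row in rows:
        vec = [0.0] * len(feats)
        for k, v in row.items():
            if isinstance(v, str):
                vec[index[f"{k}={v}"]] = 1.0  # indicator for this category
            else:
                vec[index[k]] = float(v)       # numeric value kept as-is
        out.append(vec)
    return feats, out

names, X = dict_vectorize([
    {"zone": "RL", "area": 8450},
    {"zone": "RM", "area": 6120},
])
# names == ['area', 'zone=RL', 'zone=RM']
# X == [[8450.0, 1.0, 0.0], [6120.0, 0.0, 1.0]]
```

Since this is still one-hot encoding under the hood, it has the same dimensionality cost as dummy variables; its convenience is handling mixed numeric/nominal records in one pass.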