I am trying to do a linear regression for a sales price dataset with a lot of categorical variables (about half the dataset is categorical and the rest numeric). I was just wondering if it is okay to use the category to number node to transform all the nominal attributes to numeric to perform the linear regression? Is this better or worse than doing one-hot encoding and making dummy variables?
The reason you should not use category to number is that most if not all algorithms will then assume that 1 is further away from 10 than from 5. If your categories have no implicit distance between each other, this is misleading. A discrete (integer) column representing a category should only be used if the distances are correct: category 1 and 2 are exactly as far apart as 11 and 12, and 1 is exactly as far from 11 as 2 is from 12.
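To make that concrete, here is a small stdlib-only sketch (the payment-method labels are made up) showing that the numeric gaps produced by integer encoding depend only on the arbitrary order in which codes are assigned, not on anything real about the categories:

```python
# Hypothetical nominal feature: payment method on each sale record.
labels = ["card", "cash", "voucher", "wire"]

# Integer encoding: assign codes by (arbitrary) alphabetical position.
code = {lab: i for i, lab in enumerate(sorted(labels))}
# code == {'card': 0, 'cash': 1, 'voucher': 2, 'wire': 3}

# A linear model now "sees" cash as 1 unit from card but 2 units from wire,
# even though no such distances exist between payment methods.
assert abs(code["cash"] - code["card"]) == 1
assert abs(code["wire"] - code["cash"]) == 2

# Merely renaming one label ("wire" -> "bank_wire") reshuffles every distance:
code2 = {lab: i for i, lab in enumerate(sorted(["card", "cash", "voucher", "bank_wire"]))}
# 'bank_wire' now sorts first, so 'cash' moves from code 1 to code 2.
```

A linear regression fitted on such a column estimates one slope per unit of code, so these accidental distances go straight into the coefficients.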
If you have categories like non-smoker, casual smoker, regular smoker and chain smoker, this does not hold: the categories have an order, but there is no inherent distance between them.
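One-hot (dummy) encoding avoids these false distances by giving each level its own indicator column, so the model fits a separate coefficient per category. A minimal sketch using the smoker levels above:

```python
# Levels of the nominal variable from the example above.
levels = ["non-smoker", "casual smoker", "regular smoker", "chain smoker"]

def one_hot(value, levels):
    """Return a dummy-variable row: 1 in the column of the matching level, 0 elsewhere."""
    return [1 if value == lev else 0 for lev in levels]

row = one_hot("regular smoker", levels)
# row == [0, 0, 1, 0]
```

(For a regression with an intercept you would normally drop one of the columns to avoid perfect collinearity; this sketch keeps all four for clarity.)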
Thank you for your explanation @kienerj! I understand that integer encoding is not technically correct for nominal variables with no inherent order. However, when I used it on my dataset it gave me quite a good model despite the many nominal variables. I can't use one-hot encoding because it greatly increases the dimensionality of the dataset. In this case, is it okay to stick with integer encoding, or do I have to use count encoding or something else?
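For reference, the count (frequency) encoding mentioned here replaces each category with how often it occurs in the column, so the feature stays a single numeric column. A stdlib sketch with made-up neighborhood data:

```python
from collections import Counter

# Hypothetical nominal column from a sales-price dataset.
neighborhood = ["NAmes", "CollgCr", "NAmes", "OldTown", "NAmes", "CollgCr"]

# Count encoding: replace each category with its frequency in the column.
counts = Counter(neighborhood)
encoded = [counts[v] for v in neighborhood]
# NAmes appears 3x, CollgCr 2x, OldTown 1x:
# encoded == [3, 2, 3, 1, 3, 2]
```

Note the trade-offs: no dimensionality blow-up, but two categories with the same count become indistinguishable, and the encoding only helps a linear model if frequency actually correlates with the target.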
One way to treat such variables is to use a tool like vtreat to automatically re-code your data, either converting each category into a single encoded new variable or pooling some categories together.
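One of the things such tools do is pool rare levels into a catch-all category before encoding, which keeps the dimensionality down. A stdlib sketch of that idea (the threshold and the `_rare_` label are made up for illustration, not vtreat's actual API):

```python
from collections import Counter

# Hypothetical high-cardinality column with some rare levels.
values = ["A", "B", "A", "C", "A", "D", "B", "A"]
min_count = 2  # hypothetical threshold: levels seen fewer times get pooled

counts = Counter(values)
pooled = [v if counts[v] >= min_count else "_rare_" for v in values]
# "C" and "D" each occur once, so they collapse into "_rare_":
# pooled == ["A", "B", "A", "_rare_", "A", "_rare_", "B", "A"]
```

After pooling, one-hot encoding the column produces far fewer dummy columns, which may make it feasible again for your dataset.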
Another approach could be to use something like a dictionary vectorizer.
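"Dictionary vectorizer" usually refers to something like scikit-learn's `DictVectorizer`, which one-hot encodes string-valued features and passes numeric features through unchanged. Here is a minimal pure-Python sketch of that idea (feature names like `zone` and `area` are made up; this is not the sklearn implementation):

```python
def dict_vectorize(rows):
    """One-hot encode string-valued keys, pass numeric values through,
    mirroring what a dictionary vectorizer does."""
    # Output feature names: "key=value" for strings, plain "key" for numbers.
    feats = sorted({f"{k}={v}" if isinstance(v, str) else k
                    for row in rows for k, v in row.items()})
    index = {f: i for i, f in enumerate(feats)}
    out = []
    for row in rows:
        vec = [0.0] * len(feats)
        for k, v in row.items():
            if isinstance(v, str):
                vec[index[f"{k}={v}"]] = 1.0  # indicator for this category
            else:
                vec[index[k]] = float(v)       # numeric value kept as-is
        out.append(vec)
    return feats, out

names, X = dict_vectorize([
    {"zone": "RL", "area": 8450},
    {"zone": "RM", "area": 6120},
])
# names == ['area', 'zone=RL', 'zone=RM']
# X == [[8450.0, 1.0, 0.0], [6120.0, 0.0, 1.0]]
```

Since this is still one-hot encoding under the hood, it has the same dimensionality cost as dummy variables; its convenience is handling mixed numeric/nominal records in one pass.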