Hi, I would like to understand when feature encoding is necessary. For example, I am using models like XGBoost and Random Forest. Do I need to encode categorical variables? For instance, if I have a variable like ‘education’ with levels high-school, bachelor, master, phd, should I recode the feature as high-school=1, bachelor=2, master=3, phd=4? Or is it better to use one-hot encoding, creating 4 new columns? What if the categorical variable doesn’t have a natural order, like ‘country of birth’? In that case, is one-hot encoding the only correct option?
@pippo This very much depends on your task. Depending on the implementation, XGBoost and Random Forest can handle categorical data themselves, although more advanced data preparation can still help to improve performance.
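To make the two options from the question concrete, here is a minimal sketch of both encodings. It is written in plain Python so it runs without extra packages; the helper names `ordinal_encode` and `one_hot_encode` and the toy values are made up for illustration, and in practice you would more likely use something like `pandas.get_dummies` or scikit-learn's `OrdinalEncoder` and `OneHotEncoder`.

```python
# Illustrative sketch only: helper names and toy data are hypothetical.

# The natural order of the 'education' levels, as given in the question.
EDUCATION_ORDER = ["high-school", "bachelor", "master", "phd"]


def ordinal_encode(values, ordered_levels):
    """Map each level to its 1-based rank in ordered_levels."""
    rank = {level: i for i, level in enumerate(ordered_levels, start=1)}
    return [rank[v] for v in values]


def one_hot_encode(values):
    """Return (levels, rows) with one 0/1 indicator column per distinct level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Toy data (assumed, not from a real dataset).
education = ["bachelor", "phd", "high-school", "master"]
country = ["Italy", "Japan", "Italy", "Brazil"]

# Ordered variable: a single integer column that preserves the ranking.
education_codes = ordinal_encode(education, EDUCATION_ORDER)

# Unordered variable: one indicator column per country, no implied ranking.
country_levels, country_rows = one_hot_encode(country)

print(education_codes)
print(country_levels)
print(country_rows)
```

The integer codes let a tree split on thresholds such as "master or higher", which only makes sense because the levels are ordered. For 'country of birth' the same coding would impose an arbitrary ranking, which is why indicator columns are the usual choice there.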
You can find more on data preparation and machine learning here:
There are more advanced preparation techniques like vtreat and other Python packages. Some examples are here (I have some code still on my machine that might get published in the future):