When to do feature encoding with XGBoost, random forest

pippo · January 8, 2025, 11:41am

Hi, I would like to understand when feature encoding is necessary. For example, I am using models like XGBoost and Random Forest. Do I need to encode categorical variables? For instance, if I have a variable like ‘education’ with the levels high-school, bachelor, master, phd, should I recode the feature as high-school=1, bachelor=2, master=3, phd=4? Or is it better to use one-hot encoding, creating 4 new columns? And what if the categorical variable doesn’t have a natural order, like ‘country of birth’? In that case, is one-hot encoding the only correct option?

mlauber71 · January 8, 2025, 12:50pm

@pippo This will depend very much on your task. XGBoost and Random Forest can handle categorical data themselves, although whether that works out of the box depends on the implementation you use. Advanced data preparation can still help to improve performance.
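To make the two encodings from the question concrete, here is a minimal sketch. It assumes pandas (the thread does not name a library), and the toy data, column names, and variable names are made up for illustration. The ordered ‘education’ levels are mapped to the integers 1 to 4, while the unordered ‘country of birth’ is one-hot encoded.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "education": ["bachelor", "phd", "high-school", "master"],
    "country_of_birth": ["Italy", "Germany", "Italy", "Brazil"],
})

# Ordinal feature: map the levels to integers that preserve their natural order.
education_order = ["high-school", "bachelor", "master", "phd"]
df["education_code"] = pd.Categorical(
    df["education"], categories=education_order, ordered=True
).codes + 1  # codes start at 0; adding 1 gives 1..4 as in the question

# Nominal feature: one-hot encode, producing one indicator column per country.
country_dummies = pd.get_dummies(
    df["country_of_birth"], prefix="country", dtype=int
)

# Combine the encoded columns into a purely numeric feature table.
encoded = pd.concat([df[["education_code"]], country_dummies], axis=1)
print(encoded)
```

Tree-based models split on thresholds, so an integer code that respects the order is usually enough for an ordinal feature. One-hot encoding avoids imposing an artificial order on a nominal feature, at the cost of one extra column per level.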

You can find more on data preparation and machine learning here:

There are more advanced preparation techniques like vtreat and other Python packages. Some examples are here (I have some code still on my machine that might get published in the future):
