Categorical values + Mining models (RF, GBT, etc.)

Hello,

I would like your comments related to how to use categorical values (string values in a column) in a mining model, like a Random Forest, Gradient Boosted Trees, etc.

My question is related to one hot encoding:

  • Do I have to use the "One to Many" node to use the categorical columns correctly?
  • Do I have to add one column with a 0 or 1 for each unique value in the categorical column? Or do I have to use N-1 columns?
  • Does the Random Forest implementation (and others) automatically convert the categorical columns to 0/1 columns?
  • Does it work like in R with the Factor vectors? Source: https://stats.stackexchange.com/questions/49243/rs-randomforest-can-not-handle-more-than-32-levels-what-is-workaround

Also, related to normalization:

  • Do I have to normalize the columns to use these methods correctly?

I have seen and read several threads in the forum related to this topic, but haven't found one that says whether this is necessary. I also can't find anything about it in the help for these methods.

Thank you for your insights.

Regards,

Juan Pedro

Hello Juan,

In the case of tree-based models like RF and GBT, your life is easy: they handle categorical values out of the box.

By default, binary splits are calculated, i.e. the possible values are split into two disjoint subsets, and each row goes to the child node (in the tree) whose subset contains the row's value.
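To make the idea of a binary nominal split concrete, here is a minimal sketch in plain Python. It is not KNIME's implementation: it simply tries every way of dividing the categories into two non-empty subsets and keeps the one with the lowest weighted Gini impurity. The function names and the toy data are made up for illustration, and exhaustive enumeration like this grows exponentially with the number of categories, so real implementations may well use heuristics instead.

```python
from itertools import combinations

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_binary_nominal_split(values, labels):
    """Try every two-subset partition of the categories and return
    (weighted impurity, left subset, right subset) for the best one.
    Returns None if the column has fewer than two categories."""
    categories = sorted(set(values))
    n = len(values)
    best = None
    # Keeping the first category on the left avoids testing mirrored partitions.
    first, rest = categories[0], categories[1:]
    for k in range(len(rest)):  # k < len(rest) keeps the right side non-empty
        for extra in combinations(rest, k):
            left = {first, *extra}
            left_y = [y for v, y in zip(values, labels) if v in left]
            right_y = [y for v, y in zip(values, labels) if v not in left]
            score = (len(left_y) * gini(left_y) + len(right_y) * gini(right_y)) / n
            if best is None or score < best[0]:
                best = (score, left, set(categories) - left)
    return best

# Hypothetical toy data: one string column and a class column.
colors = ["red", "green", "blue", "red", "blue", "green"]
target = ["yes", "no", "yes", "yes", "yes", "no"]

score, left, right = best_binary_nominal_split(colors, target)
print(score, sorted(left), sorted(right))
```

On this toy data the best split sends "blue" and "red" to one child and "green" to the other, which separates the two classes without any numeric encoding of the strings.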

However, you can also use multiway splits by disabling the "Use binary nominal splits" option.

It's of course also possible to perform a one-hot encoding, but I doubt that it will gain you much.
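For reference, this is roughly what a one-hot encoding produces, sketched in plain Python rather than with any KNIME node. The helper `one_hot` and its `drop_first` flag are hypothetical names chosen for this example. With `drop_first=False` you get one 0/1 column per unique value (N columns); with `drop_first=True` you get N-1 columns, where the dropped category is represented by all zeros. The N-1 variant mainly matters for linear models, where the full set of N columns is collinear with the intercept.

```python
def one_hot(values, drop_first=False):
    """Return (column names, rows of 0/1 flags) for a list of category strings."""
    categories = sorted(set(values))
    if drop_first:
        # N-1 columns: the dropped category is encoded as all zeros.
        categories = categories[1:]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical toy column.
colors = ["red", "green", "blue", "green"]

full_names, full_rows = one_hot(colors)
reduced_names, reduced_rows = one_hot(colors, drop_first=True)
print(full_names, full_rows)
print(reduced_names, reduced_rows)
```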

Regarding your question about normalization:

Tree-based models like RF and GBT are invariant to the scale of the input columns, and therefore normalization is not needed.
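A small sketch of why scale does not matter, again in plain Python and not taken from any KNIME node: a threshold split only depends on the order of the values, and min-max normalization keeps that order. The helper names and the toy data below are assumptions made for this example. The best split is found once on the raw column and once on the normalized column, and the rows sent to the left child are compared.

```python
def misclassified(indices, ys):
    """Rows that disagree with the majority class among the given rows."""
    labels = [ys[i] for i in indices]
    return len(labels) - max(labels.count(c) for c in set(labels))

def best_threshold_partition(xs, ys):
    """Return the row indices sent to the left child by the best threshold split."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_errors, best_left = None, None
    for cut in range(1, len(order)):
        if xs[order[cut - 1]] == xs[order[cut]]:
            continue  # no threshold fits between two equal values
        left, right = order[:cut], order[cut:]
        errors = misclassified(left, ys) + misclassified(right, ys)
        if best_errors is None or errors < best_errors:
            best_errors, best_left = errors, frozenset(left)
    return best_left

def min_max(xs):
    """Min-max normalization to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Hypothetical toy data: a numeric column and a binary class column.
income = [18000.0, 52000.0, 31000.0, 75000.0, 24000.0, 60000.0]
bought = [0, 1, 0, 1, 0, 1]

raw_partition = best_threshold_partition(income, bought)
scaled_partition = best_threshold_partition(min_max(income), bought)
print(raw_partition == scaled_partition)
```

The threshold value itself changes after normalization, but the rows on each side of it do not, so the resulting tree makes the same decisions.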

If you have more questions, please feel free to ask.

Cheers,

nemad

nemad,

Thank you very much for your reply.

I had my doubts about it... that's why I wanted to check whether my feeling was right.

Do you happen to know how it works for the Spark ML nodes?

Hi @juanbretti

The Spark nodes cannot accept String columns; you will see that when you try to configure them (Spark ML does not work with categorical values such as Strings). That’s where the Category to Number and Number to Category nodes come into play.

If you want to see how these nodes should be used and how they need to be applied before/after a Spark ML node, take a look at this Spark MLlib Decision Tree example workflow.
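The general pattern of encoding before the learner and decoding afterwards can be pictured with a small plain-Python sketch. This is only an illustration of the idea, not the KNIME nodes or the Spark API; the function names are hypothetical, and assigning codes in order of first appearance is an assumption made for the example.

```python
def fit_category_mapping(values):
    """Assign each distinct string an integer code, in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping

def to_numbers(values, mapping):
    """Encode strings as integer codes (the step before the learner)."""
    return [mapping[v] for v in values]

def to_categories(codes, mapping):
    """Decode integer codes back to strings (the step after the predictor)."""
    inverse = {code: category for category, code in mapping.items()}
    return [inverse[c] for c in codes]

# Hypothetical toy column.
weather = ["sunny", "rain", "sunny", "overcast"]

mapping = fit_category_mapping(weather)
codes = to_numbers(weather, mapping)
restored = to_categories(codes, mapping)
print(mapping)
print(codes)
print(restored)
```

The same mapping has to be reused for decoding, otherwise the predicted numbers cannot be turned back into the original labels.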

@oole, thank you for your reply.

My question is: do I have to do something like “OneHotEncoding” before using any of the Spark ML models?
Do I have to use the “Spark Category to Number” selecting “BINARY”?

Or does the Spark ML implementation take care of this by itself?
I mean, will it interpret the integers as category labels and not as continuous values? Again, 1 is not more related to 2 than to 100.

Thanks!