Data Manipulation of categorical data in the background in ML process

moritz_skb · August 10, 2023, 8:29am

Hi everybody,

I have a question regarding KNIME’s data manipulation in the machine learning process.

In KNIME it is possible to feed categorical data as a column (i.e. as a string) into an ML model without a manual preprocessing step.
I would like to know what KNIME does with this data in the background. Does it apply a numeric transformation or a one-hot encoding?

The following models are relevant:
XGBoost Tree Ensemble Learner
RandomForest Learner
Gradient Boosted Trees Learner
H2O Gradient Boosting Machine Learner
H2O Random Forest Learner

I would really appreciate an answer to this topic.

sanket_2012 · August 15, 2023, 1:27pm

Hi @moritz_skb ,
Welcome to the KNIME Community!
One-hot encoding is one of the steps that could be performed. In KNIME you can use the One to Many node to achieve this.

Another way could be to use the Domain Calculator node.

There are related workflows by our community members as well as some KNIMERs that you can check to see how they used them.

Thanks,
Sanket

Daniel_Weikert · August 15, 2023, 4:53pm

Hi @sanket_2012
I think the question is how this is implemented within the algorithms in KNIME.
E.g. trees do not need one-hot encoding by default; in sklearn, however, I think it is still required. So if someone drags a learner node into KNIME, what happens “under the hood” with the data? Does it do one-hot encoding in the background, or how is the data handled? Detailed documentation would be helpful if it is available somewhere.
br

mlauber71 · August 15, 2023, 8:49pm

@moritz_skb every KNIME node has some sort of documentation, for example the XGBoost Tree Ensemble Learner.

There you will find additional links and literature that explain what is done. In this case they point to the official documentation, where you will find what the algorithm does:

https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html
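As far as I understand the linked tutorial, a tree can test a categorical value in two ways: a one-hot style test (value equals one category) or a partition style test (value is in a set of categories). The following is only a minimal sketch of those two conditions in plain Python. It is not XGBoost or KNIME code, and the function names and sample values are made up for illustration.

```python
# Illustrative only: two ways a tree node could test a categorical value.
# Hypothetical helper names; this is not how XGBoost or KNIME implement it.

def onehot_split(value, category):
    """One-hot style test: true only if the value equals a single category."""
    return value == category


def partition_split(value, categories):
    """Partition style test: true if the value belongs to a set of categories."""
    return value in categories


colors = ["red", "green", "blue", "green", "yellow"]

# One-hot style: "green" versus everything else.
onehot_result = [onehot_split(c, "green") for c in colors]

# Partition style: {"green", "yellow"} versus everything else.
partition_result = [partition_split(c, {"green", "yellow"}) for c in colors]

print(onehot_result)
print(partition_result)
```

The partition test can group several categories on one side of a split, which a single one-hot column cannot do in one step.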

The KNIME implementation might ‘compress’ such settings, apply some of them by default, or expose them as switches in the node.

If you want to do the data preparation yourself, you could automate it using tools like vtreat or KNIME nodes (like Category to Number).
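To make the difference between a numeric transformation and a one-hot encoding concrete, here is a minimal sketch in plain Python. It only illustrates the two ideas. It is not the implementation of Category to Number, One to Many, or vtreat, and the function names and the sorted category order are assumptions chosen for the example.

```python
# Illustrative only: integer encoding versus one-hot encoding of a string column.
# Hypothetical helpers; the sorted category order is an assumption.

def fit_categories(values):
    """Collect the distinct categories of a column in sorted order."""
    return sorted(set(values))


def to_number(values, categories):
    """Integer encoding: replace each category with its position in the list."""
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in values]


def one_hot(values, categories):
    """One-hot encoding: one 0/1 indicator per category for every row."""
    return [[1 if v == cat else 0 for cat in categories] for v in values]


column = ["b", "a", "c", "a"]
cats = fit_categories(column)
numbers = to_number(column, cats)
dummies = one_hot(column, cats)

print(cats)
print(numbers)
print(dummies)
```

Note that integer encoding imposes an order on the categories that may not mean anything, while one-hot encoding avoids that at the cost of one extra column per category.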

Also there is this approach:

For the H2O Gradient Boosting Machine Learner and the H2O Random Forest Learner, this list can also serve as an overview of widely used methods:

https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html
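One idea found among encoding schemes of this kind is to keep only the most frequent levels of a column. The sketch below shows that idea in plain Python. It is not H2O's implementation, and the function name, the `max_levels` parameter, and the choice to map all remaining levels to a single `other` label are assumptions made for illustration.

```python
# Illustrative only: keep the most frequent levels of a categorical column.
# Hypothetical helper; the shared "other" label is an assumption.
from collections import Counter


def limit_levels(values, max_levels, other="other"):
    """Keep the max_levels most frequent categories, relabel the rest."""
    counts = Counter(values)
    keep = {level for level, _ in counts.most_common(max_levels)}
    return [v if v in keep else other for v in values]


column = ["a", "b", "a", "c", "a", "b", "d"]
limited = limit_levels(column, max_levels=2)

print(limited)
```

This kind of reduction can help when a column has many rare levels that would otherwise each need their own split or indicator.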

In this case you will find the options in the KNIME nodes.

sanket_2012 · August 16, 2023, 7:35am

@moritz_skb Apologies for the wrong interpretation.
Thank you @Daniel_Weikert for pointing it out and @mlauber71 for jumping in with the detailed explanation.

@moritz_skb Let us know if you have any other queries.

Thanks,
Sanket

system · November 14, 2023, 7:36am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.