What kind of data can be used for an XGBoost model?

Hello, I have a question. I am trying to build a predictive model with XGBoost. My data has 11,000 rows and 57 columns. It contains categorical columns, and every category has an ID. Can those IDs be used in an XGBoost model, which needs numerical values?
Thanks.

Hi jerem1,

Simply said: no, you should not use the numerical IDs of categories to make predictions. Ideally you google it and read a more extensive article on why. Here I’m simply stating that machine learning algorithms often assume that 1 and 2 are more closely related than 1 and 9, i.e. that the numerical order has a meaning, which it rarely has with categorical data. Whether someone is black, Asian or white doesn’t have any order, so assigning numbers to those categories doesn’t make sense and could even confuse the ML algorithm.

To solve this you can do one-hot encoding. Google it if you don’t know what it means. In KNIME this is done with the One to Many node. If you have a column with 4 categories, this will generate 4 new boolean columns (meaning 0 or 1), each marking whether the record is part of that category or not.
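As a rough illustration of what one-hot encoding produces, here is a minimal sketch in plain Python. The `one_hot` helper and the `region` column are made up for this example; in practice the KNIME node (or a library function) does this for you.

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 flag per category for every value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical categorical column with 4 distinct categories.
region = ["north", "south", "east", "west", "south"]

categories, encoded = one_hot(region)

# One new 0/1 column per category; exactly one flag is set in each row.
for value, flags in zip(region, encoded):
    print(value, dict(zip(categories, flags)))
```

Note that no category ends up "closer" to another one, which is the point of doing this instead of feeding in the IDs.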

This is the theory.

BUT…

You specifically mentioned XGBoost. XGBoost is a tree-based algorithm, and how susceptible it is to the issue explained above depends on the exact implementation and on your categorical values.

See for example this blog post. The point being that there is no clear answer and it’s complex. And it depends on your data.

If you have just a few columns, all with just a handful of possible values (probably good to clean these first!), then go for one-hot encoding. However, if you have many categorical columns, each having many possible values, then one-hot encoding could itself lead to problems: each new column will contain very few 1s and mostly 0s, which will lead the tree to “think” the column is unimportant as it barely contains any information.
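To get a feeling for how sparse the one-hot columns become, here is a small sketch in plain Python. The row count is taken from the question; the 500 distinct sales names and their even distribution are assumptions made up for the example, not facts about the actual data.

```python
from collections import Counter

n_rows = 11000        # row count mentioned in the question
n_categories = 500    # hypothetical number of distinct sales names

# Hypothetical column: the names are spread evenly over the rows.
sales_name = [f"sales_{i % n_categories}" for i in range(n_rows)]

counts = Counter(sales_name)

# One-hot encoding creates one new column per distinct value ...
n_new_columns = len(counts)

# ... and each of those columns is 1 only for the rows with that value.
share_of_ones = {name: count / n_rows for name, count in counts.items()}
max_share = max(share_of_ones.values())

print(f"{n_new_columns} new columns, at most {max_share:.2%} ones per column")
```

Under these assumptions one column turns into 500 columns that are each almost entirely 0.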

So again: ~5 columns with ~5 possible values? Do one-hot encoding. Otherwise, reply here with a better description of your data.


Thank you for the quick response @beginner.
My data has 8 categorical columns, such as sales name and company name, which have a lot of distinct values. My target is to predict whether a transaction is successful or not. I applied the One to Many node to my categorical columns as the one-hot encoding, but with my simple workflow the predictive model shows 100% accuracy.
Any other suggestions? Thank you so much.

Share your workflow, and if you can’t share the data, obfuscate it and use that for sharing the workflow.

100% accuracy means you are making a mistake somewhere.


Sorry for the late reply.
This is my simple workflow. Of course 100% accuracy means there is a mistake somewhere. I have tried different ways to fix this, but so far I can’t solve the problem.

With “share” I meant an actual export of your workflow. I can’t really tell much from a screenshot.

One idea could be to use an H2O GBM model and take a look at the variable importance. That might give you an idea of what is going on. If one variable takes all the explanatory power, that could give you a hint.
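A cruder check in the same spirit, which is not the H2O variable importance itself, is to look for a column whose values each map to exactly one target value. This is only a sketch in plain Python; the helper name, the column names and the tiny data set are all hypothetical.

```python
from collections import defaultdict


def perfectly_predictive_columns(rows, target):
    """Return columns where every distinct value maps to one target value."""
    suspects = []
    columns = [name for name in rows[0] if name != target]
    for column in columns:
        targets_per_value = defaultdict(set)
        for row in rows:
            targets_per_value[row[column]].add(row[target])
        if all(len(seen) == 1 for seen in targets_per_value.values()):
            suspects.append(column)
    return suspects


# Hypothetical data: "status_code" gives the outcome away, "company" does not.
rows = [
    {"company": "A", "status_code": "closed_won", "success": 1},
    {"company": "A", "status_code": "closed_lost", "success": 0},
    {"company": "B", "status_code": "closed_won", "success": 1},
    {"company": "B", "status_code": "closed_lost", "success": 0},
]

suspects = perfectly_predictive_columns(rows, "success")
print(suspects)
```

A column that is unique per row (for example a transaction ID) will also show up here, so treat the result as a hint to investigate, not as proof.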

Also, you could try AutoML and see where this leads you.

If you use macOS or Linux you could exclude all model types besides XGBoost and see what that does, maybe to get some ideas about good parameters. But always be careful with the split into training, validation and test data.
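For the split, a minimal sketch in plain Python of what "careful" can look like: every row lands in exactly one of the three sets. The 70/15/15 proportions, the seed and the function name are assumptions for illustration; in KNIME you would normally use a partitioning node instead.

```python
import random


def three_way_split(n_rows, train_pct=70, valid_pct=15, seed=42):
    """Shuffle row indices and cut them into train / validation / test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # fixed seed keeps it reproducible
    n_train = n_rows * train_pct // 100
    n_valid = n_rows * valid_pct // 100
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]     # whatever is left over
    return train, valid, test


train_idx, valid_idx, test_idx = three_way_split(11000)

# No row may appear in more than one set, otherwise the scores are inflated.
assert not set(train_idx) & set(test_idx)
assert not set(valid_idx) & set(test_idx)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The test rows should only be used once, at the very end, to score the final model.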
