# What kind of data can be used for an XGBoost model?

Hello, I have a question. I am trying to build a predictive model with XGBoost. My data has 11,000 rows and 57 columns. It contains categorical columns, and every category has an ID. Can I use those IDs as input for the XGBoost model, which needs numerical values?
Thanks.

Hi jerem1,

Simply put, no: you should not use the numerical IDs of categories to make predictions. Ideally you would look up a more extensive article on why. Here I'm only stating that machine learning algorithms often assume that 1 and 2 are more closely related than 1 and 9, i.e. that the numerical order has a meaning, which it rarely has with categorical data. Whether you are Black, Asian or white doesn't imply any order, so assigning numbers to these categories doesn't make sense and could even confuse the ML algorithm.
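To make that concrete, here is a tiny sketch in Python. The category names and IDs are made up for illustration; the point is only that once categories become numbers, distances and threshold splits treat the IDs as if their order meant something.

```python
# Hypothetical mapping from category to numeric ID (not from the original data).
category_id = {"A": 1, "B": 2, "C": 9}

# Numerically, A looks much "closer" to B than to C,
# even though the IDs were assigned arbitrarily.
dist_a_b = abs(category_id["A"] - category_id["B"])
dist_a_c = abs(category_id["A"] - category_id["C"])
print(dist_a_b, dist_a_c)

# A threshold split such as "id <= 5" groups A and B together
# purely because of how the IDs happened to be numbered.
left_branch = sorted(name for name, cid in category_id.items() if cid <= 5)
print(left_branch)
```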

To solve this you can use one-hot encoding. Google it if you don't know what it means. In KNIME this is done with the One to Many node. If you have a column with 4 categories, this will generate 4 new boolean columns (meaning 0 or 1), each marking whether the record belongs to that category or not.
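If it helps to see it outside KNIME, here is a minimal sketch of the same idea in Python with pandas. The column name `region` and its values are invented for the example, and `pd.get_dummies` is just one common way to do this, not what the One to Many node does internally.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with 4 distinct categories.
df = pd.DataFrame({"region": ["north", "south", "east", "west", "north"]})

# One 0/1 column per category; each row has exactly one 1.
encoded = pd.get_dummies(df["region"], prefix="region", dtype=int)

print(encoded)
```

Each of the 4 categories becomes its own column, and every row contains a single 1 in the column of its category.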

This is the theory.

BUT…

You specifically mentioned XGBoost. XGBoost is a tree-based algorithm, and how susceptible it is to the issue explained above depends on the exact implementation and on your categorical values.

See for example this blog post. The point is that there is no clear answer; it's complex and it depends on your data.

If you have just a few columns, all with just a handful of possible values (probably good to clean these first!), then go for one-hot encoding. However, if you have many categorical columns each having many possible values, then one-hot encoding can itself lead to problems: each new column will have very few instances of 1 and will mostly be 0, which will lead the tree to “think” the column is unimportant because it barely contains any information.
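A rough sketch of that sparsity effect, again in Python with pandas. The 11,000 rows match the question; the 500 distinct company values and the random assignment are assumptions made up for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

n_rows = 11_000      # row count from the question
n_companies = 500    # hypothetical number of distinct company names

# Simulated high-cardinality categorical column (random, for illustration only).
company = pd.Series(rng.integers(0, n_companies, size=n_rows)).astype(str)

# One-hot encode; uint8 keeps the wide 0/1 table small in memory.
encoded = pd.get_dummies(company, prefix="company", dtype="uint8")

# Fraction of rows that are 1 in each one-hot column.
share_of_ones = encoded.mean()

print(encoded.shape[1], "one-hot columns")
print("largest share of 1s in any column:", share_of_ones.max())
```

With this many categories, every single one-hot column is almost entirely 0, so no individual column offers the tree much to split on.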

So again: ~5 columns with ~5 possible values? Do one-hot encoding. Otherwise, reply here with a better description of your data.


Thank you for the quick response, @beginner.
My data has 8 categorical columns, such as sales name and company name, which have a lot of distinct values. My target is to predict whether a transaction is successful or not. I applied the One to Many node to my categorical columns as the one-hot encoding, but with my simple workflow the predictive model shows 100% accuracy.
Any other suggestions? Thank you so much.

Share your workflow, and if you can't share the data, obfuscate it and use that for sharing the workflow.

100% accuracy means you are making a mistake.
