Decision tree

Hi there! I have a dataset in which 20% of the “age” column is missing. Is there a way to predict the missing ages using a decision tree? Or would you recommend another method to predict or fill in the data?

Hi,
You have to try it out. Take all the records where the age is present, split them into a training and a test set (e.g. 70/30), and train a regression tree on the training set. Then apply it to the test set and see how well it predicts the actual age. If the results are good enough, train the tree again on all records where the age is known and use it to predict the age in the records where it is missing. If the predictions on the test set are too far off, you can fall back on simple methods like filling in the mean age.

How you treat the missing values also depends on what you want to do with the data later. If you want to train another model for a target variable, it might also make sense to compute a binary feature “ageMissing” that is true if the age is missing and false otherwise. Older or younger people may be less likely to disclose their age, in which case this new ageMissing feature could be correlated with your target variable.
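
In case it helps, here is a rough sketch of that workflow in Python with scikit-learn. Everything specific in it is an assumption for illustration only: the data is synthetic (two made-up predictor columns and an age that depends on them), the tree depth of 5 is arbitrary, and "good enough" is interpreted here as "lower mean absolute error than always predicting the mean age". You would replace all of these with your own data and your own quality threshold.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not real data): two made-up
# predictors and an age that depends on them, with about 20% blanked out.
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))
age = 40 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=3, size=n)
age[rng.random(n) < 0.2] = np.nan

# Binary "ageMissing" feature: True where the age is missing.
age_missing = np.isnan(age)
known = ~age_missing

# 70/30 split of the records where the age is known.
X_train, X_test, y_train, y_test = train_test_split(
    X[known], age[known], test_size=0.3, random_state=0
)

# Regression tree trained on the training part only.
# max_depth=5 is an arbitrary choice for this sketch.
tree = DecisionTreeRegressor(max_depth=5, random_state=0)
tree.fit(X_train, y_train)

# Compare the tree against the simple baseline of predicting the mean age.
mae_tree = mean_absolute_error(y_test, tree.predict(X_test))
mae_mean = mean_absolute_error(y_test, np.full_like(y_test, y_train.mean()))
print(f"MAE tree: {mae_tree:.2f}  MAE mean baseline: {mae_mean:.2f}")

age_filled = age.copy()
if mae_tree < mae_mean:
    # Good enough (by the assumed criterion): retrain on all known ages
    # and predict the missing ones.
    tree.fit(X[known], age[known])
    age_filled[age_missing] = tree.predict(X[age_missing])
else:
    # Otherwise fall back to filling in the mean age.
    age_filled[age_missing] = age[known].mean()
```

The `age_missing` array can then be added as an extra column next to `age_filled` when you train the model for your actual target variable.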
Kind regards,
Alexander
