date of birth - missing value

Hi there,

I'm dealing with some missing values for the variable "date of birth". Do you have any suggestions on how to treat them? Should I impute the median, or delete the rows with missing values (there are quite a lot of them)?

Best regards

Janette

Well it all depends on your data and whether you think those with missing values could be predicted. 

Take a look at the Missing values node, there are lots of options to choose from.

simon.

Are you performing supervised analysis? What's the target variable? What is the role of date of birth? Is it relevant to your analysis? Can it be replaced by a good proxy that has no missing values? Can you see any pattern in the missing values (e.g., a subpopulation with more missing values for this variable than others)?

The answers to these questions will guide you through the process.

Imputation is an option if:

  • there are few missing values;
  • or you know or have tried exploring why they are missing;
  • or the variable is not important to the analysis but still relevant;
    etc

However, imputing the mean or the median, or dropping the observations with missing values, is rarely the best first choice. First, explore why the values are missing. Random imputation methods (e.g., random hot-deck) work well when you know the reason for the missing values and can express that reason as strata. Finally, there may be another data set for the same population with fewer missing values on the same or similar variables.
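To make the random hot-deck idea a bit more concrete, here is a minimal Python sketch. It assumes each record is a dict, that missing values are stored as None, and that you already have a stratum variable; the field names (`customer_type`, `birth_year`) and the function name `hot_deck_impute` are made up for the example. Each missing value is replaced by a value drawn at random from an observed record in the same stratum.

```python
import random

def hot_deck_impute(rows, target, stratum, seed=0):
    """Fill missing `target` values with a value drawn at random from
    a donor row in the same stratum (random hot-deck)."""
    rng = random.Random(seed)
    # Collect the observed (donor) values per stratum.
    donors = {}
    for row in rows:
        if row[target] is not None:
            donors.setdefault(row[stratum], []).append(row[target])
    imputed = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay untouched
        if new_row[target] is None:
            pool = donors.get(new_row[stratum])
            # Leave the value missing when the stratum has no donors.
            if pool:
                new_row[target] = rng.choice(pool)
        imputed.append(new_row)
    return imputed

# Hypothetical example records; None marks a missing birth year.
rows = [
    {"customer_type": "guest", "birth_year": 1985},
    {"customer_type": "guest", "birth_year": None},
    {"customer_type": "member", "birth_year": 1970},
    {"customer_type": "member", "birth_year": 1992},
    {"customer_type": "member", "birth_year": None},
    {"customer_type": "trial", "birth_year": None},
]
filled = hot_deck_impute(rows, "birth_year", "customer_type")
```

Note that a stratum without any observed value (here "trial") stays missing, which is usually a sign that the strata need to be rethought.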

Many thanks for your answers!!!

On the basis of historical purchase data I want to generate a model that predicts the probability that a certain purchase is converted into a return. 

The data set contains about 480,000 rows, of which about 50,000 have a missing date of birth. Unfortunately, I do not know much about why the values are missing.

First, try to understand how these data are collected. This may already provide the main reason for non-response.

To understand this further, create a boolean variable that is true if date of birth is missing and false otherwise. Then train an (ensemble) decision tree or JRip learner on the data set to explain that boolean variable. Don't include the actual target variable (the one that indicates a return), and leave out date of birth itself! Check which variables the model(s) use most and see whether that gives you an explanation. Please note: there may be no such explanation at all for the non-response.
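If you want to try the idea outside a workflow tool first, the following Python sketch may help. It is not a decision tree or JRip learner; as a simplified stand-in it ranks categorical variables by information gain with respect to the "date of birth is missing" flag, which is the criterion a tree would typically use to pick its first split. The field names (`dob`, `channel`, `payment`, `returned`) and the function name are assumptions for the example.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def rank_missingness_drivers(rows, dob_field, target_field):
    """Rank categorical fields by how much they reduce the entropy of
    the 'date of birth is missing' flag (information gain)."""
    is_missing = [row[dob_field] is None for row in rows]
    base = entropy(is_missing)
    # Exclude date of birth itself and the real target variable.
    features = [f for f in rows[0] if f not in (dob_field, target_field)]
    gains = {}
    for feature in features:
        groups = defaultdict(list)
        for row, flag in zip(rows, is_missing):
            groups[row[feature]].append(flag)
        # Weighted entropy of the flag after splitting on this feature.
        remainder = sum(len(g) / len(rows) * entropy(g)
                        for g in groups.values())
        gains[feature] = base - remainder
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical records: here the birth date is missing for all web orders.
rows = [
    {"dob": None, "channel": "web", "payment": "card", "returned": True},
    {"dob": None, "channel": "web", "payment": "invoice", "returned": False},
    {"dob": "1980-01-01", "channel": "store", "payment": "card", "returned": False},
    {"dob": "1975-05-05", "channel": "store", "payment": "invoice", "returned": True},
]
ranking = rank_missingness_drivers(rows, "dob", "returned")
```

In this toy data `channel` comes out on top because it fully separates missing from non-missing records, while `payment` carries no information about the missingness. On real data, numeric variables would have to be binned first, or you would use a proper tree learner as suggested above.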

Depending on your model type, you could even leave date of birth missing. That state itself could then be used to explain the target.
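One way to do that, sketched here in Python under some assumptions: derive a coarse age band from the birth year and give missing values a band of their own, so that a learner which handles categorical inputs can use "unknown" like any other level. The function name `age_band`, the cut points, and the purchase year are arbitrary choices for illustration.

```python
def age_band(birth_year, purchase_year):
    """Map a birth year to a coarse age band; a missing birth year
    gets its own 'unknown' band instead of being imputed."""
    if birth_year is None:
        return "unknown"
    age = purchase_year - birth_year
    if age < 30:
        return "under_30"
    if age < 50:
        return "30_to_49"
    return "50_plus"

# Hypothetical birth years, all for purchases made in 2015.
bands = [age_band(by, 2015) for by in (1990, 1970, None, 1950)]
```

Whether this helps depends on the model: tree-based learners usually cope well with such a level, while for linear models you would one-hot encode the bands.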