Important Question Regarding Missing Data Imputation for Purchase Probability Model

Hello everyone,

I have an important question regarding a machine learning task.

I’m testing sales data to predict the probability of purchase for a specific product. For the model to learn effectively, the dataset must contain negative samples (non-purchases), which are represented in the Target column (0/1).

After performing a natural cross join, I ended up with missing values in my dataset. Most of the values are manageable, but I’ve run into a specific challenge with some key missing data points:

  • Invoice Number Column: I decided to drop this column.

  • Branch Column: I chose to impute this with any branch since the products are similar across all locations, making the specific branch irrelevant for the prediction.

  • Payment Method (Cash/Card): This will be NULL/NaN.

My biggest confusion is with the Date column. Should I:

  1. Set a fixed, placeholder date like 1900-01-01 (as I read somewhere)?

  2. Use NaN (Not a Number)?

I tried consulting AI tools, but the results were disastrous—conflicting advice, to say the least.

What is the best solution that won’t negatively impact the machine learning model’s performance?

Note: I plan to use AutoML, as I believe it will be the most suitable choice.

Any suggestions or experiences would be highly appreciated.

Thank you all!

@mohammad_alqoqa several things. First: I think IDs are almost never a good variable for predicting anything, since the model would only learn how one individual item behaves, and that specific item might not appear again in new data. So you are right to drop the ID.

Concerning dates, I do not like to keep fixed dates in a model, since the dates will naturally shift over time and new items have new dates. I would use a relative variable like the age of an item at the time of the model computation, or the months / days since … Then, with new data, the feature would still carry the same meaning.
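To make the idea of a relative date feature concrete, here is a minimal sketch in Python with pandas. The column names (`Date`, `Target`), the helper `add_days_since`, and the reference date are assumptions for illustration only, not something taken from the original dataset. Rows without a date (the negative samples) simply end up with a missing value in the new feature instead of a placeholder like 1900-01-01.

```python
import pandas as pd


def add_days_since(df, date_col, reference_date):
    """Replace an absolute date column with 'days since' a reference date.

    Hypothetical helper: missing or unparseable dates become NaT and
    therefore NaN in the new 'days_since' column.
    """
    out = df.copy()
    dates = pd.to_datetime(out[date_col], errors="coerce")
    # Difference between the reference date and each row's date, in whole days.
    out["days_since"] = (pd.Timestamp(reference_date) - dates).dt.days
    # Drop the absolute date so the model only sees the relative feature.
    return out.drop(columns=[date_col])


# Toy data (assumed): the second row is a non-purchase without a date.
sales = pd.DataFrame(
    {"Date": ["2024-01-01", None, "2024-03-15"], "Target": [1, 0, 1]}
)
features = add_days_since(sales, "Date", "2024-04-01")
print(features)
```

When scoring new data, the same function would be called with the current scoring date as the reference, so the feature keeps its meaning over time.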

One thing you could do about missing values is to impute them, either with simple values, like marking a missing value as -99 or NA (in the case of strings), or with more advanced techniques that impute them with models.
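As a rough sketch of the simple variant, this is how marking missing values with -99 (numeric) or "NA" (strings) could look in Python with pandas. The function name `impute_simple` and the toy columns `Amount` and `Payment` are made up for illustration. Whether a sentinel like -99 is appropriate depends on the model; tree-based models usually cope with it, while linear models may not.

```python
import pandas as pd


def impute_simple(df, numeric_fill=-99, string_fill="NA"):
    """Fill missing values with fixed markers, chosen by column type.

    Hypothetical helper: numeric columns get `numeric_fill`,
    all other columns get `string_fill`.
    """
    out = df.copy()
    for col in list(out.columns):
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(numeric_fill)
        else:
            out[col] = out[col].fillna(string_fill)
    return out


# Toy data (assumed): the second row has no amount and no payment method.
raw = pd.DataFrame(
    {"Amount": [10.0, None, 5.0], "Payment": ["Cash", None, "Card"]}
)
imputed = impute_simple(raw)
print(imputed)
```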

As it happens, I have a relatively fresh collection of such imputations out there for numeric and string data; you could give this a try.

Another thing to explore is a tool like vtreat, which automatically tries to prepare your data in the best way possible. You can read about this in my article:

https://medium.com/p/efcaf58fa783

And since we are at it: I can offer a collection of algorithms, including AutoML, to try and build classification models:

KNIME also has a Component that uses a lot of models in an AutoML setting:
