Hello everyone,
I have an important question regarding a machine learning task.
I’m working with sales data to predict the probability of purchase for a specific product. For the model to learn effectively, the dataset must contain negative samples (non-purchases), represented in the Target column (0/1).
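For context, one common way to generate those negative samples is to cross join the distinct entities (e.g. customers × products) and label the pairs that never appear in the purchase history as 0. A minimal pandas sketch; the column names `customer_id` and `product_id` are assumptions, not from the original post:

```python
import pandas as pd

# Hypothetical purchase records: each row is an actual purchase
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "product_id": ["A", "B", "A"],
})

# Cross join every customer with every product to enumerate all pairs
customers = purchases[["customer_id"]].drop_duplicates()
products = purchases[["product_id"]].drop_duplicates()
all_pairs = customers.merge(products, how="cross")

# Pairs found in the purchase history become Target = 1, the rest 0
labeled = all_pairs.merge(
    purchases.assign(Target=1),
    on=["customer_id", "product_id"],
    how="left",
)
labeled["Target"] = labeled["Target"].fillna(0).astype(int)
print(labeled)
```

The synthetic negative rows are exactly the ones where the left join found no match, which is why all the transaction-level columns (invoice, branch, payment, date) come back as NaN for them.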
After performing the cross join, I ended up with missing values in my dataset. Most of them are manageable, but a few key columns pose a specific challenge:
- Invoice Number Column: I decided to drop this column.
- Branch Column: I chose to impute this with any branch, since the products are similar across all locations, making the specific branch irrelevant for the prediction.
- Payment Method (Cash/Card): This will be NULL/NaN.
My biggest confusion is with the Date column. Should I:
- Set a fixed placeholder date like 1900-01-01 (as I read somewhere)?
- Use NaN (Not a Number)?
I tried consulting AI tools, but the results were disastrous—conflicting advice, to say the least.
What is the best solution that won’t negatively impact the machine learning model’s performance?
Note: I plan to use AutoML, as I believe it will be the most suitable choice.
Any suggestions or experiences would be highly appreciated.
Thank you all!
@mohammad_alqoqa several things. First: IDs are almost never a good variable for predicting anything, since the model would learn how an individual item behaves, and that specific item might not appear again in new data. So you are right to drop the ID.
Concerning dates, I do not like to keep fixed dates in a model, since over time the dates will shift (naturally) and new items will have new dates. I would use a relative variable instead, like the age of an item at the time of the model computation, or the months / days since … With new data the feature would then still carry the same meaning.
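The relative-date idea above can be sketched in a few lines of pandas. The column name `Date` matches the original question; the reference timestamp is an arbitrary assumption standing in for "when the model is computed":

```python
import pandas as pd

# Hypothetical dataset; NaT marks rows with no purchase date
df = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-15", "2023-06-01", None]),
})

# Reference point: the moment the training set / model is built
reference = pd.Timestamp("2024-01-01")

# Relative feature: days since the event. Unlike a fixed placeholder
# date, this stays meaningful as the data shifts forward in time.
df["days_since"] = (reference - df["Date"]).dt.days

# Rows without a date stay NaN, so the missingness is still visible
# and can be imputed explicitly in a later step.
print(df)
```

A fixed placeholder like 1900-01-01 would instead turn the missing rows into an extreme outlier value (tens of thousands of days), which many models interpret as a real, very old date rather than as "unknown".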
One thing you could do about the missing values is to impute them, either with simple values (marking a missing numeric as -99, or a missing string as NA) or, more advanced, with model-based imputation.
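The simple sentinel-value approach looks like this in pandas; the column names and values are illustrative assumptions, not from the original dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical rows after the cross join; NaN/None mark the
# generated non-purchase rows that never had real transaction data
df = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0],
    "payment": ["Cash", None, "Card"],
})

# Sentinel imputation: -99 for numerics, "NA" for strings, so the
# model can treat "was missing" as its own distinct signal
df["amount"] = df["amount"].fillna(-99)
df["payment"] = df["payment"].fillna("NA")

print(df)

# A more advanced, model-based alternative would predict each missing
# value from the other columns (e.g. scikit-learn's IterativeImputer).
```

The sentinel trick works best with tree-based models, which can split cleanly on the marker value; for linear models a separate 0/1 "was missing" indicator column is usually safer.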
As it happens, I have a relatively fresh collection of such imputation approaches out there for numeric and string data; you could give this a try.
Another thing to explore is a package like vtreat, which automatically tries to prepare your data in the best way possible. You can read about this in my article:
https://medium.com/p/efcaf58fa783
And since we are at it: I can offer a collection of algorithms, including AutoML, to try to build classification models.
KNIME also has a Component that uses a lot of models in an AutoML setting: