dataset piggly wiggly

@camvinga_19 First: it would go a long way if you could provide the data set so we could inspect it ourselves. Of course, this might not be possible.

A few remarks:

  • it is not really clear what your data contains. Are these sales from one period, or do you have an additional time variable?
  • one-hot encoding might not address your problem with the categorical data sufficiently
  • if the product is a categorical variable and you have nearly as many products as you have rows/prices, the product might already be the most important variable, since each product might simply have its own price. Then the question would be what else there is: you might want to look at correlations and see which variable provides the best predictive power for the price
  • you also have the individual store. What role does it play? Are stores supposed to set their own prices? The combination of store_id and product might already be enough to ‘predict’ a price (but that might not be correct, since other factors may be missing from your data). And you might not have enough data to handle new combinations of store and product
  • you could try techniques like label encoding (cf. below) for these categories. Some algorithms already offer this, for example some H2O methods
  • you might also try H2O’s AutoML stack (which will also give you variable importances), but it will not address the conceptual questions regarding time and number of products
  • you could try to reduce the dimensions by using principal components, or some advanced data preparation with vtreat or featuretools. But with so little data this might not help, you might have difficulty explaining your results, and it might be overengineering (and again, it would not solve the problem that you might not have enough data to answer your question).
  • I also did not understand whether your second file contains all the data of the training dataset minus the price
  • and another note about your statistics: R squared is widely used but has its limitations. Kaggle competitions often use RMSE as a measure to judge the quality of models. I would always advise also inspecting some graphs to see what your predictions look like.
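
To make the point about the statistics a bit more concrete, here is a minimal sketch of how RMSE and R squared could be computed by hand in Python. The prices are made up for illustration, and the function names rmse and r_squared are my own, not taken from any library.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the target (here: price)."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """Share of the variance in the target that the predictions explain."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up prices, for illustration only.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

print("RMSE:", rmse(actual, predicted))
print("R squared:", r_squared(actual, predicted))
```

RMSE tells you how far off the predictions are in price units, which R squared alone does not, so it is worth reporting both.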

Maybe you can tell us something more about your thoughts, the data, and the progress of your task.

----------- label encoding ---------------
You might want to be careful with this knife. To a certain extent, this intentionally ‘leaks’ information about the target (price) into the training data. That might be OK if you expect the relationship between product and price to remain stable, so it might be legitimate to represent a product by its average price. (Replacing a category with the average of the target is more commonly called target or mean encoding.)
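
As a rough illustration of the idea, here is a minimal Python sketch that replaces each product by its average price in the training data. The product names, the prices, and the helper fit_mean_encoding are hypothetical and only meant to show the mechanics.

```python
from collections import defaultdict


def fit_mean_encoding(products, prices):
    """Map each product to its average price (computed on training rows only)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for product, price in zip(products, prices):
        totals[product] += price
        counts[product] += 1
    return {product: totals[product] / counts[product] for product in totals}


# Made-up training data, for illustration only.
train_products = ["milk", "milk", "bread", "bread", "eggs"]
train_prices = [1.0, 1.5, 2.0, 3.0, 4.0]

encoding = fit_mean_encoding(train_products, train_prices)

# The categorical column becomes a numeric one.
encoded_column = [encoding[product] for product in train_products]
print(encoded_column)
```

Note that the mapping should be fitted on the training rows only and then applied to the test rows, otherwise the leak is worse than intended.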

The problem arises if you encounter lots of new and unknown products: you might represent them with an overall average or some missing-value replacement technique. But again, that might not work; you would have to see.
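
One possible way to handle rare and unknown products, sketched under my own assumptions: shrink each product's average towards the overall average price, and use the overall average for any product that never appeared in training. The helpers fit_smoothed_encoding and encode are hypothetical, and the smoothing weight of 2.0 is an arbitrary choice you would have to tune.

```python
from collections import defaultdict


def fit_smoothed_encoding(products, prices, weight=2.0):
    """Average price per product, pulled towards the overall average.

    `weight` acts like a number of pseudo-observations at the overall
    average (an assumed value; tune it on your own data).
    """
    global_mean = sum(prices) / len(prices)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for product, price in zip(products, prices):
        totals[product] += price
        counts[product] += 1
    mapping = {
        product: (totals[product] + weight * global_mean) / (counts[product] + weight)
        for product in totals
    }
    return mapping, global_mean


def encode(products, mapping, fallback):
    """Encode products; unknown ones get the fallback value."""
    return [mapping.get(product, fallback) for product in products]


# Made-up training data, for illustration only.
train_products = ["milk", "milk", "bread", "bread"]
train_prices = [1.0, 3.0, 5.0, 7.0]

mapping, global_mean = fit_smoothed_encoding(train_products, train_prices)

# "cheese" was never seen in training, so it falls back to the overall average.
new_encoded = encode(["milk", "cheese"], mapping, fallback=global_mean)
print(new_encoded)
```

Whether the overall average is a sensible stand-in depends on how different the new products are from the known ones, so you would still have to check the results.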
