Hi Everyone

I’m a student working with a new dataset for a final project. The dataset is fairly large, about 8000 rows. My goal is to predict sales for the 1500 products in my dataset, for each store. I have tried linear regression in different forms, random forest regression, parameter optimisation… but in all of them I have to drop the product ID, because the models don’t cope well with a categorical ID covering 1500 products. Can anyone help?

Hi,

Is the product ID helpful for the regression in any way? What are your other features? It might be better to build a separate model for each product rather than including the ID as a feature.

Kind regards

Alexander

But we have 1500 products, so I don’t think I can build a model for each product. I’m trying to make predictions with all types of models, and I can’t get above an R² of 0.58, which is not very good for prediction.

Hi @camvinga_19,

I agree with @AlexanderFillbrunn: does the product_ID help you in building a model? I doubt it. The same could be said for the Store_ID. Maybe it’s even worthwhile to start with only the product features (to see whether sales can be explained by the product features alone, independent of the store). More is not always better.

Another thing to keep in mind when using regression: have you normalized all numeric columns upfront? Columns with large values tend to dominate a model.
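To illustrate that normalization point outside of KNIME, here is a minimal Python/scikit-learn sketch (the numbers are made up): standardization rescales each column to mean 0 and unit variance so that a large-valued column cannot dominate.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: one column in the thousands, one small.
X = np.array([[12000.0, 1.2],
              [ 8000.0, 0.7],
              [15000.0, 2.1]])

# StandardScaler rescales each column to mean 0 and standard deviation 1,
# so distance- or gradient-based learners treat the columns comparably.
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 per column
print(X_scaled.std(axis=0))   # approximately 1 per column
```

In KNIME the Normalizer node plays the same role; the sketch only shows what the transformation does.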

Well, did you do any data preparation? Have you checked which features are actually helpful? Do you have sales for each product at different points in time or just at one particular time? Just throwing some regression algorithm at it might not be enough. Are there any other people who have worked with this data set and published results?

Kind regards

Alexander

I have converted all the features I could to 0-1-2… codes, then I normalized them, and now I’m trying some regression learners… and my results are still no more than an R² of 0.58.

Converting nominal values into 0-1-2-… might not give the expected results, because it imposes an artificial numeric order on the categories. You cannot capture, e.g., 7 different categories in one numerical column. This needs 6 dummy columns: for each row, either exactly one of them is 1 and the rest are 0, or all of them are 0 (the reference category).

The following text is taken from the description of the Linear Regression Learner node.

Values

Specifies the independent columns that should be included in the regression model. Numeric and nominal data can be included, whereby for **nominal data dummy variables are automatically created** as described in the section Categorical variables in regression.
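To make the dummy-column idea concrete, here is a minimal pandas sketch (the column and category names are invented; KNIME’s Linear Regression Learner does this automatically): k categories become k-1 indicator columns, and the dropped reference category is encoded as all zeros.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

# drop_first=True keeps k-1 dummy columns for k categories; the dropped
# reference category ("blue", first alphabetically) is the all-zeros row.
dummies = pd.get_dummies(df["color"], prefix="color",
                         drop_first=True, dtype=int)
print(dummies)
```

The "blue" row has 0 in both remaining columns, while "green" and "red" each light up exactly one indicator.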

@camvinga_19 First: it would go a long way if you could provide the data set so we could inspect it ourselves. But of course, this might not be possible.

A few remarks:

- it is not really clear what your data contains. Are these sales from one period, or do you have an additional time variable?
- one-hot encoding might not address your problem with the categorical data sufficiently
- if you treat the product as a categorical variable and you have nearly as many products as you have rows/prices, it might already be the most important variable, since each product has a price. The question then is what else is there; you might want to look at correlations and see which variable provides the best predictive power towards the price
- you also have the individual store. What role does that play? Are the stores supposed to set their own prices? The combination of store_id and product might already be enough to ‘predict’ a price (but that might not be correct, since you might not have other factors), and you might not have enough data to handle new combinations of store and product
- you could try techniques like label encoding (cf. below) for these categories. Some algorithms already offer that, e.g. some H2O methods
- you might also try H2O’s AutoML stack [it will also give you variable importances], but it will not address the conceptual questions regarding time and the number of products
- you could try to reduce the dimensions (1|2) by using principal components or some advanced data preparation with vtreat or featuretools (1|2|3). But with so little data this might not help; you might have difficulty explaining your results, and it might be over-engineering (and, again, it would not solve the problem that you might not have enough data to answer your question)
- also, I did not understand whether your second file contains all the data of the training dataset minus the price
- one more note on your statistics: R² is widely used but has its limitations. Kaggle competitions often use RMSE to judge the quality of models. I would always advise also inspecting some plots to see what your predictions are doing.
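On that metrics point, a minimal Python/scikit-learn sketch (with made-up numbers) of computing both R² and RMSE for the same predictions; RMSE has the advantage of being in the same units as the target:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true sales and model predictions.
y_true = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
y_pred = np.array([11.0, 11.5, 10.0, 13.0, 11.0])

r2 = r2_score(y_true, y_pred)
# RMSE: square root of the mean squared error, in target units.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"R^2 = {r2:.3f}, RMSE = {rmse:.3f}")
```

Looking at both, plus a scatter plot of predicted vs. actual values, usually tells you more than R² alone.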

Maybe you can tell us a bit more about your thoughts, the data, and the progress of your task.

----------- label encoding ---------------

You might want to be careful with this knife. Label (target) encoding will, to a certain extent, intentionally ‘leak’ information about the target (price) into the training data. That might be OK if you expect the relationship between product and price to remain stable. In that case it might be legitimate to represent a product by its average price.

The problem arises if you encounter lots of new and unknown products. You could represent them with an overall average or some missing-value replacement technique, but again, that might not work; you would have to see.
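A minimal pandas sketch of that idea (invented data; this is plain mean/target encoding, not any specific H2O implementation): each product is replaced by its average training price, and products unseen in training fall back to the global mean.

```python
import pandas as pd

train = pd.DataFrame({
    "product": ["A", "A", "B", "B", "C"],
    "price":   [10.0, 12.0, 20.0, 22.0, 5.0],
})

# Mean target (price) per product, learned on the training data only --
# computing it on the test data too would leak the test targets.
means = train.groupby("product")["price"].mean()
global_mean = train["price"].mean()

test = pd.DataFrame({"product": ["A", "C", "D"]})  # "D" is unseen

# Unseen products ("D") get the global mean as a fallback.
test["product_enc"] = test["product"].map(means).fillna(global_mean)
print(test)
```

This is exactly where the leakage warning applies: the encoding must be fitted on training rows only, and the fallback for unseen products is a guess you have to validate.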

Hi @camvinga_19 -

I’ve seen a few requests in recent weeks for help on class problems involving the Piggly Wiggly dataset. Do you mind sharing which class this is for, and who your instructor is?

We like to keep track of who is using KNIME in the classroom where we can. If you’d rather not share here, would you mind dropping me a quick note at scott.fincher@knime.com?

Thanks!

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.