I would like to forecast the numeric sales figure per customer for the next 360 days, and I have quite a lot of attributes that the model can use, including (but not limited to):
Historic sales per customer for the last 360 days
Demographic and geographic information about the customer
E-commerce site visits and frequency/length of these
Past product purchase preferences
Etc.
I’m looking for the best modelling approach to tackle this kind of problem, and I was wondering if anyone has any tips or ideas that could help me get started?
I can think of this example where the team from KNIME tried to solve a Kaggle competition about predicting how many future visitors a restaurant will have, using the H2O.ai nodes. You might be able to adapt that.
Then there was a Kaggle competition on predicting the number of future sales for a German retail chain. The results might also help to give you new ideas (although the winner did not use KNIME):
That also was mentioned in this little discussion.
Thanks for sharing all of the above information; using the H2O nodes seems interesting. After some consideration I decided not to try to predict individual customer spend per day, but rather spend over a full year (i.e. based on the previous 365 days of spend, how much will this customer spend in the next 365 days). The input attributes I have prepared for modelling are:
CustomerID - The ID of the customer
FrequencyScore - A numeric score from 1 to 100 depending on how often the customer shops
LatencyScore - A numeric score from 1 to 100 depending on when the customer last shopped
SpendFirstHalfYear - The monetary spend of historic purchases from the first half of the year
SpendSecondHalfYear - The monetary spend of historic purchases from the second half of the year
My target or output value is then ‘SpendNext360days’. I’m quite keen on trying the H2O regression nodes for this, but what would be a good way of sequencing the nodes, and what is the best way to evaluate the accuracy of the regression model?
@ScottF provided us with a simple workflow that shows how you would set up a regression model and evaluate it using the RMSE (root-mean-square error), where lower is better. The KNIME Numeric Scorer node has you covered there.
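To make the RMSE a bit more concrete, here is a minimal sketch in plain Python (not KNIME) of what the score measures. The spend values are made up for illustration and the function name rmse is just my own choice; the Numeric Scorer node reports this figure for you.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equally long lists of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical yearly spend for four customers (illustrative values only)
actual = [100.0, 250.0, 80.0, 400.0]
predicted = [110.0, 240.0, 100.0, 380.0]

score = rmse(actual, predicted)
print(f"RMSE: {score:.2f}")
```

The result is in the same unit as the target (here, spend), so it is easiest to judge relative to the typical spend per customer.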
You could switch out the type of model used:
We had a discussion about that here:
There is also further information about ways to judge the quality of a model (mostly for binary 1/0 targets, but check out the link).
Further metrics I have explored are correlation coefficients like Pearson and Spearman, to see how well your predictions and your real data align.
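As a rough illustration of the two coefficients, here is a small plain-Python sketch (again not KNIME, and the numbers are made up). The helper names pearson, ranks and spearman are my own; Spearman is computed here as the Pearson correlation of the ranks, with tied values sharing an average rank.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical yearly spend for four customers (illustrative values only)
actual = [100.0, 250.0, 80.0, 400.0]
predicted = [110.0, 240.0, 100.0, 380.0]
doubled = [2 * p for p in predicted]  # predictions that are all twice as large

r_pearson = pearson(actual, predicted)
r_spearman = spearman(actual, predicted)
r_doubled = pearson(actual, doubled)
print(r_pearson, r_spearman, r_doubled)
```

Note that doubling every prediction leaves the Pearson coefficient unchanged, so a high correlation only says the predictions move in the right direction, not that they hit the actual amounts.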
I have also toyed around with the Bland-Altman plot, but I am not sure yet whether it gives me any more insight. Basically it should combine the question of correlation (whether your score points in the right direction) and agreement (whether you actually hit the numbers).
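For anyone who wants to try it, this is a minimal sketch of the numbers behind a Bland-Altman plot, in plain Python with made-up values. The function name bland_altman is my own, and I am assuming the usual convention of bias plus or minus 1.96 standard deviations of the differences as the limits of agreement.

```python
import statistics

def bland_altman(actual, predicted):
    """Return per-row means and differences, the bias, and the limits of agreement."""
    means = [(a + p) / 2 for a, p in zip(actual, predicted)]  # x-axis of the plot
    diffs = [p - a for a, p in zip(actual, predicted)]        # y-axis of the plot
    bias = statistics.mean(diffs)                             # average over/under-prediction
    sd = statistics.stdev(diffs)                              # sample standard deviation
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return means, diffs, bias, limits

# Hypothetical yearly spend for four customers (illustrative values only)
actual = [100.0, 250.0, 80.0, 400.0]
predicted = [110.0, 240.0, 100.0, 380.0]

means, diffs, bias, (lower, upper) = bland_altman(actual, predicted)
print(f"bias: {bias:.2f}, limits of agreement: {lower:.2f} to {upper:.2f}")
```

Plotting diffs against means, with horizontal lines at the bias and the two limits, gives the actual chart. A bias far from zero would point to systematic over- or under-prediction even when the correlation looks good.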
Thank you so much for your help and guidance here and for sharing useful information. I’ll give the H2O regression a try using @ScottF’s regression workflow. If anyone is interested in the outcome, I’ll also post some learnings after the modelling work is complete.