Thanks for sharing all of the above information; using the H2O nodes seems interesting. After some consideration I decided not to predict individual customer spend per day but rather over a full year (i.e. based on the previous 365 days of spend, how much will this customer spend in the next 365 days?). The input attributes I have prepared for modelling are:
CustomerID - The ID of the customer
FrequencyScore - A numeric score from 1 to 100 depending on how often the customer shops
LatencyScore - A numeric score from 1 to 100 depending on when the customer last shopped
SpendFirstHalfYear - The monetary spend of historic purchases from the first half of the year
SpendSecondHalfYear - The monetary spend of historic purchases from the second half of the year
My target or output value is then ‘SpendNext360days’. I’m quite keen to try the H2O regression nodes for this, but what would be a good way of sequencing the nodes, and what is the best way to evaluate the accuracy of the regression model?
@ScottF provided us with a simple workflow that shows how you would set up a regression model and evaluate it using the RMSE (root-mean-square error): the lower, the better. The KNIME Numeric Scorer node has you covered there.
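If you want to see what that number means outside of KNIME, here is a minimal Python sketch of the RMSE calculation. The function name and the sample spend values are made up for illustration, and this is just the textbook formula, not necessarily what the scorer node runs internally.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equally long lists of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and the same length")
    # Average the squared errors, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Hypothetical yearly spend per customer (real vs. predicted).
    real_spend = [100.0, 200.0, 300.0]
    predicted_spend = [110.0, 190.0, 300.0]
    print("RMSE:", rmse(real_spend, predicted_spend))
```

The result is in the same unit as the target (here: money per year), which makes it fairly easy to judge whether the error is acceptable for your use case.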
You could switch out the type of model used:
We had a discussion about that here:
There is also further information about ways to determine the quality of a model (mostly for binary 1/0 targets, but check out the link).
Further metrics I have explored are correlation coefficients such as Pearson and Spearman, to see how well your predictions align with your real data.
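To make the difference between the two coefficients a bit more concrete, here is a small plain-Python sketch. The function names and the sample numbers are my own for illustration. Spearman is computed here as Pearson on the ranks, with tied values sharing their average rank.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical example: the prediction gets the order right but not the scale.
    real_spend = [100.0, 200.0, 300.0, 400.0]
    predicted_spend = [1.0, 4.0, 9.0, 16.0]
    print("Pearson: ", pearson(real_spend, predicted_spend))
    print("Spearman:", spearman(real_spend, predicted_spend))
```

In this made-up example Spearman is perfect because the customers are ranked in the right order, while Pearson is lower because the relationship is not linear. Note that neither tells you whether the predicted amounts are actually close to the real ones.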
I have also toyed around with the concept of a Bland-Altman plot, but I am not sure yet whether it gives me any more insight. Basically, it should combine the question of correlation (whether your score points in the right direction) and agreement (whether you actually hit the numbers).
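In case it helps, this is roughly how the numbers behind a Bland-Altman plot could be computed in plain Python. Treat it as a sketch under a few assumptions: the function name and sample values are invented, the difference is taken as predicted minus real, and the limits of agreement use the conventional mean difference plus or minus 1.96 standard deviations.

```python
import statistics


def bland_altman(actual, predicted):
    """Return the points and summary lines of a Bland-Altman plot."""
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    # x-axis: mean of the two values, y-axis: their difference.
    means = [(a + p) / 2 for a, p in zip(actual, predicted)]
    diffs = [p - a for a, p in zip(actual, predicted)]
    bias = statistics.mean(diffs)   # systematic over- or under-prediction
    sd = statistics.stdev(diffs)    # sample standard deviation of the differences
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "lower_limit": bias - 1.96 * sd,
        "upper_limit": bias + 1.96 * sd,
    }


if __name__ == "__main__":
    # Hypothetical yearly spend per customer (real vs. predicted).
    real_spend = [100.0, 200.0, 300.0, 400.0]
    predicted_spend = [110.0, 190.0, 320.0, 380.0]
    result = bland_altman(real_spend, predicted_spend)
    print("bias:", result["bias"])
    print("limits of agreement:", result["lower_limit"], result["upper_limit"])
```

Plotting `diffs` against `means` with horizontal lines at the bias and the two limits gives the usual picture. A bias far from zero would hint at systematic over- or under-prediction, and wide limits would mean individual predictions can be far off even if the correlation looks fine.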
Thank you so much for your help and guidance here and for sharing useful information. I’ll give the H2O regression a try using @ScottF’s regression workflow. If anyone is interested in the outcome, I’ll also post some learnings after the modelling work is complete.