Predicting sales - best algorithm to start in a simple way

Hi. I’m building my first predictive model. I have real purchase data going back 5 years, and I want to predict likely purchasers of product “A”.
I have more data to feed the model, but I want to keep it simple for the first attempts. So, I have the data organized this way:

Client ID
Client variables (like country or number of employees)
Quarter + product type columns holding revenue. E.g., if we sell a product named B, I have columns like 2015Quarter1_ProductB_Revenue, 2015Quarter2_ProductB_Revenue and so on, for each product and quarter up to 2019Q3. So my target is buyers of product A in 2019 quarter 3.

So, to start with a simple model, which algorithm do you recommend? I need the prediction to return 1 or 0.
And finally, how can I measure the performance of the model?

I’ve made some attempts with Polynomial Regression and ROC curves, but I want to learn from those who know!


Hi @Coco2806,

Do you also have a target variable in your dataset with values 1 and 0? If not, could you please share at least the first few rows of the dataset?
If the aim of the analysis is to predict a variable with two classes, i.e. 0 and 1, then you’ll need an algorithm that solves a classification problem.
For instance, in KNIME Analytics Platform you can start with a Decision Tree. It learns a tree of simple decisions (splits on the input columns) and assigns a class at the end of each branch, so the resulting model is easy to read.
Then, you can measure its performance with the Scorer node. It compares two columns (the actual and the predicted class) by their attribute value pairs and shows the confusion matrix.
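
Outside KNIME, the same idea can be sketched in a few lines of plain Python. This is only an illustration, not what the Decision Tree Learner node does internally. It assumes that the target is 1 when the 2019Quarter3_ProductA_Revenue column is positive, uses invented revenue values, takes a single hypothetical input column (the previous quarter), learns a one-split tree (a decision stump), and then counts actual vs. predicted pairs the way a confusion matrix does. The helper names (row, split_xy, fit_stump, predict) are made up for this sketch.

```python
from collections import Counter

# Column names follow the pattern from the question; the revenue values are invented.
FEATURE = "2019Quarter2_ProductA_Revenue"        # input: a quarter before the target
TARGET_SOURCE = "2019Quarter3_ProductA_Revenue"  # only used to derive the 0/1 target


def row(q2_revenue, q3_revenue):
    return {FEATURE: q2_revenue, TARGET_SOURCE: q3_revenue}


train = [row(0, 0), row(120, 90), row(0, 0), row(300, 250), row(50, 0), row(80, 40)]
test = [row(0, 0), row(200, 150), row(40, 20), row(70, 0), row(500, 400)]


def split_xy(rows):
    """Return the feature values and the derived target (1 = bought A in 2019 Q3)."""
    x = [r[FEATURE] for r in rows]
    y = [1 if r[TARGET_SOURCE] > 0 else 0 for r in rows]
    return x, y


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def majority(labels):
    """Most frequent label in the list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(x, y):
    """Pick the threshold with the lowest weighted Gini impurity."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or impurity < best[0]:
            best = (impurity, threshold, majority(left), majority(right))
    return best[1:]  # (threshold, label_if_below, label_if_above)


def predict(stump, xi):
    threshold, below, above = stump
    return below if xi <= threshold else above


x_train, y_train = split_xy(train)
x_test, y_test = split_xy(test)

stump = fit_stump(x_train, y_train)
y_pred = [predict(stump, xi) for xi in x_test]

# Confusion matrix as counts of (actual, predicted) pairs, plus accuracy.
confusion = Counter(zip(y_test, y_pred))
accuracy = sum(a == p for a, p in zip(y_test, y_pred)) / len(y_test)

print("threshold:", stump[0])
for actual in (0, 1):
    print("actual", actual, [confusion[(actual, predicted)] for predicted in (0, 1)])
print("accuracy:", accuracy)
```

Two points matter more than the toy algorithm here: the inputs come only from quarters before 2019 Q3, so the model cannot peek at the target, and the scoring is done on rows that were not used for training.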

On the EXAMPLES Server you will find several workflows that implement a decision tree. For instance, take a look at this one: knime://EXAMPLES/04_Analytics/04_Classification_and_Predictive_Modelling/01_Example_for_Learning_a_Decision_Tree

If, instead, the aim of your analysis is to predict the future values of a time series, you can use the past values to predict the future value of the target time series, y(t+N), N being an arbitrary integer number.
In this case you could start with an auto-regressive model, say with the “Polynomial Regression (Learner)” node that you mentioned. This node models y by using a polynomial regression of y on (x1, x2, x3, …), where the x columns would be the past values of the series.
In this case you will be able to measure the performance of the model with the Numeric Scorer node.
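
To make the auto-regressive idea concrete, here is a rough plain-Python sketch, again not the internals of any KNIME node. It assumes an invented quarterly revenue series, a single lag (y(t-1) as the only input), a degree-2 polynomial fitted by ordinary least squares, and the last three quarters held out for scoring. The function names (solve, fit_poly, predict) are hypothetical, and the error measures at the end (mean absolute error, root mean squared error, R²) are the usual numeric scores for a regression.

```python
import math

# Invented quarterly revenue series for one product, oldest value first.
series = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0, 37.0, 45.0, 54.0, 64.0]

# Lagged pairs: the input is y(t-1), the value to predict is y(t).
x = series[:-1]
y = series[1:]

# Keep the time order: train on the early quarters, score on the last three.
x_train, y_train = x[:-3], y[:-3]
x_test, y_test = x[-3:], y[-3:]


def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [a_row[:] + [b[i]] for i, a_row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_poly(x, y, degree=2):
    """Least-squares polynomial fit via the normal equations (X^T X) coef = X^T y."""
    feats = [[xi ** p for p in range(degree + 1)] for xi in x]
    k = degree + 1
    xtx = [[sum(f[i] * f[j] for f in feats) for j in range(k)] for i in range(k)]
    xty = [sum(f[i] * yi for f, yi in zip(feats, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, xi):
    """Evaluate coef[0] + coef[1] * xi + coef[2] * xi**2 + ..."""
    return sum(c * xi ** p for p, c in enumerate(coef))


coef = fit_poly(x_train, y_train, degree=2)
y_pred = [predict(coef, xi) for xi in x_test]

# Numeric scores on the held-out quarters.
errors = [p - a for p, a in zip(y_pred, y_test)]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
mean_y = sum(y_test) / len(y_test)
r2 = 1 - sum(e * e for e in errors) / sum((a - mean_y) ** 2 for a in y_test)

print("coefficients:", coef)
print("MAE:", mae, "RMSE:", rmse, "R^2:", r2)
```

With more lags, each extra past value simply becomes another input column. Note that this branch predicts a revenue number; to get the 1 or 0 asked for in the question, the classification route above is the more direct one.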
To implement this, you can take a look at the whitepaper on time series prediction that we published on the KNIME website. I think it might be helpful for you!

Hope that helps!