I am implementing a churn prediction analysis with a 'Tree Ensemble Learner' node. It works pretty well, showing an accuracy above 80%.
However, this high accuracy comes mostly from the 'no churn' predictions, which are correct over 95% of the time, while the 'churn' predictions are poor (only about 60% correct). In other words, the off-diagonal terms of the confusion matrix differ by orders of magnitude, or to put it simply: the algorithm predicts 'no churn' well but predicts 'churn' quite badly.
The good news here is that the 'Tree Ensemble Learner' assigns a 'Confidence' value to every single row. As far as I understand, when the confidence is above 50% the row is categorized as 'churn', and when it is below that threshold it is categorized as 'no churn'. I wonder if there is any option within that node that lets me decide the category by means of the confidence, for instance setting values above 70% to 'churn' and below that to 'no churn'. From my perspective this would move some of those low-confidence 'churn' predictions to 'no churn', and I believe it would boost the accuracy of my model considerably.
Obviously I can do this by adding a few more nodes at the output, but it is such a basic setting that I wonder whether it is not already available somewhere.
Thanks for your response!
You can use different strategies to tackle this problem.
Usually such results are caused by unevenly distributed classes, meaning that one class is much more frequent than the other (in your case I guess 'no churn' is the frequent one). To counter this problem you can use the Equal Size Sampling node to create a training dataset in which all classes are equally frequent; this is referred to as downsampling. Another possibility is upsampling, where you artificially create new records for the minority class in order to balance out the class distribution. For this, KNIME offers the SMOTE node, which creates artificial records for the minority class; its drawback is that it only works with numerical features. An alternative is to simply duplicate some of the records of the minority class.
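For readers who want to see the resampling idea outside of KNIME, here is a minimal Python sketch of both strategies (the record structure and the `churn` label key are invented for illustration; the KNIME nodes do this for you):

```python
import random

def downsample(records, label_key="churn"):
    """Equal-size sampling: keep every minority-class record and an
    equally sized random subset of the majority class."""
    minority = [r for r in records if r[label_key] == 1]
    majority = [r for r in records if r[label_key] == 0]
    if len(minority) > len(majority):
        minority, majority = majority, minority
    return minority + random.sample(majority, len(minority))

def upsample_by_duplication(records, label_key="churn"):
    """Naive upsampling: duplicate random minority-class records
    until both classes are equally frequent."""
    minority = [r for r in records if r[label_key] == 1]
    majority = [r for r in records if r[label_key] == 0]
    if len(minority) > len(majority):
        minority, majority = majority, minority
    extra = [random.choice(minority)
             for _ in range(len(majority) - len(minority))]
    return records + extra
```

With 10 'churn' and 90 'no churn' records, downsampling yields a balanced set of 20 rows, while duplication-based upsampling yields a balanced set of 180.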
Provided you have enough data to begin with, it is probably better to use a downsampling approach.
Another strategy is the one you already suggested. Instead of the Confidence, however, I would append the probabilities of the classes (select the option "Append individual class probabilities" in the Tree Ensemble Predictor). You can then use, for example, the Rule Engine to change the prediction depending on the probability of the 'churn' class.
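In plain Python terms, this re-thresholding step amounts to the following sketch (the 0.7 cut-off and the function name are illustrative, not part of any KNIME API):

```python
def predict_with_threshold(p_churn, threshold=0.7):
    """Re-threshold the appended P(churn) probability: only call
    'churn' when the model is at least `threshold` confident."""
    return "churn" if p_churn >= threshold else "no churn"

# Illustrative comparison of the default 0.5 cut-off vs. a stricter 0.7 one
probs = [0.55, 0.72, 0.40, 0.68]
default = [predict_with_threshold(p, 0.5) for p in probs]
strict = [predict_with_threshold(p, 0.7) for p in probs]
```

In the Rule Engine itself this would be two rules of the form `$P (churn)$ >= 0.7 => "churn"` followed by the catch-all `TRUE => "no churn"` (the exact name of the probability column depends on your target column, so check the column list in the node dialog).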
Long story short: no, this option is not available, at least not in the Tree Ensemble nodes. But since achieving the same functionality is as easy as adding one more node, I don't think it is necessary to add it in the future.
Thanks a lot Nemand for your answer.
I finally implemented the option I mentioned: using the 'Math Formula' node, I reassigned the rows predicted as 'churn' with a confidence below 65% (a threshold I arrived at by trial) to 'no churn'. The accuracy of the prediction has grown from 80% to 87%, and both off-diagonal terms of the confusion matrix are now balanced. So it turned out to work pretty well.
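For anyone wanting to reproduce this kind of check outside KNIME, here is a minimal sketch (with invented labels and probabilities, not the poster's data) of how raising the cut-off shifts counts between the confusion-matrix cells:

```python
def confusion(y_true, p_churn, threshold):
    """2x2 confusion counts for a given probability threshold.
    Keys are (actual, predicted) pairs."""
    counts = {("churn", "churn"): 0, ("churn", "no churn"): 0,
              ("no churn", "churn"): 0, ("no churn", "no churn"): 0}
    for actual, p in zip(y_true, p_churn):
        pred = "churn" if p >= threshold else "no churn"
        counts[(actual, pred)] += 1
    return counts

# Toy data: raising the threshold from 0.5 to 0.65 removes a false
# 'churn' prediction at the cost of a new false 'no churn' one.
y = ["churn", "churn", "no churn", "no churn", "no churn"]
p = [0.80, 0.55, 0.60, 0.30, 0.20]
before = confusion(y, p, 0.5)
after = confusion(y, p, 0.65)
```

Whether the trade-off improves overall accuracy depends on the class distribution, which is why the threshold has to be tuned empirically as described above.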
However, I will also give the other methods you suggested a try.