Hi,
This ticket requests Laplace Smoothing as an additional option in the Naive Bayes Learner node, with the pseudocount alpha as a parameter. Concerning the implementation, I have no preference between adjusting the counts table calculated by the NB Learner node and handling the actual adjustment in the NB Predictor node.
The Naive Bayes Learner currently offers the default probability threshold parameter to handle zero probability situations: instead of 0, the learner, or rather the predictor, applies the default probability threshold. The zero probability situation is thus dealt with after the probability calculation, as sketched below.
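As I understand it, the current mechanism amounts to something like the following (a Python sketch of my reading, not the node's actual code; the names are mine, and I am assuming the default replaces each zero conditional probability):

```python
def probability_with_default(count_xv_c, count_c, default_prob):
    # Raw relative frequency, replaced by the default only when it would be 0.
    p = count_xv_c / count_c
    return p if p > 0 else default_prob
```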
For categorical data, Laplace Smoothing instead handles the zero probability situation during the probability calculation, i.e. by adding the pseudocount alpha to both the numerator and the denominator (in the latter, the pseudocount is multiplied by the number of distinct values the attribute can take). This way, a probability can never become zero.
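To make the calculation concrete, here is a minimal sketch; the function name, variable names, and the alpha default of 1 are my own illustration, not anything prescribed for the node:

```python
def smoothed_probability(count_xv_c, count_c, n_values, alpha=1.0):
    """Laplace-smoothed estimate of P(attribute = v | class = c).

    count_xv_c : instances of class c where the attribute takes value v
    count_c    : total number of instances of class c
    n_values   : number of distinct values the attribute can take
    alpha      : pseudocount (alpha = 1 is classic Laplace smoothing)
    """
    return (count_xv_c + alpha) / (count_c + alpha * n_values)

# Even a value never observed for class c gets a small non-zero probability:
# smoothed_probability(0, 50, 4) == 1 / 54 instead of 0.
```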
When should Laplace Smoothing be used over the default probability when the data are discrete?
- Laplace Smoothing adjusts both overly optimistic (probability of 100%) and overly pessimistic (probability of 0%) estimates, while the default probability only addresses the zero probability situation.
- The impact of the correction applied by Laplace Smoothing vanishes asymptotically as the number of instances increases (see the sketch after this list). This property may be a slight intuitive advantage in text mining, where it can be relatively difficult to define a sufficiently low default probability when faced with very low counts. Provided the internal precision of the probability calculation is high enough, the asymptotic property still preserves the benefit of dealing with the non-0 and non-1 probability situations, even with a very large number of instances.
- When the overall number of instances is low, the underlying probability estimates tend to be biased, and Laplace Smoothing addresses this situation as well. Obviously, having more data (a higher number of instances) in the first place, rather than adjusting the probability calculation afterwards, is statistically sounder, but that is another discussion.
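To illustrate the asymptotic point, and the adjustment of both the 0% and 100% estimates, here is a short sketch (the counts, the number of attribute values, and the default probability value are all made up for illustration):

```python
def smoothed_probability(count_xv_c, count_c, n_values, alpha=1.0):
    return (count_xv_c + alpha) / (count_c + alpha * n_values)

n_values = 5          # distinct attribute values (assumed)
default_prob = 1e-4   # a fixed default probability, for comparison

for count_c in (10, 100, 10_000, 1_000_000):
    never = smoothed_probability(0, count_c, n_values)         # raw estimate: 0
    always = smoothed_probability(count_c, count_c, n_values)  # raw estimate: 1
    print(f"n={count_c:>9}  P(never seen)={never:.1e}  "
          f"P(always seen)={always:.6f}  default={default_prob:.1e}")

# The smoothed probability of a never-seen value shrinks with the class count
# (about 6.7e-02 at n=10, about 1.0e-06 at n=1,000,000), and the probability of
# an always-seen value is pulled below 1, while the fixed default probability
# stays at 1e-4 regardless of how much evidence there is.
```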