Obtaining the class probabilities through the Naive Bayes Predictor node is easy. However, I am encountering some difficulties re-calculating those probabilities. I wonder whether this would be due to precision or the actual formula.
Suppose a nominal class prediction problem with the following document to predict: word1 word2 word2, with both words being columns with a value of 1 for that instance in the document vector. Suppose the class can take either of two values, A and B.
Please note that in KNIME, I have applied the Number To String node to the term columns in the document vector, so that the Naive Bayes Learner basically sees nominal values such as "0.0" for each term; if I don't do this, it appears that the NB Learner does not calculate the counts correctly.
Given the above, I’d assume that Naive Bayes Predictor would score the instance as follows:
score(class = A) = p(class = A) x p(term = word1 | class = A) x p(term = word2 | class = A)^2
score(class = B) = p(class = B) x p(term = word1 | class = B) x p(term = word2 | class = B)^2
Would that be the correct formula for calculating the score?
Furthermore, would the predicted class probabilities be calculated using the following formula?
probability(predicted class = A) = score(class = A) / sum(score(class = A), score(class = B))
probability(predicted class = B) = score(class = B) / sum(score(class = A), score(class = B))
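With made-up numbers for the priors and conditional probabilities (none of these values come from an actual model), the two-step calculation described above can be sketched in Python:

```python
# Sketch of the scoring and normalization formulas above.
# All probability values here are hypothetical, for illustration only.

# Class priors p(class)
prior = {"A": 0.6, "B": 0.4}

# Conditional term probabilities p(term | class)
p_term = {
    "A": {"word1": 0.2, "word2": 0.5},
    "B": {"word1": 0.4, "word2": 0.1},
}

# Document to score: word1 appears once, word2 appears twice
doc = {"word1": 1, "word2": 2}

def score(cls):
    s = prior[cls]
    for term, freq in doc.items():
        # p(term | class) raised to the term's frequency in the document
        s *= p_term[cls][term] ** freq
    return s

scores = {cls: score(cls) for cls in prior}

# Normalized class probabilities: score / sum of all scores
total = sum(scores.values())
probs = {cls: s / total for cls, s in scores.items()}
print(probs)
```

The normalization step guarantees that the two class probabilities sum to 1, whatever the raw scores are.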
Following further analysis, I have noticed some differences in the calculations for the p(class = ?) part, which I have yet to confirm at the counts level; for now, I have only observed this at the probability level.
The NB Learner provides a counts table broken down by feature and by class, but it does not provide a separate counts table for the class distribution alone. For audit purposes this would be nice to have, even though it is quite trivial to calculate the class distribution before the NB Learner, or to derive it from the provided counts table.
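Deriving the class distribution from a per-feature counts table can be sketched as follows; the table layout and all counts are made up for illustration, one row per (attribute, value, class) combination:

```python
# Hypothetical counts table: one row per (attribute, value, class) with the
# number of training instances. The numbers are invented for this sketch.
rows = [
    ("word1", "0.0", "A", 30), ("word1", "1.0", "A", 10),
    ("word1", "0.0", "B", 50), ("word1", "1.0", "B", 10),
    ("word2", "0.0", "A", 25), ("word2", "1.0", "A", 15),
    ("word2", "0.0", "B", 45), ("word2", "1.0", "B", 15),
]

# Pick any single attribute (assuming it has no missing values) and sum its
# counts per class: every training instance is counted exactly once.
class_counts = {}
for attr, value, cls, count in rows:
    if attr == "word1":
        class_counts[cls] = class_counts.get(cls, 0) + count

print(class_counts)  # {'A': 40, 'B': 60}
```

Dividing each class count by the total then yields the prior p(class).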
EDIT: The PMML output of the NB Learner does contain the target counts, so I have been able to compare the counts. Intriguingly, the counts I have calculated on my side and the ones provided by NB are identical; nevertheless, the probability calculations, including p(class = ?), are not.
Based on the PMML 4.2 documentation for Naive Bayes, I have been able to successfully re-perform the calculations done by the KNIME Naive Bayes Predictor node. The reason I wanted to do this is to be able to understand why any given class wins over another: knowing whether it is due to the prior or to the terms is already a great piece of information. Obviously, with several hundred terms, the analysis is a bit more complex.
Here are my observations:
- the counts table does contain the class counts (contrary to my remark above); in the case of a document vector, you only have to find them in the attribute jungle: filter on the name of the class variable within the Attribute column;
- when dealing with several hundred terms, the final formula takes into account not only the terms flagged as present but also those flagged as absent (note: absent is not the same as missing in this context). This also explains the necessity of having nominal values such as "1" for each term in the document vector. Consequently, when classifying documents with NB, one has to reason in terms of the whole dictionary of words (i.e. the document vector feature space) and not just the specific bag of words of each document;
- the documentation referenced above uses count(class) instead of p(class) as the prior. In terms of class probabilities, this difference in formulation does not impact the outcome, since the constant factor cancels out during normalization.
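The last two points can be illustrated with a small sketch: scoring over the whole dictionary, where absent terms contribute their absence probability, and using count(class) versus p(class) as the prior. All counts and probabilities below are hypothetical.

```python
# Full-dictionary scoring sketch. Absent terms contribute p(term absent | class),
# and count(class) works as well as p(class) because normalization cancels
# the constant factor. All numbers are invented for illustration.

vocabulary = ["word1", "word2", "word3", "word4"]

class_count = {"A": 40, "B": 60}  # count(class), as in the PMML target counts

# p(term present | class); the absence probability is 1 minus this
p_present = {
    "A": {"word1": 0.2, "word2": 0.5, "word3": 0.1, "word4": 0.3},
    "B": {"word1": 0.4, "word2": 0.1, "word3": 0.2, "word4": 0.6},
}

# Document vector: nominal presence flags over the WHOLE dictionary
doc = {"word1": "1", "word2": "1", "word3": "0", "word4": "0"}

def score(cls, prior):
    s = prior[cls]
    for term in vocabulary:  # every term contributes, present or absent
        p = p_present[cls][term]
        s *= p if doc[term] == "1" else (1.0 - p)
    return s

def normalize(scores):
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

# Compare p(class) vs count(class) as the prior:
total = sum(class_count.values())
p_class = {cls: c / total for cls, c in class_count.items()}

probs_with_p = normalize({cls: score(cls, p_class) for cls in class_count})
probs_with_count = normalize({cls: score(cls, class_count) for cls in class_count})
print(probs_with_p, probs_with_count)
```

Both runs yield identical class probabilities, which is why the count(class) formulation in the documentation leads to the same outcome as using p(class).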
I hope this will help anyone who may have a similar question in future.