Differences in implementation of kNN in KNIME and WEKA

I've found something strange: the support values returned by KNIME's K Nearest Neighbor node differ from those returned by WEKA's IBk on the same data set (nominal class, same k value, no distance weighting). The more samples the training set contains, the more similar the results become. In theory the support value of class c should be the ratio of the number of neighbouring samples belonging to class c to the total number of neighbouring samples, i.e. v(c) = nn_c / nn.
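To make the expected behaviour concrete, here is a minimal sketch of that unweighted estimate: take the k nearest training points and divide the per-class counts by the number of neighbours considered. The data, labels, and query value are made up for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class KnnSupport {
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 6, 7};             // hypothetical 1-D training points
        String[] y = {"A", "A", "B", "B", "A", "B"}; // their class labels
        double query = 4.5;
        int k = 3;

        // sort training indices by distance to the query point
        Integer[] idx = new Integer[x.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> Math.abs(x[i] - query)));

        // count class memberships among the k nearest neighbours
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < k; i++) counts.merge(y[idx[i]], 1, Integer::sum);

        // v(c) = nn_c / nn, with nn = k in the tie-free case
        counts.forEach((c, n) ->
            System.out.printf("v(%s) = %d/%d = %.3f%n", c, n, k, n / (double) k));
    }
}
```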

The second strange thing is that I couldn't find any information on how these values are calculated.

Please see the attached file for an illustration of the issue.

I don't know how WEKA computes the probabilities, but KNIME's kNN node computes them exactly as you described (in the non-weighted case). In your example, however, many query points have more than one 3rd neighbour: for query point 4, the neighbours are 4 (as first neighbour), 3 (as second), and both 2 and 6 tied as third. In such cases we take all 3rd neighbours into account, so there are four neighbours in total rather than three, which yields probabilities that are multiples of 1/4.
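A short sketch of that tie-inclusive selection: every point whose distance equals the k-th smallest distance is kept, so the denominator can exceed k. The values mirror the example above (query 4, training points 2, 3, 4, 6, k = 3); the rest is illustrative.

```java
import java.util.Arrays;

public class KnnTies {
    public static void main(String[] args) {
        double[] train = {2, 3, 4, 6};
        double query = 4;
        int k = 3;

        // distance of every training point to the query
        double[] dist = new double[train.length];
        for (int i = 0; i < train.length; i++) dist[i] = Math.abs(train[i] - query);

        // distance of the k-th nearest point
        double[] sorted = dist.clone();
        Arrays.sort(sorted);
        double kthDist = sorted[k - 1];

        // keep every point at or within that distance, ties included
        int nn = 0;
        for (double d : dist) if (d <= kthDist) nn++;

        System.out.println("neighbours counted: " + nn); // prints 4, hence multiples of 1/4
    }
}
```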

WEKA appears to add a component that depends on the training set cardinality: when the training set is small, the class probabilities are lower. It is probably a 1/n factor, since the differences between the WEKA and KNIME implementations of kNN shrink as the training set grows.
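One estimate consistent with this observation is a Laplace-style smoothing in which each class starts with a pseudo-weight of 1/n (n = training set size) before the neighbour votes are added, so the correction vanishes as n grows. The sketch below is an assumption about what WEKA might be doing, not a quote of IBk's actual source.

```java
import java.util.Arrays;

public class SmoothedSupport {
    // Hypothetical 1/n-smoothed estimate: each class gets a pseudo-weight
    // of 1/n on top of its neighbour count, and the total is normalised
    // accordingly. Assumed behaviour, not WEKA's confirmed implementation.
    static double[] smoothedSupport(int[] classCounts, int n) {
        int numClasses = classCounts.length;
        int nn = Arrays.stream(classCounts).sum();       // neighbours counted
        double total = nn + numClasses / (double) n;     // votes + pseudo-weights
        double[] p = new double[numClasses];
        for (int c = 0; c < numClasses; c++) {
            p[c] = (classCounts[c] + 1.0 / n) / total;
        }
        return p;
    }

    public static void main(String[] args) {
        // 2 of 3 neighbours in class 0; compare a small and a large training set
        System.out.println(Arrays.toString(smoothedSupport(new int[]{2, 1}, 10)));
        System.out.println(Arrays.toString(smoothedSupport(new int[]{2, 1}, 10000)));
    }
}
```

With n = 10 the estimate for class 0 comes out around 0.656 instead of the unsmoothed 2/3, and with n = 10000 it is practically 2/3, matching the pattern that the discrepancy shrinks with training set size.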