I've got a question about how class probabilities are calculated in the KNIME K Nearest Neighbor Classifier. I'm implementing kNN with feature projection based on class probabilities, but I get strange results. When I leave-one-out cross-validate on the first attribute of the Iris dataset (sepal length) with k=3, I obtain class probabilities for the first row of Iris-setosa=0.875, Iris-versicolor=0.125, Iris-virginica=0. I assumed that class probabilities are calculated according to the formula P(Yi|x) = Ni/k, where x is the classified instance and Ni is the number of instances of class Yi among the k nearest neighbors. It looks as if 8 neighbors were analyzed in some cases instead of 3, but why? There was no tie to break, as the final count between Iris-setosa and Iris-versicolor was 7:1.
I'll answer myself: there are no ties in the class-label counts, but there are ties in the distance values. If more instances sit at the distance of the k-th nearest neighbor than fit in the window, the k-window is expanded to include all equally distant instances. That's not a standard solution, and it's a shame it isn't explained in the node description. I found a hint in the documentation of the R package knnflex, which has a parameter for choosing how to resolve such ties (take random instances, or expand the k-window to take them all).
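To illustrate the behavior I mean, here is a minimal Python sketch of kNN class probabilities with the expanded k-window. This is my reconstruction of the described behavior, not KNIME's actual code; the function name and the toy data are made up for the example:

```python
from collections import Counter
import math

def knn_probs_expanded(train, labels, x, k):
    """Class probabilities from kNN, where the window is expanded to
    include every training instance tied with the k-th nearest distance.
    (Sketch of the behavior described above, not KNIME's implementation.)"""
    # sort all training instances by distance to the query point x
    dists = sorted((math.dist(t, x), y) for t, y in zip(train, labels))
    # distance of the k-th nearest neighbor
    kth = dists[k - 1][0]
    # keep every instance whose distance does not exceed that distance;
    # on ties this yields MORE than k neighbors
    window = [y for d, y in dists if d <= kth]
    counts = Counter(window)
    return {c: counts[c] / len(window) for c in counts}

# Toy 1-D example mimicking the Iris sepal-length case: eight training
# points at distance 0 from the query, so the window grows from 3 to 8.
train = [[5.1]] * 8 + [[5.4], [6.0]]
labels = ["setosa"] * 7 + ["versicolor", "versicolor", "virginica"]
probs = knn_probs_expanded(train, labels, [5.1], k=3)
# probs == {"setosa": 0.875, "versicolor": 0.125}
```

With a single attribute there are many duplicate values, so such distance ties happen constantly, which explains the 7:1 (= 0.875/0.125) split I observed for k=3.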
The default approach is, to the best of my knowledge, to use all items within the distance of the k-window, since anything else would be a heuristic as well. Choosing randomly would decrease the probability of choosing the correct one even more.
Thank you for pointing out this gap in the description; we will add the information in our next release.