Some bugs in the k-Means node

redbird · June 4, 2014, 2:02am

I was recently developing a node that contains a column filter box and an "always include all columns" option. Since I wanted to know how the default KNIME nodes with this option and a column filter behave when the input data changes, I used the k-Means node as an implementation reference. However, this node also seems to show some strange behavior when the input data is changed.

The famous iris data set was used to produce the following errors:

Case 1 ("Triggering a null pointer exception"):

Create a new k-Means node and filter out one column.
(Optionally, execute the node.)
Filter some attributes with a column filter (e.g. so that only one column is left) and connect its output to the k-Means node.
The error "ERROR k-Means Configure failed (NullPointerException): null" is thrown.
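One plausible cause for this kind of exception is that the configure step looks up a previously stored column name in the new input and dereferences the result without checking it. This is only a guess, since the k-Means source was not inspected for this report. The following is a minimal sketch of a defensive resolution step in plain Java. It does not use the KNIME API: the class `ColumnSelectionSketch`, the method `resolveColumns` and the iris column names are hypothetical stand-ins for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a node's configure step; not the KNIME API. */
class ColumnSelectionSketch {

    /**
     * Resolves the columns to use against the columns of the current input.
     * Stored names that no longer exist in the input are skipped instead of
     * being looked up and dereferenced.
     */
    static List<String> resolveColumns(List<String> inputColumns,
                                       List<String> storedSelection,
                                       boolean alwaysIncludeAll) {
        if (inputColumns == null) {
            throw new IllegalArgumentException("No input table specification available");
        }
        // "Always include all columns" ignores whatever was stored earlier.
        if (alwaysIncludeAll || storedSelection == null) {
            return new ArrayList<>(inputColumns);
        }
        List<String> resolved = new ArrayList<>();
        for (String name : storedSelection) {
            if (inputColumns.contains(name)) {
                resolved.add(name);
            }
        }
        // Fail with a readable message rather than a NullPointerException.
        if (resolved.isEmpty()) {
            throw new IllegalArgumentException(
                "None of the configured columns exist in the input: " + storedSelection);
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Assumed scenario: the upstream column filter leaves a single column.
        List<String> input = Arrays.asList("petal width");
        // Assumed stored selection from the earlier configuration.
        List<String> stored = Arrays.asList("sepal width", "petal length", "petal width");
        System.out.println(resolveColumns(input, stored, false));
    }
}
```

In a real node the failure case would presumably be reported through the framework's own settings exception rather than `IllegalArgumentException`; that mapping is left out here.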

Case 2 ("Creating a false PMML description"):

This is a smaller error that leads to a different PMML description even though the same variables were used.

First, use a column filter and, as before, filter out all columns except one.
Connect its output to the k-Means node and execute the node.
Now connect your data source (e.g. filter) directly to the k-Means node and execute it.
Have a look at the generated PMML file. It will only contain the ClusteringField entries of those columns which were specified earlier when executing k-means with the data from the column filter.

You can compare this PMML with the PMML that is generated after opening the k-Means node dialog and pressing Apply. The latter PMML will contain the ClusteringField entries of all columns that were really used.
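To make the comparison less manual, the ClusteringField entries of two PMML documents can be listed and diffed with the JDK's built-in XML parser. The sketch below assumes each `ClusteringField` element carries its column name in a `field` attribute, as in the PMML clustering model schema. The class name `ClusteringFieldLister`, its methods and the sample PMML string are made up for illustration and are not output of the k-Means node.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper for inspecting PMML output; not part of KNIME. */
class ClusteringFieldLister {

    /** Returns the "field" attribute of every ClusteringField element, in document order. */
    static List<String> clusteringFields(String pmmlXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(pmmlXml.getBytes(StandardCharsets.UTF_8)));
            // "*" matches the element regardless of the PMML version namespace.
            NodeList nodes = doc.getElementsByTagNameNS("*", "ClusteringField");
            List<String> fields = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                fields.add(((Element) nodes.item(i)).getAttribute("field"));
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse PMML", e);
        }
    }

    /** Returns the entries of expected that do not occur in actual. */
    static List<String> missingFrom(List<String> expected, List<String> actual) {
        List<String> missing = new ArrayList<>(expected);
        missing.removeAll(actual);
        return missing;
    }

    public static void main(String[] args) {
        // Hand-written sample standing in for the PMML produced in case 2.
        String pmml = "<PMML><ClusteringModel>"
            + "<ClusteringField field=\"petal width\"/>"
            + "</ClusteringModel></PMML>";
        // Assumed list of columns that were really used for clustering.
        List<String> used = Arrays.asList("sepal length", "petal width");
        System.out.println("Listed in PMML: " + clusteringFields(pmml));
        System.out.println("Missing from PMML: " + missingFrom(used, clusteringFields(pmml)));
    }
}
```

For a PMML file on disk, the same `DocumentBuilder` can parse a `java.io.File` directly instead of a string.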

I hope this report is helpful. I am still wondering whether there is a more suitable node implementation that can serve as the default reference for nodes with a column filter and the "always include all columns" option.