k-Means: how to customize?

Please advise where to look for documentation on adjusting k-means clustering for:

1) Fixed-length bit vectors.

1.1) How to format an input file of ASCII text representing bit vectors as fixed-length strings of zeroes and ones, such as:

010001000100101    1

110010101011000    2

...

001010111100001   100

where the first string is a binary vector and the second is a numeric ID.

1.2) How to make k-means use Hamming distance for bit vectors?

2) How to format an input file of ASCII text representing sparse floating-point vectors whose components have names of the form "__n__m"? For example, here is a single vector with 9 components:

        "__176__177" : 0.0006153676851746713,
        "__101__19" : 0.000013023706311005352,
        "__289__290" : 9.643125665067138e-7,
        "__261__6" : 0.00000894408087718124,
        "__164__59" : 0.0004399090771997391,
        "__17__4" : -3.7288170254628644e-10,
        "__279__280" : 0.0037703221385939046,
        "__108__109" : 0.04906173622636322,
        "__157__59" : 0.000005552830476144654,

Other vectors may have a different number of components. The maximum number of components is not known a priori.
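As an illustration of the layout problem, here is a rough sketch in plain Python (not KNIME) of how such sparse vectors could be aligned on one shared set of columns once all component names have been seen. The function name `to_dense`, the zero fill value, and the grouping of the sample components into two vectors are assumptions made for the example only.

```python
def to_dense(vectors, fill=0.0):
    """Align sparse {name: value} vectors on a shared, sorted list of names.

    Components missing from a vector are filled with `fill` (an assumption;
    whether 0.0 is the right default depends on the data).
    """
    names = sorted({name for vec in vectors for name in vec})
    rows = [[vec.get(name, fill) for name in names] for vec in vectors]
    return names, rows


# Component values are taken from the example above; splitting them into
# two separate vectors is made up for illustration.
vectors = [
    {"__176__177": 0.0006153676851746713, "__17__4": -3.7288170254628644e-10},
    {"__17__4": -3.7288170254628644e-10, "__108__109": 0.04906173622636322},
]

names, rows = to_dense(vectors)
for row in rows:
    print(dict(zip(names, row)))
```

The column set is only known after a full pass over the data, which is why the sketch collects every name first and builds the rows afterwards.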

Thanks!

For fixed-length bit vectors, try the Fingerprint Expander node from the Erlwood community release nodes. It will convert all the bits into individual columns.
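To show what "individual columns" means here, a minimal sketch in plain Python follows. It is not the Fingerprint Expander's code, just the same idea; the function name `expand_bits` is hypothetical.

```python
def expand_bits(bits: str) -> list[int]:
    """Turn a string such as "0101" into [0, 1, 0, 1], one value per column."""
    if set(bits) - {"0", "1"}:
        raise ValueError(f"not a bit string: {bits!r}")
    return [int(ch) for ch in bits]


# Two of the rows from the question, keyed by their numeric id.
rows = {1: "010001000100101", 2: "110010101011000"}
table = {row_id: expand_bits(bits) for row_id, bits in rows.items()}

for row_id, columns in table.items():
    print(row_id, columns)
```

Each 15-character string becomes 15 numeric columns, which is the shape a k-means implementation working on numeric columns would expect.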

Does this help?

Simon.

Thanks!

Where can I read a description of the text formats that the "File Reader" node supports?

Also, I plan to use the KNIME library to write my own app that does k-means, cross-validation, etc. Are there any examples of writing such apps?

Simply use the File Reader node. If the format is as shown above, with the two columns separated by spaces, it should detect the two columns automatically.

However, read the bit vectors in as a string column: in the preview window, right-click the column containing the bit vectors and choose to parse it as a string column, not a double column.
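The reason for insisting on a string column can be seen in a small sketch outside KNIME. This is plain Python, the function name `read_bit_vectors` is hypothetical, and the integer conversion at the end only stands in for what any numeric parse would do to the leading zeros.

```python
def read_bit_vectors(lines):
    """Parse lines of the form "<bits> <id>" separated by whitespace.

    The bit vector is kept as a string; only the id is converted to a number.
    Blank lines and anything that is not exactly two fields are skipped.
    """
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue
        bits, row_id = fields
        records.append((bits, int(row_id)))
    return records


sample = ["010001000100101    1", "", "110010101011000    2"]
records = read_bit_vectors(sample)
print(records)

# Parsed as a number, the first vector loses its leading zero and is no
# longer 15 characters long.
as_number = str(int("010001000100101"))
print(len("010001000100101"), len(as_number))
```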

Now use the BitVector Generator node: select Bits From String column and choose the column containing the bit vectors. For the kind of string representation you should choose BIT, but there appears to be a bug in the node, and you need to choose HEX to get bits out. I'll report this bug!

Now you can connect the Erlwood Fingerprint Expander node to convert this bit vector column into individual bit columns for use in k-means.
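On the Hamming distance question: once the bits are separate 0/1 columns, the squared Euclidean distance between two rows equals their Hamming distance, because each differing position contributes exactly 1. The sketch below (plain Python, hypothetical helper names) checks this on two rows from the question. I am assuming here that k-means works with Euclidean distance on the numeric columns. Note that cluster centres are column means rather than bit vectors, so the equivalence holds between two rows, not between a row and a centre.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def squared_euclidean(a, b):
    """Sum of squared differences between two equal-length numeric sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = [int(c) for c in "010001000100101"]
b = [int(c) for c in "110010101011000"]

print(hamming(a, b), squared_euclidean(a, b))
```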

Hope this helps,

Simon.