A hierarchical tree is normally built after computing a distance matrix between all the leaves of the tree (and in agglomerative clustering these distances are recomputed at every level of the tree, effectively one distance matrix per level). In your case that would mean computing the distance matrix between all your columns, each considered as a point in a 250,000-dimensional space.
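To make that concrete, here is a minimal sketch of what "a distance matrix between columns" means, using SciPy. The array sizes are toy stand-ins (1,000 rows instead of your 250,000), and the variable names are just for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
data = rng.random((1000, 8))   # toy stand-in: 1,000 rows x 8 columns

# To treat each COLUMN as a point, transpose first, then compute
# pairwise Euclidean distances between those points.
dist = squareform(pdist(data.T, metric="euclidean"))

print(dist.shape)              # one row/column per original data column
```

With your real data the points would live in a 250,000-dimensional space, which is exactly where the problem below kicks in.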
Please have a look at the definition of the "Curse of Dimensionality" (Wikipedia: http://en.wikipedia.org/wiki/Curse_of_dimensionality). In essence, it says that distances computed in a very high-dimensional space (which is your case) become nearly meaningless, so the tree you would get won't make much sense.
If you want to implement Aaron's suggestion (which I think is perfectly right), you first need to reduce the dimensionality of your columns to a reasonable size while preserving the underlying relationships between them. My suggestion is to run a k-means clustering of your data (with, for instance, K=1000) before transposing the table. Then convert the k-means centroids into a table (1000 rows x #columns), transpose that table (#columns x 1000 rows), and finally perform the hierarchical clustering on the transposed table of k-means centroids. If clustering your 250,000 rows into 1000 clusters with k-means makes sense, you should get a pretty nice tree representing the hierarchical organization of your initial columns.
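The whole pipeline can be sketched in a few lines with scikit-learn and SciPy. This is only an illustration, with toy sizes (5,000 rows and K=50 instead of 250,000 rows and K=1000):

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
data = rng.random((5000, 8))     # toy stand-in for your 250,000 rows x #columns

# Step 1: k-means on the rows (use K=1000 on the real data).
k = 50
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
centers = km.cluster_centers_    # K rows x #columns

# Step 2: transpose, so each original column becomes a point
# described by only K coordinates instead of 250,000.
transposed = centers.T           # #columns x K

# Step 3: hierarchical clustering on the reduced representation
# (Ward linkage here; other linkages work too).
Z = linkage(transposed, method="ward")
# Z encodes the merge tree; feed it to scipy.cluster.hierarchy.dendrogram
# to draw the hierarchy of your columns.
```

One design note: whether Euclidean distance and Ward linkage are the right choices depends on what your columns represent; the key point is only that the hierarchical step now runs in K dimensions rather than 250,000.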