Distance matrix feature requests

richards99 · March 8, 2014, 10:05am

Hi, the distance matrix tools have undergone some really nice improvements recently, with the very useful pair extractor and similarity search. However, it would be nice to have some further advancements:

- Clustering is limited to k-medoids and hierarchical. Can k-means and fuzzy c-means be extended to distance matrix columns? Also, is there any possibility of a clustering option where you specify the desired cluster distance and it generates the most suitable number of clusters (i.e. rather than specifying the number of clusters), and which can handle large data sets?

- A really useful piece of missing functionality would be a diversity selection node, which takes the distance matrix and selects the n most diverse rows from the table.
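To make the first request concrete, here is a minimal sketch of threshold-based clustering, where a distance cut-off rather than a cluster count is given. It uses a simple single-pass "leader" scheme, which is only one of several possible approaches and is not an existing KNIME node; the function name `leader_clustering` and the toy matrix are made up for illustration.

```python
def leader_clustering(dist, threshold):
    """Single-pass leader clustering on a full distance matrix.

    Each row joins the first existing leader within `threshold`;
    otherwise it becomes the leader of a new cluster.
    (Hypothetical helper, for illustration only.)
    """
    leaders = []  # row indices acting as cluster representatives
    labels = []   # cluster id assigned to each row
    for i in range(len(dist)):
        for cluster_id, leader in enumerate(leaders):
            if dist[i][leader] <= threshold:
                labels.append(cluster_id)
                break
        else:
            # no leader is close enough, so start a new cluster
            leaders.append(i)
            labels.append(len(leaders) - 1)
    return labels, leaders


# Toy symmetric distance matrix (assumed data)
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
labels, leaders = leader_clustering(dist, threshold=0.3)
print(labels, leaders)
```

The number of clusters falls out of the threshold, and each row is only compared against the current leaders, which is why this kind of scheme is often used on larger data sets. The result depends on row order, though.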

I would also probably recommend moving the distance matrix category from Misc to Mining.

thanks,

simon.

thor · March 10, 2014, 9:51am

Concerning diversity selection... What kind of diversity do you have in mind? There are quite a few definitions of diversity, and all distance-based diversity measures lead to NP-hard optimization problems (see e.g. http://dx.doi.org/10.1021/ci100426r). Having said that, there are a few nodes in KNIME that can do diversity selection, e.g. the multiobjective row selection node or the Score Erosion node (just use it without scores). Both approaches are described in the linked article.
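Because the exact problem is NP-hard, diversity selection is usually done heuristically. As a rough illustration, here is a sketch of a greedy MaxMin pick on a distance matrix. This is a generic heuristic, not the algorithm behind either of the nodes mentioned above; `greedy_maxmin` and the toy data are assumptions made for the example.

```python
def greedy_maxmin(dist, n, start=0):
    """Greedily pick up to n rows, each time adding the row that is
    farthest from its nearest already-selected row.
    (Hypothetical helper; a heuristic, not an exact solver.)
    """
    size = len(dist)
    selected = [start]
    # min_dist[i] = distance from row i to its nearest selected row
    min_dist = list(dist[start])
    while len(selected) < min(n, size):
        candidate = max(
            (i for i in range(size) if i not in selected),
            key=lambda i: min_dist[i],
        )
        selected.append(candidate)
        for i in range(size):
            min_dist[i] = min(min_dist[i], dist[i][candidate])
    return selected


# Toy data: five points on a line, distance = absolute difference
points = [0, 1, 2, 9, 10]
dist = [[abs(a - b) for b in points] for a in points]
picked = greedy_maxmin(dist, n=3)
print(picked)
```

The result depends on the starting row, and there is no guarantee that the picked subset is optimal for any particular diversity measure.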

richards99 · March 11, 2014, 11:45pm

Hi Thor,

Many thanks for the reference, what a fantastic read. Very useful.

In reference to that, something like the p-centre diversity measure would be ideal, but the p-dispersion min-sum should also be suitable. I am performing some vHTS analysis, as you discuss in the paper.

The Score Erosion node certainly seems up to the task for what I need.

I notice in the article you calculate the numerous different diversity scores around p-centre and the various p-dispersion measures. Are there tools in KNIME for calculating these scores?

Cheers,

Simon.

thor · March 12, 2014, 11:39am

Yes, there are nodes for calculating these measures :-) The Optimization extension from Labs contains the Multiobjective Score Computation node. It needs a table with a column containing collections of row IDs and, in your case, a second table with a distance matrix column (with distances between all row IDs). In the node's dialog you can enter several expressions. The names are a bit different from the ones in the paper, but they are all explained.

There is an (updated) example workflow 014_Optimization/014002_SubsetSelection on our public example server that also shows how to use the score computation node.
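For readers without the extension at hand, here is a small sketch of how such scores can be computed for one subset of rows from a full distance matrix. The definitions follow the way these measures are commonly stated and are an assumption here; the function names are made up, and this is not the node's implementation or its expression syntax.

```python
from itertools import combinations


def p_dispersion(dist, subset):
    """Smallest pairwise distance within the subset (higher = more diverse)."""
    return min(dist[a][b] for a, b in combinations(subset, 2))


def p_dispersion_sum(dist, subset):
    """Sum of all pairwise distances within the subset."""
    return sum(dist[a][b] for a, b in combinations(subset, 2))


def p_dispersion_min_sum(dist, subset):
    """Sum over subset members of the distance to their nearest other member."""
    return sum(min(dist[a][b] for b in subset if b != a) for a in subset)


def p_centre(dist, subset):
    """Largest distance from any row to its nearest selected row (lower = better coverage)."""
    return max(min(dist[i][s] for s in subset) for i in range(len(dist)))


# Toy data: five points on a line, distance = absolute difference
points = [0, 1, 2, 9, 10]
dist = [[abs(a - b) for b in points] for a in points]
subset = [0, 2, 4]  # row indices of the selected subset (needs at least two rows)

print(p_dispersion(dist, subset))
print(p_dispersion_sum(dist, subset))
print(p_dispersion_min_sum(dist, subset))
print(p_centre(dist, subset))
```

Note that p-centre looks at all rows, selected or not, while the p-dispersion variants only look inside the subset.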

richards99 · March 12, 2014, 2:28pm

Brilliant, this is perfect. Is there anything KNIME cannot do?

Simon.

thor · March 12, 2014, 6:08pm

We are still working on making coffee...