Manipulating Hierarchical Clustering (DistMatrix) results

atomcrat · August 10, 2015, 9:31am

The goal is to cluster binary flags (more than 100 of them) in data entries and then use those clusters to classify the entries. I wrote the logic to build a distance matrix by hand and used the Hierarchical Clustering (DistMatrix) node to cluster the flags. Unfortunately, the output format is a strange data type called DistanceVectorDataCell, which is accepted only by a viewer node and the Model Writer. This is supposed to be a collection of DoubleCells, but no node can split them. The Model Writer writes binary-format FILE files. How can I extract the hierarchical tree? First, it is far too large to be viewed conveniently in KNIME's viewer. Second, I want to prune and process it manually using regular KNIME logic. I already know from the nature of the data that many of the classifications will be spurious or misleading and will need to be converted by hand into an actually useful classification, which will then be reused in another context.

For the Distance Matrix, the format is readable by Split Collection Column, and it can be written with CSV Writer and then read by Distance Matrix Reader. Are there similar nodes for the model? Are any in development? Currently the only method I have is to manually copy the results from the viewer, which is probably not what the developers intended.

I could try the Hierarchical Clustering node, but the data is binary and categorical, so the distances from e.g. flag 14 to flag 15 or flag 103 are meaningless, and the Euclidean distances make no sense. Currently I use Tanimoto distances with some multidimensional reprocessing. The issue with the data is indeed this: the flags are largely meaningless even by themselves, and need to be grouped into a smaller set of hierarchical nodes before they can be used to classify the data in a manner that makes sense to a human reader. I could do this by hand, but the number of flags is rather large for that, and it risks inconsistency.
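For readers unfamiliar with the Tanimoto distance mentioned above, here is a minimal Python sketch of how it can be computed between binary flags. It is not the poster's actual workflow: the flag names, the sample data, and the helper names `tanimoto_distance` and `distance_matrix` are all hypothetical. Each flag is represented as the set of entry ids in which it is set.

```python
def tanimoto_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of entry ids."""
    union = len(a | b)
    if union == 0:
        # Assumed convention: two flags that are never set count as identical.
        return 0.0
    return 1.0 - len(a & b) / union


def distance_matrix(flag_sets):
    """Return (sorted flag names, square matrix of pairwise distances)."""
    names = sorted(flag_sets)
    matrix = [[tanimoto_distance(flag_sets[p], flag_sets[q]) for q in names]
              for p in names]
    return names, matrix


# Hypothetical data: flag name -> ids of the entries where the flag is set.
flags = {
    "flag_14": {1, 2, 3},
    "flag_15": {2, 3, 4},
    "flag_103": {7, 8},
}
names, matrix = distance_matrix(flags)
for name, row in zip(names, matrix):
    print(name, row)
```

The distance depends only on which entries share a flag, not on the flag numbers, which is why it suits categorical flags better than Euclidean distance.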

thor · August 10, 2015, 9:52am

Did you have a look at the Distance Matrix Writer?

atomcrat · August 10, 2015, 10:42am

Hi,

The input to Distance Matrix Writer is a regular table, symbolized by a white arrowhead. Inside this table there is a DistanceVectorDataCell. The output from Hierarchical Clustering (DistMatrix), however, is a gray square. Apart from the Hierarchical Cluster Assigner and the Viewer, Model Writer is the only node that accepts the gray square as an input. I didn't find any nodes that can convert the gray square into a white arrowhead.

The same problem appears with Distance Matrix, but can be circumvented.

Best regards

thor · August 10, 2015, 11:57am

The output of the hierarchical clustering node is an internal tree model without much use except for showing a dendrogram and assigning data points to clusters. It's not meant to be processed further. Also, it's quite hard to convert a tree into a table in a sensible way.

atomcrat · August 10, 2015, 1:58pm

As indicated in the original message, KNIME is not a good viewer, and even if it were, it would still be necessary to manually edit the classification before it is practically useful. For example, it seems to misclassify many flags - because the training data is imperfect - that would need to be moved manually to the right classes. The Assigner is not useful unless the model is correct, and if the tree cannot be edited, the whole Hierarchical Clustering family of nodes has no actual use. I am trying to quickly produce "natural" clusters that can then be considered for devising a hierarchy of "normative" clusters.

In practice, the flag names are so long that the Viewer abbreviates them. This can't be allowed; they must be displayed in full. But this is beside the point - I was looking for a programmable tool, not a viewer. I could find another viewer, or just as well write the viewer myself, for instance.

Representing hierarchies as tables is certainly possible and has advantages that may not be apparent when simply trying to view the cluster.
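As an illustration of one possible table form, the Python sketch below flattens a nested tree into parent/child rows. The tree layout (dicts with optional "label" and "children" keys), the flag names, and the function `tree_to_rows` are assumptions made for this example; they do not reflect KNIME's internal model format.

```python
def tree_to_rows(node, parent=None, rows=None):
    """Flatten a nested tree into (node_id, parent_id, label) rows.

    Ids are assigned in depth-first order; the root has parent_id None,
    and inner nodes have label None.
    """
    if rows is None:
        rows = []
    node_id = len(rows)
    rows.append((node_id, parent, node.get("label")))
    for child in node.get("children", []):
        tree_to_rows(child, node_id, rows)
    return rows


# Hypothetical three-leaf hierarchy.
tree = {"children": [
    {"label": "flag_A"},
    {"children": [{"label": "flag_B"}, {"label": "flag_C"}]},
]}
rows = tree_to_rows(tree)
for row in rows:
    print(row)
```

In this form, moving a misplaced flag to another class amounts to changing one parent_id value, which ordinary table operations can do.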

aborg · August 10, 2015, 4:45pm

Hello,

There is another viewer for hierarchical clustering, called HiTS. You need the latest feature, ie.tcd.imm.hits.exp.feature (plus ie.tcd.imm.hits.feature and ie.tcd.imm.hits.3rdparty.feature, as those are its dependencies).

What its hierarchical clustering nodes are able to do:

visualize with a heatmap
sort the branches in order to minimize the sum of the distance between the leaves
sort the branches in the opposite order
sort the rows based on the order of leaves in the hierarchical tree.

It is moving to https://github.com/aborg0/hits/, where lots of the obsolete things will be removed (I should create a deprecated plugin/feature for them) and the rest restructured. In case you need something changed, please create a ticket on GitHub. (Please do not expect a release before the next KNIME release; I am quite busy now.)

Thanks, gabor

atomcrat · August 12, 2015, 12:05pm

aborg, thank you for the link.

The viewer is indeed better. Its export image function can print out the whole tree rather than just a screenshot. I could print my tree, which is too large for a screen and barely fits on an A3.

I was not able to understand the spec for the input. I found it does accept a Distance Matrix, but I see no heatmap. What sort of data does it accept? The node description says simply "Data with same keys (in the same order) as the Cluster tree has.", but it doesn't accept multicolumn data.

But, again, the tree is not available in a reusable format. As you have developed it, how did you come to access the spec of the "gray square"? Obviously you must have been able to read the data format in order to write a viewer for it.

I don't see why this has to be so difficult, considering how easy it is with other nodes like Network. In Network, you can export the whole thing in BEEF format and do whatever you want with it. A cluster tree is just a special case of a network. Also, the algorithm works by sequentially joining two nodes into one, so its output is easily represented as a table.
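To make the "sequential joins as a table" idea concrete, here is a minimal Python sketch of agglomerative clustering that records one row per join. It assumes single linkage (the linkage used by the KNIME node is not stated in this thread), and the function name `merge_table` and the sample distance matrix are hypothetical.

```python
def merge_table(names, dist):
    """Single-linkage agglomeration over a square distance matrix.

    Returns rows (new_id, left_id, right_id, height). Leaves have ids
    0..len(names)-1; each join creates the next free id.
    """
    clusters = {i: [i] for i in range(len(names))}  # cluster id -> leaf indices
    next_id = len(names)
    rows = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                # Single linkage: distance between the closest pair of members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        rows.append((next_id, a, b, d))
        next_id += 1
    return rows


# Hypothetical distances between three flags.
names = ["A", "B", "C"]
dist = [[0.0, 0.2, 0.9],
        [0.2, 0.0, 0.6],
        [0.9, 0.6, 0.0]]
rows = merge_table(names, dist)
for row in rows:
    print(row)
```

Each row is one join, so n leaves give n - 1 rows. Pruning at a given height then means keeping only the rows whose height is below the threshold.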

Currently, I am considering either setting something up using k-Medoids, although that classification is flat, or writing the hierarchical clustering myself in Java or R.

aborg · August 12, 2015, 3:31pm

It accepts any numeric data. For example, the columns from which the distance was computed work well for the Dendrogram with Heatmap node (you can also add other columns, or remove some with Column Filter nodes).

I do not remember the details (I created these nodes years ago), but the internals needed to create a view were not always public (I think this was not working in KNIME 2.10); now they are available. This predates the network handling nodes, so do not expect a connection between the two.

The code is open source, the relevant parts are available around here.