cluster using Gower metric

noe91 · January 5, 2017, 12:01pm

Good afternoon!

I am trying to cluster data with nominal and numeric attributes, and I found that the Gower metric can help me. I also know that it is possible to run R scripts in KNIME, but I could not find out how to do it.

So I would like to include this code in KNIME:

library(cluster)

# Dissimilarity matrix calculation
daisy.mat <- as.matrix(daisy(my.dataset, metric = "gower"))

# Clustering by pam algorithm
k <- 3  # placeholder: set this to the desired number of clusters
my.cluster <- pam(daisy.mat, k = k, diss = TRUE)

# Cluster plot
clusplot(daisy.mat, my.cluster$clustering, diss = TRUE, color = TRUE)

Is it possible? How can I load the package? Is there a better solution?

Thank you in advance,

best

Geo · January 5, 2017, 3:24pm

R integration

KNIME offers a convenient integration with R, but you'll have to split your code across separate R nodes:

First, use R to R for general tasks such as the dissimilarity calculation and the clustering with pam.
Then connect it to the other R nodes:
- R View (Workspace) for the plot;
- R to Table for the dissimilarity matrix (use the code as.data.frame(daisy.mat));
- for the clustering output, it depends on the type of object that pam returns. If plain text is enough, include print(my.cluster) in the code, and the R Std Output of the R to R node will contain the text printed during execution.
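As a rough sketch of how the original script could be split across those nodes (not a tested workflow): the comments mark which part would go into which node. The names knime.in and knime.out are the variables I would expect KNIME's R nodes to use for table input and output, so treat them as an assumption and check the node descriptions. The toy data frame is made up and only there so the script also runs outside KNIME, and k = 2 is an arbitrary choice.

```r
# Outside KNIME only: made-up stand-in for the input table.
# Inside KNIME the table would be supplied by the node (assumed name: knime.in).
knime.in <- data.frame(
  colour = factor(c("red", "red", "red", "blue", "blue", "blue")),
  size   = c(1.0, 1.2, 0.9, 5.0, 5.5, 4.8)
)

# --- Node that builds the R workspace (dissimilarities + clustering) ---
library(cluster)
my.dataset <- knime.in
daisy.mat  <- as.matrix(daisy(my.dataset, metric = "gower"))
my.cluster <- pam(daisy.mat, k = 2, diss = TRUE)
print(my.cluster)  # printed text ends up in the node's standard output

# --- R View (Workspace): the plot ---
clusplot(daisy.mat, my.cluster$clustering, diss = TRUE, color = TRUE)

# --- R to Table: the dissimilarity matrix as a KNIME table ---
knime.out <- as.data.frame(daisy.mat)
```

If you want the cluster assignments as a table rather than as printed text, something like data.frame(cluster = my.cluster$clustering) in a second R to Table node should also work.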

The package "Rserve" needs to be installed in your R installation, and ideally (but not necessarily) the other R packages you use are installed in the R installation folder as well (if you're using Windows). The KNIME forum contains quite a few threads on the technical side of setting up the R integration for a KNIME workflow.

KNIME alternatives?

According to this page, provided its information is accurate, the Gower similarity coefficient appears to be a sort of convenience wrapper for calculating the similarity between observations characterised by mixed-type variables. It may be worth a shot, albeit a burdensome one, to implement the distance calculation yourself using KNIME nodes and see whether the metric really fits your exploration goals.

For example, influential ("outlying") observations on continuous variables will be effectively flattened, because Gower compares continuous values relative to the variable's min-max range. This may or may not be a good thing, depending on your exploration goals.
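To see that effect on a small case, here is a minimal sketch with made-up numbers. For a single numeric column, daisy with metric = "gower" divides each absolute difference by the column's range, so one extreme value stretches the range and squeezes the distances between all the ordinary values.

```r
library(cluster)

# Made-up numeric variable: four ordinary values and one outlier.
toy <- data.frame(x = c(1, 2, 3, 4, 100))

# Gower dissimilarity as computed by daisy().
d <- as.matrix(daisy(toy, metric = "gower"))

# The same thing by hand: |x_i - x_j| / (max(x) - min(x)).
manual <- as.matrix(dist(toy$x, method = "manhattan")) / diff(range(toy$x))

# Both computations should agree.
same <- isTRUE(all.equal(d, manual, check.attributes = FALSE))

# Neighbouring ordinary values end up about 0.01 apart (1/99),
# while the outlier sits at distance 1 from the smallest value.
print(round(d, 3))
```

Whether that compression is acceptable depends on how much the differences among the ordinary values matter for your clusters.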

Transforming your nominal variables into numeric ones (via dummy coding and, if necessary, PCA), then applying the more traditional transformations (e.g. z-score) and distance calculations (Euclidean, Manhattan, etc.) may provide a decent alternative.
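In R terms, that route might look roughly like the sketch below (the same steps can be built from KNIME nodes). The data are made up, PCA is left out, and z-scoring the dummy columns along with the numeric one is just one possible choice, not a recommendation.

```r
# Made-up mixed-type data.
toy <- data.frame(
  colour = factor(c("red", "blue", "red", "green")),
  size   = c(1.0, 2.5, 3.0, 10.0)
)

# Dummy coding: one 0/1 indicator column per level of the nominal variable.
dummies <- model.matrix(~ colour - 1, data = toy)

# z-score every column (mean 0, standard deviation 1).
z <- scale(cbind(dummies, size = toy$size))

# Euclidean distances on the transformed data;
# method = "manhattan" would be the other option mentioned.
d <- dist(z, method = "euclidean")
print(round(d, 2))
```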

You could even apply separate transformations and distance calculations for each variable type and then combine those using the Aggregated Distance node: similar to Gower, but according to your own flavour ;-)
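The idea behind that combination, sketched in R rather than with the Aggregated Distance node itself: compute one distance per variable type, bring them onto a comparable scale, and take a weighted sum. The data, the 0.5/0.5 weights, the rescaling by the maximum, and k = 2 are all arbitrary assumptions for illustration.

```r
library(cluster)

# Made-up mixed-type data.
toy <- data.frame(
  colour = factor(c("red", "blue", "red", "green")),
  size   = c(1.0, 2.5, 3.0, 10.0)
)

# Numeric part: z-score, Euclidean distance, rescaled to [0, 1].
d.num <- dist(scale(toy$size))
d.num <- d.num / max(d.num)

# Nominal part: simple matching (0 if the labels agree, 1 otherwise).
lab   <- as.character(toy$colour)
d.nom <- as.dist(outer(lab, lab, "!=") * 1)

# Weighted combination "according to your own flavour".
w      <- c(num = 0.5, nom = 0.5)
d.comb <- w[["num"]] * d.num + w[["nom"]] * d.nom

# The combined dissimilarities can go straight into pam.
my.cluster <- pam(d.comb, k = 2, diss = TRUE)
print(my.cluster$clustering)
```

Changing the weights is the knob that Gower does not give you by default.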

noe91 · January 5, 2017, 4:39pm

thank you!