Clustering question

Hi,

I’m sorry if this question seems quite basic, but I am new to KNIME and a little confused about some aspects of clustering.

I have a dataset with 10 categorical variables and I would like to perform clustering on this data. I have used the k-means algorithm to try to cluster the data, after one-hot encoding all of my categorical variables with the One to Many node. I set the number of clusters to 5 and got a result. I guess my question is: is this the right way to approach this problem, or should I be using k-modes clustering instead of k-means?

Example of data:

Sex, Location, Language, Process Stage

Any help would be greatly appreciated!

Eric

Hi Eric, welcome to the KNIME forum.

I guess that one-hot encoding can make the categorical variables usable with k-means, but k-means itself requires that all variables are continuous (see)
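To make the difference concrete, here is a minimal Python sketch. It is not part of the attached workflow, and the records and category levels are made up for illustration. It shows that the squared Euclidean distance between one-hot rows is just twice the number of mismatching attributes, while the k-means centroid (a column-wise mean) is a vector of fractions that no longer corresponds to any real category. k-modes instead keeps the most frequent value per attribute.

```python
from collections import Counter
from itertools import combinations


def one_hot(record, categories):
    """Encode a tuple of categorical values as a flat 0/1 vector."""
    vec = []
    for value, levels in zip(record, categories):
        vec.extend(1.0 if value == level else 0.0 for level in levels)
    return vec


def squared_euclidean(a, b):
    """Squared Euclidean distance, the quantity k-means minimises."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mismatches(a, b):
    """Simple matching dissimilarity used by k-modes: number of differing attributes."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical records (Sex, Location, Language); values are assumptions.
categories = [("F", "M"), ("North", "South", "West"), ("EN", "DE")]
records = [("F", "North", "EN"), ("M", "North", "DE"), ("F", "West", "EN")]
encoded = [one_hot(r, categories) for r in records]

for i, j in combinations(range(len(records)), 2):
    d_euclid = squared_euclidean(encoded[i], encoded[j])
    d_match = mismatches(records[i], records[j])
    print(records[i], records[j], d_euclid, d_match)

# k-means summary of the group: column-wise mean of the one-hot rows.
centroid = [sum(col) / len(encoded) for col in zip(*encoded)]
# k-modes summary of the group: most frequent value of each attribute.
mode = tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

print(centroid)
print(mode)
```

The centroid contains values such as 2/3 for "F", which is hard to interpret, whereas the mode is always a valid combination of categories.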

To be honest, I do not have much experience with k-medoids for mixed data. However, I am attaching a workflow that uses the k-prototypes method in R and calculates the “optimal” number of clusters with the silhouette coefficient, which should be complemented with other methods.

Kprototypes.knwf (26.0 KB)

The dataset bank-full.csv can be found here on Kaggle

When I was starting out, I found a nice workflow by @Fabien_Couprie that used PCA, the Calinski-Harabasz index and a nice 3D representation of the clusters. (Fabien, could you please upload it again or give us the link?)

Cheers


Hi mauuuuu5,

Thanks for the clarification.

I guess that certainly rules out k-means for clustering given my data. I’ll take a look at the workflow you linked and see how far that takes me.

Thanks for the quick reply!

w


Sure, let me know if you have any questions.

Mau

Hi Mau,

I took a look at the k-prototypes workflow example you gave, but I am a little unclear about what it is doing. Is it determining the correct number of clusters depending on which clustering algorithm is used? Also, why would it only use a sample of the records to determine the right number of clusters?

Sorry if these are basic questions.

W

Hi w0rdz
One part of the workflow calculates the “optimal” number of clusters on a sample, because of computation time. However, it is advisable to use the whole dataset, or a representative sample, to obtain the best silhouette coefficient.
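In case it helps to see why the sample matters: the silhouette coefficient compares every record with every other record, so the cost grows roughly with the square of the number of rows. Below is a minimal Python sketch of the calculation, not the R code from the workflow. The records, the labels and the choice of the simple matching distance are assumptions made for illustration.

```python
import random


def mismatches(a, b):
    """Simple matching dissimilarity: number of differing attributes."""
    return sum(x != y for x, y in zip(a, b))


def mean_silhouette(records, labels, dist=mismatches):
    """Mean silhouette coefficient for a labelling with at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the record's own cluster
        a = sum(dist(records[i], records[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(records[i], records[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical records and two candidate labellings (assumptions).
records = [("F", "North", "EN"), ("F", "North", "EN"),
           ("M", "South", "DE"), ("M", "South", "DE")]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 1, 0, 1]

print(mean_silhouette(records, good_labels))
print(mean_silhouette(records, bad_labels))

# Scoring only a random subset keeps the pairwise comparisons manageable.
rng = random.Random(0)
sample_idx = rng.sample(range(len(records)), 3)
sample_score = mean_silhouette(
    [records[i] for i in sample_idx],
    [good_labels[i] for i in sample_idx],
)
print(sample_score)
```

A score close to 1 means records sit well inside their clusters, and a negative score means many records are closer to another cluster. The score on a sample is only an estimate, which is why a representative sample matters.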

Regarding the algorithm, it is using k-prototypes throughout. In the lower part of the workflow I am using the EM algorithm provided by Weka, just to show you another approach. My apologies for not mentioning it before.

I reordered the workflow to make it more understandable.

Let me know if you have more questions

Kprototypes.knwf (29.8 KB)

Hi Mau,

First, I wanted to say thanks for the additional workflow. That really helped to explain the steps. The one question I have is about determining the number of clusters. When using your workflow I can see that the k-prototypes R snippet produces an output saying that 2 is the optimal number of clusters. Lower in the workflow you have the Weka EM and Cluster Assigner nodes, which use 5 clusters.

I thought that with the silhouette coefficient showing that 2 is the ideal number of clusters, that Weka would also only use 2 clusters…

Why are they different?

W

Hi w0rdz, nice that you found the workflow useful.

Regarding the number of clusters, the k-prototypes part uses the silhouette coefficient, while EM uses an iterative process that continues until the log-likelihood stops increasing. (see)


So they are using different metrics, and in this case they do not give the same number of clusters, I guess because the clustering algorithms differ. You can read this post about the methods to obtain the number of clusters.
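As a rough illustration of how two selection rules can disagree, here is a small Python sketch. The scores are invented numbers, not output from the workflow, and both function names are hypothetical. The first rule picks the k with the highest silhouette. The second keeps increasing k while the log-likelihood keeps rising, which is how I understand the EM stopping rule described above.

```python
def pick_k_by_max(scores):
    """Return the k with the highest score (e.g. mean silhouette)."""
    return max(scores, key=scores.get)


def pick_k_until_no_gain(scores, start=1):
    """Increase k from `start` while the score keeps rising.

    Returns the last k whose score improved on the previous one.
    """
    k = start
    while (k + 1) in scores and scores[k + 1] > scores[k]:
        k += 1
    return k


# Made-up scores for illustration only (assumptions, not real results).
silhouette_by_k = {2: 0.41, 3: 0.35, 4: 0.30, 5: 0.28, 6: 0.22}
loglik_by_k = {1: -12.4, 2: -11.1, 3: -10.6, 4: -10.3, 5: -10.1, 6: -10.2}

k_silhouette = pick_k_by_max(silhouette_by_k)
k_loglik = pick_k_until_no_gain(loglik_by_k)

print(k_silhouette)  # 2 with these made-up scores
print(k_loglik)      # 5 with these made-up scores
```

With these invented numbers the two rules land on different values of k, even though neither is wrong by its own criterion.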

Let me know if you have more questions

Mau

Hi mauuuuu5,

I think it was this one: exercice20.knwf (72.2 KB)
