Clustering question

w0rdz · March 28, 2021, 3:02pm

Hi,

I’m sorry if this question seems quite basic, but I am new to KNIME and a little confused about some aspects of clustering.

I have a dataset with 10 categorical variables, and I would like to perform clustering on it. I have used the k-means algorithm to try to cluster the data, after one-hot encoding all of my categorical variables with the One to Many node. I set the number of clusters to 5 and got a result. My question is: is this the right way to approach the problem, or should I be using k-modes clustering instead of k-means?

Example of data:

Sex, Location, Language, Process Stage

Any help would be greatly appreciated!

Eric

mauuuuu5 · March 28, 2021, 4:15pm

Hi Eric, welcome to the KNIME forum.

I guess that one-hot encoding can work with categorical variables in k-means, but k-means requires that all variables be continuous (see).
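
To make that trade-off concrete, here is a minimal Python sketch. It is not part of the attached workflow, and the records are made-up values for the columns in the question. It shows two things: on one-hot data, squared Euclidean distance is just twice the number of mismatching attributes, and a k-means centroid ends up with fractional values that do not correspond to any real category, whereas a k-modes style center stays a valid record.

```python
from collections import Counter

# Hypothetical records for the columns in the question:
# Sex, Location, Language, Process Stage.
records = [
    ("F", "North", "English", "Intake"),
    ("F", "North", "French", "Review"),
    ("M", "South", "English", "Intake"),
]


def one_hot(rows):
    """Return one 0/1 column per (column index, category) pair."""
    columns = sorted({(i, value) for row in rows for i, value in enumerate(row)})
    return [[1.0 if row[i] == value else 0.0 for i, value in columns] for row in rows]


def squared_euclidean(a, b):
    """Squared Euclidean distance, the quantity k-means minimises."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mismatches(a, b):
    """Simple matching dissimilarity used by k-modes: count differing attributes."""
    return sum(x != y for x, y in zip(a, b))


encoded = one_hot(records)

# Each mismatching attribute flips two one-hot columns, so it adds exactly 2.
d_onehot = squared_euclidean(encoded[0], encoded[1])
d_match = mismatches(records[0], records[1])
print(d_onehot, d_match)

# A k-means centroid is the column-wise mean: fractional, not a real category.
centroid = [sum(column) / len(encoded) for column in zip(*encoded)]
print(centroid)

# A k-modes center takes the most frequent category per attribute instead.
mode = tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))
print(mode)
```

So k-means on one-hot columns will run, but its cluster centers are averages of indicator columns rather than interpretable category combinations.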

To be honest, I do not have much experience with k-medoids for handling mixed data. However, I am attaching a workflow that uses the k-prototypes method in R and calculates the “optimal” number of clusters with the silhouette coefficient, which should be complemented with other methods.
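
For readers who have not met k-prototypes before, the idea can be sketched in a few lines of Python. This is only an illustration of the usual mixed dissimilarity (squared Euclidean distance on the numeric columns plus a weight times the number of categorical mismatches), not the R code inside the workflow. The weight `gamma`, the column choice, and all values below are assumptions made up for the example.

```python
def kprototypes_dissimilarity(a_num, a_cat, b_num, b_cat, gamma):
    """Mixed dissimilarity: numeric squared distance + gamma * categorical mismatches."""
    numeric_part = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    categorical_part = sum(x != y for x, y in zip(a_cat, b_cat))
    return numeric_part + gamma * categorical_part


# Hypothetical record: two standardised numeric values and two categories.
record_num, record_cat = [0.2, -1.0], ("admin.", "married")

# Hypothetical prototypes: numeric means plus categorical modes of each cluster.
prototypes = [
    ([0.0, -0.5], ("admin.", "single")),
    ([1.5, 2.0], ("technician", "married")),
]

gamma = 0.5  # assumed weight; it balances numeric against categorical columns

distances = [
    kprototypes_dissimilarity(record_num, record_cat, p_num, p_cat, gamma)
    for p_num, p_cat in prototypes
]
# Assign the record to the prototype with the smallest dissimilarity.
nearest = min(range(len(prototypes)), key=lambda i: distances[i])
print(distances, nearest)
```

With purely categorical data, as in the original question, the numeric part vanishes and this reduces to the k-modes matching dissimilarity.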

Kprototypes.knwf (26.0 KB)

The dataset bank-full.csv can be found here on Kaggle.

When I was starting out, I found a nice workflow by @Fabien_Couprie that used PCA, the Calinski-Harabasz index, and a nice 3D representation of the clusters. (Fabien, could you please upload it again or share the link?)

Cheers

w0rdz · March 29, 2021, 12:59am

Hi mauuuuu5,

Thanks for the clarification.

I guess that certainly rules out k-means for clustering given my data. I’ll take a look at the workflow you linked and see how far that takes me.

Thanks for the quick reply!

w

mauuuuu5 · March 29, 2021, 2:04am

Sure, let me know if you have any questions.

Mau

w0rdz · March 29, 2021, 8:54pm

Hi Mau,

I took a look at the k-prototypes workflow example you gave, but I am a little unclear about what it is doing. Is it determining the correct number of clusters depending on which clustering algorithm is used? Also, why would it use only a sample of the records to determine the right number of clusters?

Sorry if these are basic questions.

W

mauuuuu5 · March 30, 2021, 3:13am

Hi w0rdz,

One part of the workflow calculates the “optimal” number of clusters on a sample because of computation time. However, it is advisable to use the whole dataset, or a representative sample, to obtain the best silhouette coefficient.
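
The reason for sampling is easier to see in code. Below is a minimal Python sketch of the silhouette coefficient, written from the standard definition rather than taken from the workflow. For every point it needs the distance to every other point, so the cost grows with the square of the number of rows. The helper `sampled_silhouette`, the toy points, and the labels are assumptions for illustration only.

```python
import random


def mismatches(a, b):
    """Number of differing categorical attributes."""
    return sum(x != y for x, y in zip(a, b))


def silhouette_score(points, labels, distance):
    """Mean silhouette over all points; needs n * (n - 1) distance evaluations."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, point in enumerate(points):
        sums = {c: 0.0 for c in clusters}
        counts = {c: 0 for c in clusters}
        for j, other in enumerate(points):
            if i != j:
                sums[labels[j]] += distance(point, other)
                counts[labels[j]] += 1
        own = labels[i]
        if counts[own] == 0:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sums[own] / counts[own]  # mean distance inside the own cluster
        b = min(sums[c] / counts[c] for c in clusters if c != own)  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def sampled_silhouette(points, labels, distance, sample_size, seed=0):
    """Hypothetical helper: score a random subset to cut the quadratic cost."""
    idx = random.Random(seed).sample(range(len(points)), sample_size)
    return silhouette_score([points[i] for i in idx], [labels[i] for i in idx], distance)


# Toy data: two obvious groups of categorical records.
points = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 1, 0, 1]

good = silhouette_score(points, good_labels, mismatches)
bad = silhouette_score(points, bad_labels, mismatches)
print(good, bad)
```

A sample that is too small or not representative can shift the score, which is why the whole dataset is preferable when the run time allows it.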

Regarding the algorithm, the workflow uses k-prototypes throughout. In the lower part of the workflow I am using the EM algorithm provided by Weka, just to show you another approach. My apologies for not mentioning it before.

I reordered the workflow to make it more understandable.

Let me know if you have more questions

Kprototypes.knwf (29.8 KB)

w0rdz · March 31, 2021, 9:38pm

Hi Mau,

First, I wanted to say thanks for the additional workflow. That really helped to explain the steps. The one question I have is about determining the number of clusters. When using your workflow, I can see that the kprototypes R snippet produces an output that says 2 is the optimal number of clusters. Lower in the workflow, you have the Weka EM and Cluster Assigner that use 5 clusters.

I thought that with the silhouette coefficient showing that 2 is the ideal number of clusters, Weka would also use only 2 clusters…

Why are they different?

W

mauuuuu5 · March 31, 2021, 11:57pm

Hi w0rdz, nice that you found the workflow useful.

Regarding the number of clusters, k-prototypes uses the silhouette coefficient, while EM uses an iterative process that runs until the log-likelihood stops increasing (see).

So they are using different metrics, and in this case they do not give the same number of clusters, I guess because the clustering algorithms differ. You can read this post about the methods to obtain the number of clusters.
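
As a rough illustration of that stopping rule, here is a minimal Python sketch of EM for a one-dimensional Gaussian mixture. It is a generic textbook version, not Weka's implementation, and it works on numeric toy data rather than the mixed data in the workflow. The data, starting means, tolerance, and variance floor are all assumptions. The loop ends once the log-likelihood improves by less than `tol`.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, initial_means, tol=1e-6, max_iter=200):
    """Fit a 1-D Gaussian mixture; stop when the log-likelihood stops increasing."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    history = []  # log-likelihood at each iteration
    for _ in range(max_iter):
        # E-step: responsibility of each component for each point.
        responsibilities = []
        log_likelihood = 0.0
        for x in data:
            dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            log_likelihood += math.log(total)
            responsibilities.append([d / total for d in dens])
        history.append(log_likelihood)
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break  # no meaningful improvement any more
        # M-step: re-estimate weights, means, and variances.
        for c in range(k):
            nk = sum(r[c] for r in responsibilities)
            weights[c] = nk / len(data)
            means[c] = sum(r[c] * x for r, x in zip(responsibilities, data)) / nk
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(responsibilities, data)) / nk
            variances[c] = max(var, 1e-6)  # assumed floor to avoid a zero variance
    return means, variances, weights, history


# Toy data with two well-separated groups around 1 and 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights, history = em_gmm_1d(data, initial_means=[0.0, 6.0])
print(means, len(history))
```

The log-likelihood measures how well a probabilistic model fits the data, while the silhouette measures how well separated the assigned clusters are, so the two criteria can reasonably favour different numbers of clusters.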

Let me know if you have more questions

Mau

Fabien_Couprie · April 8, 2021, 9:44am

Hi mauuuuu5,

I think it was this one: exercice20.knwf (72.2 KB)
