I’m sorry if this question seems quite basic, but I am new to KNIME and I am a little confused about some aspects of clustering.
I have a dataset with 10 categorical variables and I would like to perform clustering on this data. I have used the k-means algorithm to try to cluster the data. I one-hot encoded all of my categorical variables using the One to Many node. I have set the number of clusters to 5 and have a result. I guess my question is: is this the right way of approaching this problem, or should I be using k-modes clustering instead of k-means?
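For reference, here is a minimal Python sketch of the difference this question is about. It is only an illustration, not what any KNIME node does internally: the category levels and records are made up, and the helper names (one_hot, squared_euclidean, matching_dissimilarity) are hypothetical. It shows that squared Euclidean distance on one-hot vectors counts every mismatching variable twice, and that a k-means style centroid is a vector of fractions, while a k-modes style centre is an actual combination of categories.

```python
from collections import Counter

def one_hot(record, categories):
    """Encode a tuple of categorical values as a flat 0/1 vector."""
    vec = []
    for value, levels in zip(record, categories):
        vec.extend(1 if value == level else 0 for level in levels)
    return vec

def squared_euclidean(a, b):
    """Squared Euclidean distance, the quantity k-means minimises."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def matching_dissimilarity(a, b):
    """Number of variables on which two records differ (used by k-modes)."""
    return sum(x != y for x, y in zip(a, b))

# Made-up category levels and records, for illustration only.
categories = [("single", "married", "divorced"), ("yes", "no"), ("cellular", "telephone")]
r1 = ("single", "yes", "cellular")
r2 = ("married", "yes", "telephone")
r3 = ("married", "no", "telephone")

d_match = matching_dissimilarity(r1, r2)
d_onehot = squared_euclidean(one_hot(r1, categories), one_hot(r2, categories))

# Cluster centre: mean of the one-hot vectors (k-means) vs. per-variable mode (k-modes).
cluster = [r1, r2, r3]
encoded = [one_hot(r, categories) for r in cluster]
centroid = [sum(col) / len(col) for col in zip(*encoded)]
mode = tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

print(d_match, d_onehot)
print(centroid)
print(mode)
```

So k-means on one-hot data is not meaningless, but its centroids are harder to interpret than the modes, which is one common reason to consider k-modes for purely categorical data.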
To be honest, I do not have much experience with k-medoids for mixed data. However, I am attaching a workflow that uses the k-prototypes method with R and calculates the “optimal” number of clusters with the silhouette coefficient, which should be complemented with other methods.
The dataset bank-full.csv can be found here on Kaggle.
When I was starting out, I found a nice workflow by @Fabien_Couprie that used PCA, the Calinski-Harabasz index and a nice 3D representation of the clusters. (Fabien, could you please upload it again or share the link?)
I took a look at the k-prototypes workflow example you gave, but I am a little unclear about what it is doing. Is it determining the correct number of clusters depending on which clustering algorithm is used? Also, why would it use only a sample of the records to determine the right number of clusters?
Hi w0rdz
One part of the workflow calculates the “optimal” number of clusters, using a sample because of computation time. However, it is advisable to use the whole dataset, or at least a representative sample, to obtain the best silhouette coefficient.
Regarding the algorithm, it is using k-prototypes throughout, but in the lower part of the workflow I am using the EM algorithm provided by Weka, just to show you another approach. My apologies for not mentioning it before.
I reordered the workflow to make it easier to understand.
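To illustrate the idea of scoring candidate numbers of clusters on a sample, here is a small Python sketch. It is not the R code from the workflow: the data are synthetic 1-D points, the candidate labelings are written by hand instead of coming from k-prototypes, and the function names (mean_silhouette, silhouette_on_sample) are hypothetical. The silhouette needs all pairwise distances, so its cost grows quadratically with the number of rows, which is why a sample is tempting.

```python
import random

def mean_silhouette(points, labels, dist):
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def silhouette_on_sample(points, labels, dist, sample_size, seed=0):
    """Score a labeling on a random sample of rows to save computation time."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(points)), sample_size)
    return mean_silhouette([points[i] for i in idx], [labels[i] for i in idx], dist)

def abs_diff(a, b):
    return abs(a - b)

# Synthetic data: two well separated groups of 50 points each.
points = [1.0 + 0.01 * i for i in range(50)] + [5.0 + 0.01 * i for i in range(50)]

# Hand-written candidate labelings (assumption: a real run would get these from the clusterer).
candidates = {
    2: [0] * 50 + [1] * 50,
    3: [0] * 25 + [2] * 25 + [1] * 50,
}

scores = {k: silhouette_on_sample(points, labels, abs_diff, 40)
          for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

For mixed data the dist argument would have to be a mixed-type dissimilarity (for example Gower-style) rather than the absolute difference used here.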
First, I wanted to say thanks for the additional workflow. That really helped to explain the steps. The one question I have is about determining the number of clusters. When using your workflow, I can see that the k-prototypes R snippet produces an output that says that 2 is the optimal number of clusters. Lower in the workflow you have the Weka EM and Cluster Assigner, which use 5 clusters.
I thought that, with the silhouette coefficient showing that 2 is the ideal number of clusters, Weka would also use only 2 clusters…
Hi w0rdz, nice that you found the workflow useful.
Regarding the number of clusters, k-prototypes uses the silhouette coefficient, while EM uses an iterative process that runs until the log-likelihood stops increasing. (see)
So they are using different metrics, and in this case they do not give the same number of clusters, I guess because the clustering algorithms differ. You can read this post about the methods to obtain the number of clusters.
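To make the "until the log-likelihood stops increasing" rule more concrete, here is a minimal Python sketch of EM for a 1-D Gaussian mixture. It is not Weka's implementation: the data, the starting means, the tolerance and the function names (normal_pdf, fit_em) are all assumptions chosen for illustration. The loop stops as soon as the log-likelihood gain between two iterations falls below tol.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_em(data, init_means, tol=1e-6, max_iter=200):
    """EM for a 1-D Gaussian mixture; stops when the log-likelihood gain < tol."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    history = []  # log-likelihood at each iteration
    for _ in range(max_iter):
        # E-step: responsibilities of each component for each point
        resp = []
        loglik = 0.0
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        history.append(loglik)
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break  # log-likelihood has stopped increasing
        # M-step: re-estimate weights, means and variances
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsing component
    return means, history

# Made-up data with two obvious groups, and rough starting means.
data = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]
means, history = fit_em(data, init_means=[0.0, 6.0])
print(sorted(means))
print(history)
```

Note that this criterion only decides when to stop iterating for a given number of components. It optimises a different quantity than the silhouette coefficient, so the two approaches need not agree on the number of clusters.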