I want to do document clustering, but I don't know which clustering algorithm will give me better results.
The documents are high-dimensional.
I read that k-medoids can be used with any similarity measure, whereas fuzzy c-means is used only with Euclidean distance, which is not well suited to document clustering and can give inaccurate results. That would be an advantage of k-medoids over fuzzy c-means. I also read that fuzzy c-means is more efficient than k-medoids.
I don't know which one I should use.
Also, I want a document to be assignable to more than one cluster. In other words, a document can logically belong to several clusters. For example, one document can be categorized into the Information Retrieval cluster and at the same time into the Machine Learning cluster (so the winner clusters are IR and Machine Learning).
Please tell me which algorithm is better for this case (k-medoids, fuzzy c-means, or another algorithm). I would also like the steps for the recommended algorithm (the arrangement of the nodes).
Thanks in advance
For high-dimensional document vectors, the cosine measure is usually the distance or similarity measure you want to use. In KNIME you can compute pairwise distances using the distance nodes, e.g. Distance Matrix Calculate. These distances can then be used by clustering nodes later on, e.g. the K-Medoids or the Hierarchical Clustering node.
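To make the pairwise cosine distance concrete outside of KNIME, here is a minimal Python sketch. The function names and the tiny term-frequency vectors are made up for illustration, and returning a distance of 1 for an all-zero vector is just a convention chosen here. It is not what the KNIME node does internally.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumed convention for empty documents
    return 1.0 - dot / (norm_u * norm_v)

def distance_matrix(vectors):
    """Build the symmetric matrix of pairwise cosine distances."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = cosine_distance(vectors[i], vectors[j])
            dist[i][j] = dist[j][i] = d
    return dist

# Hypothetical term-frequency vectors for three documents.
docs = [
    [2, 1, 0, 0],  # document 0
    [4, 2, 0, 0],  # document 1: same direction as document 0, twice as long
    [0, 0, 3, 1],  # document 2: shares no terms with the other two
]
matrix = distance_matrix(docs)
```

Documents 0 and 1 end up with a distance of about 0 even though one is twice as long as the other, which is why cosine is usually preferred over Euclidean distance for documents of different lengths.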
Having a data point assigned to more than one cluster is not possible with K-Medoids. It is possible with Fuzzy C-Means; however, this node does not take distance matrices as input.
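The reason Fuzzy C-Means can put a point into several clusters is that it gives each point a membership degree for every cluster instead of a single label. Below is a rough Python sketch of the standard membership formula for one point, given its distances to the cluster centers. The function names, the example distances, and the 0.3 threshold are illustrative assumptions, not settings of the KNIME node.

```python
def fuzzy_memberships(distances, m=2.0):
    """Membership degrees of one point, given its distances to each center.

    Uses the standard fuzzy c-means formula with fuzzifier m > 1:
    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)).
    """
    # A point lying exactly on one or more centers belongs fully to them.
    zeros = [i for i, d in enumerate(distances) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(distances))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
            for d_i in distances]

def clusters_above(memberships, threshold):
    """Indices of all clusters whose membership reaches the threshold."""
    return [i for i, u in enumerate(memberships) if u >= threshold]

# Hypothetical distances of one document to three cluster centers.
distances_to_centers = [0.2, 0.25, 0.9]
u = fuzzy_memberships(distances_to_centers)

# Keeping only the highest membership gives one cluster per document.
winner = max(range(len(u)), key=lambda i: u[i])

# Keeping every cluster above a threshold allows overlapping assignments.
overlapping = clusters_above(u, threshold=0.3)
```

The memberships here come out at roughly 0.59, 0.38 and 0.03. If only the cluster with the highest membership is kept, the result is a disjoint clustering; overlapping clusters appear only when the membership degrees themselves are used, for example with a threshold like the one above.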
I recommend hierarchical clustering if you have fewer than 4000 documents; otherwise use K-Medoids. For both nodes you need to calculate the distance matrix beforehand, preferably with cosine distance.
You can also compute several K-Medoids clusterings in a loop over K and find the best one, e.g. using the elbow method. There is an example workflow on the example server: knime://EXAMPLES/08_Other_Analytics_Types/01_Text_Processing/17_TopicExtraction_with_the_ElbowMethod
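As a rough illustration of the loop over K, here is a small Python sketch that runs a simple alternating k-medoids on a precomputed distance matrix and then picks the K where the cost curve bends most. Everything in it is an assumption made for the example: the function names, the naive start from the first k points, the "largest second difference" elbow heuristic, and the toy distance matrix. The example workflow may well do it differently.

```python
def k_medoids(dist, k, max_iter=100):
    """Simple alternating k-medoids on a precomputed distance matrix.

    Returns (sorted medoid indices, total distance of points to their medoid).
    """
    n = len(dist)
    medoids = list(range(k))  # naive deterministic start (assumption)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # Update step: the new medoid is the member closest to all others.
        new_medoids = []
        for medoid, members in clusters.items():
            if not members:
                new_medoids.append(medoid)  # keep the medoid of an empty cluster
                continue
            best = min(members, key=lambda c: sum(dist[c][j] for j in members))
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    cost = sum(min(dist[i][m] for m in medoids) for i in range(n))
    return sorted(medoids), cost

def elbow_k(costs):
    """Pick the K with the largest bend (second difference) in the cost curve."""
    ks = sorted(costs)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = (costs[prev_k] - costs[k]) - (costs[k] - costs[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Toy distance matrix: six documents forming three tight pairs.
groups = [0, 0, 1, 1, 2, 2]
dist = [[0.0 if i == j else (0.1 if groups[i] == groups[j] else 0.9)
         for j in range(6)] for i in range(6)]

# Loop over K and record the clustering cost for each value.
costs = {k: k_medoids(dist, k)[1] for k in range(1, 5)}
best_k = elbow_k(costs)
```

On this toy matrix the cost drops sharply up to K = 3 and barely improves afterwards, so the heuristic settles on three clusters. With real documents the curve is usually less clear-cut, so it is worth looking at the plot as well.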
Thank you so much, Mr. Kilian, for your answer.
I tried your recommendation and found that K-medoids gives me better results with the cosine distance measure.
Thanks for your recommendation.
I am confused about fuzzy c-means. I used it to try to get overlapping clusters, but disjoint clusters were formed.
I have attached my workflow for document clustering using fuzzy c-means.
Why does fuzzy c-means give me disjoint clusters?
Another question, please:
Are my steps for clustering the documents correct?