Which is better for document clustering: k-medoids or fuzzy c-means?

Hi all,

I want to cluster documents, but I don't know which clustering algorithm gives better results.

The documents are high-dimensional.

I read that k-medoids can be used with any similarity measure, whereas fuzzy c-means works only with Euclidean distance, which is a poor fit for document clustering and can give inaccurate results. That would give k-medoids an advantage over fuzzy c-means. I also read that fuzzy c-means is more efficient than k-medoids.

I don't know which one I should use.

I also want a document to be able to belong to more than one cluster. For example, one document could logically belong to the Information Retrieval cluster and to the Machine Learning cluster at the same time (so the winning clusters are IR and Machine Learning).

Please tell me which algorithm is better for this case (k-medoids, fuzzy c-means, or another algorithm), and please give me the steps for the required algorithm (the arrangement of nodes).

Thanks in advance

Hi,

For high-dimensional document vectors, the cosine measure is usually the distance or similarity measure you want to use. In KNIME you can compute pairwise distances using the distance nodes, i.e. Distance Matrix Calculate. These distances can then be used by clustering nodes later on, e.g. the K-Medoids or the Hierarchical Clustering node.
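
To make the distance step concrete outside of KNIME, here is a minimal Python sketch of a pairwise cosine distance matrix. It is not what the Distance Matrix Calculate node runs internally; the function names and the toy term-count vectors are made up for illustration, and treating an all-zero vector as maximally distant is an assumption.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: an empty document is treated as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def distance_matrix(vectors):
    """Build a symmetric matrix of pairwise cosine distances."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = cosine_distance(vectors[i], vectors[j])
            dist[i][j] = dist[j][i] = d
    return dist

# Hypothetical term-count vectors for three documents.
docs = [
    [2, 1, 0, 0],
    [4, 2, 0, 0],  # same direction as the first document, just longer
    [0, 0, 3, 1],  # shares no terms with the other two
]
D = distance_matrix(docs)
```

The first two documents differ only in length, so their cosine distance is (up to rounding) zero, which is the reason this measure is usually preferred over Euclidean distance for term vectors.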

Having a data point assigned to more than one cluster is not possible with K-Medoids. It is possible with Fuzzy C-Means; however, this node does not take distance matrices as input.
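
As a rough illustration of how fuzzy memberships allow one document to sit in several clusters, here is a small Python sketch using the standard fuzzy c-means membership formula. The function names, the example distances, and the 0.3 cut-off are assumptions chosen for the example, not settings of the KNIME node.

```python
def fuzzy_memberships(distances, m=2.0):
    """Membership of one document in each cluster, given its distances
    to the cluster prototypes and a fuzzifier m > 1."""
    if any(d == 0.0 for d in distances):
        # The document coincides with one or more prototypes:
        # split full membership among those prototypes.
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]

def clusters_above(memberships, threshold):
    """Indices of all clusters whose membership reaches the threshold."""
    return [i for i, u in enumerate(memberships) if u >= threshold]

# Hypothetical document that is equally close to clusters 0 and 1.
u = fuzzy_memberships([0.2, 0.2, 0.8])
assigned = clusters_above(u, threshold=0.3)
```

Here the document ends up in clusters 0 and 1. Keeping only the single largest membership per document would instead give one cluster per document, i.e. a disjoint result.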

I recommend hierarchical clustering if you have fewer than 4000 documents; otherwise use K-Medoids. For both nodes you need to calculate the distance matrix beforehand, preferably with cosine distance.

You can also compute several K-Medoids clusterings in a loop over K and find the best one, e.g. using the elbow method. There is an example workflow on the example server: knime://EXAMPLES/08_Other_Analytics_Types/01_Text_Processing/17_TopicExtraction_with_the_ElbowMethod
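
The loop over K can be sketched in Python roughly as follows. This is a simple alternating (assign, then update medoids) k-medoids on a precomputed distance matrix, not the implementation behind the KNIME node and not the example workflow. The function name, the first-k initialisation, and the toy distances are assumptions; in practice the matrix would hold cosine distances between documents.

```python
def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix.
    Returns (medoid indices, cluster labels, total cost)."""
    n = len(dist)
    medoids = list(range(k))  # assumption: start from the first k points

    def assign(current):
        # Each point goes to the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][current[c]])
                for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep medoid of empty cluster
                continue
            # New medoid: the member with the smallest summed distance
            # to all other members of its cluster.
            best = min(members,
                       key=lambda cand: sum(dist[cand][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids

    labels = assign(medoids)
    cost = sum(dist[i][medoids[labels[i]]] for i in range(n))
    return medoids, labels, cost

# Toy distance matrix: two well-separated groups of three points each.
points = [0, 1, 2, 50, 51, 52]
dist = [[abs(a - b) for b in points] for a in points]

# Elbow method: record the total cost for each K and look for the bend.
costs = {k: k_medoids(dist, k)[2] for k in range(1, 4)}
```

For this toy data the cost drops sharply from K=1 to K=2 and barely improves at K=3, so the elbow is at K=2.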

Cheers, Kilian

Thank you so much, Mr. Kilian, for your answer.

I tried your recommendation and found that K-Medoids gives me better results with the cosine distance measure.

Thanks for the recommendation.

I am confused about fuzzy c-means. I used it to try to get overlapping clusters, but disjoint clusters were formed.

I have attached my workflow for document clustering using fuzzy c-means.

Why does fuzzy c-means give me disjoint clusters?

Another question, please:

Are my steps for clustering the documents correct?