Entropy Scorer interpretation in clustering

mauuuuu5 · September 2, 2015, 6:33am

Hi guys, I am using the Entropy Scorer to evaluate a clustering against a reference clustering, and I would appreciate your help with these questions.

1. Is the reference clustering something like a benchmark, used to tell whether the clustering I am comparing against it is "better" or "worse"?

2. Regardless of the previous question, I want to know whether the following results are correct. Basically, I am comparing K-means against Fuzzy C-means and vice versa.

Fuzzy C means as Reference cluster vs Kmeans

Score	Value
Entropy:	0,477
Quality:	0,699

Kmeans as Reference vs Fuzzy C Means

Score	Value
Entropy:	0,4434
Quality:	0,7203

So I wonder whether Fuzzy C-means is better than K-means, since the quality is higher (0,7203) when I use the latter as the reference.

Thank you

entropy_scorer.zip

wiswedel · October 29, 2015, 12:48pm

As for question 1: Yes, a reference clustering is like a benchmark. For each group in the reference, the scorer looks at the purity (entropy based) of your calculated clustering result. If a group is pure (all the same values), you have low or zero entropy and the score goes up. If it's all mixed up (high entropy), the score goes down. This is done for all groups and the scores are aggregated. Note that when you have many groups in your reference you naturally get a higher score as the groups get smaller (think of 1 record per group - it's pure by definition).
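
To make the aggregation described above concrete, here is a minimal Python sketch. It follows the description in this post, not the node's source code, so the function names (weighted_entropy, quality) and the normalisation by log2 of the cluster count are assumptions. The normalisation is at least consistent with the numbers posted above: 1 - 0.477 / log2(3) is about 0.699 and 1 - 0.4434 / log2(3) is about 0.7203, which would fit three clusters.

```python
from collections import Counter, defaultdict
from math import log2


def weighted_entropy(reference, clustering):
    """Size-weighted entropy of the clustering labels inside each reference group.

    Assumption: this mirrors the verbal description in the post, not
    necessarily the exact formula used by the Entropy Scorer node.
    """
    groups = defaultdict(list)
    for ref_label, cluster_label in zip(reference, clustering):
        groups[ref_label].append(cluster_label)

    total = len(reference)
    score = 0.0
    for members in groups.values():
        n = len(members)
        counts = Counter(members)
        # Shannon entropy of the cluster labels seen in this reference group
        h = -sum((c / n) * log2(c / n) for c in counts.values())
        score += (n / total) * h
    return score


def quality(entropy_value, n_clusters):
    """Assumed normalisation: 1 - entropy / log2(number of clusters)."""
    return 1.0 - entropy_value / log2(n_clusters)


reference = ["a", "a", "b", "b"]
pure = weighted_entropy(reference, [1, 1, 2, 2])    # every group is pure
mixed = weighted_entropy(reference, [1, 2, 1, 2])   # every group is a 50/50 mix
singletons = weighted_entropy(["a", "b", "c", "d"], [1, 2, 1, 2])  # 1 record per group

print(pure, mixed, singletons)
print(round(quality(0.477, 3), 3), round(quality(0.4434, 3), 4))
```

The last toy case shows the caveat from the post: with one record per reference group, every group counts as pure, whatever the clustering looks like.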

As for question 2: My first reaction was: no, it makes no sense, as you are missing the ground truth and the notion of "better" is fuzzy in clustering tasks. But the question is a good one, as you presumably have the same cluster count (c/k) and your algorithms also have the same bias (center based). I think all it means is that your fuzzy c-means clustering matches your k-means better than your k-means matches the c-means output (if that makes sense?) - but they can both still suck or both be good.
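
A small Python sketch of why the two directions can give different numbers. It assumes the score is a size-weighted conditional entropy; the helper weighted_entropy and the toy labelings kmeans and fuzzy are hypothetical and not taken from the attached workflow. Under that assumption, the gap between the two directions equals the difference between the entropies of the two cluster-size distributions, so it reflects how evenly each clustering splits the data, not which one is closer to a ground truth.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def weighted_entropy(reference, clustering):
    """Assumed score: entropy of clustering labels per reference group, weighted by group size."""
    n = len(reference)
    score = 0.0
    for group in set(reference):
        members = [c for r, c in zip(reference, clustering) if r == group]
        score += (len(members) / n) * entropy(members)
    return score


# Hypothetical labelings of the same 8 records, 3 clusters each
kmeans = [0, 0, 0, 0, 1, 1, 2, 2]
fuzzy = [0, 0, 0, 1, 1, 1, 2, 2]

kmeans_as_ref = weighted_entropy(kmeans, fuzzy)
fuzzy_as_ref = weighted_entropy(fuzzy, kmeans)

# The asymmetry comes only from the cluster-size distributions
gap = kmeans_as_ref - fuzzy_as_ref
marginal_gap = entropy(fuzzy) - entropy(kmeans)

print(round(kmeans_as_ref, 4), round(fuzzy_as_ref, 4))  # 0.4056 0.3444
print(round(gap, 4), round(marginal_gap, 4))            # 0.0613 0.0613
```

Neither labeling here is "true", yet one direction still scores lower than the other, simply because the cluster sizes differ.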

Thanks for giving me 5min of interesting thoughts!