Entropy Scorer

Hi,

I want to find the entropy for my clusters, but I’m only getting the entropy for their sizes rather than for the number of clusters. Is there any way I can fix that? I need to determine the entropy for each cluster count in the range 2-10.

Thank you in advance

What clustering algorithm are you using? I assume you want to optimize the number of clusters.

Sorry for any inconsistencies; I only started learning KNIME yesterday. This is my workflow so far:

ARFF → Column Filter → k-Means → Category to Number → Column Filter → GroupBy → Entropy Scorer (I connected the second column filter to the entropy scorer as well as the GroupBy).

I thought the problem was that my cluster string was still showing up after I added Category to Number, which is why I added a second Column Filter to remove the cluster string and keep only the numeric cluster column, but the problem remains.

Hi,
I’m not an expert in clustering methods, but I found this link, Description of Entropy calculation, and the calculation there is exactly the same as in the Entropy Scorer.
Have you checked your code?

250201_Entropy – KNIME Community Hub

Thank you for the reply! I did not really understand the description; I think it was more about Python? I’m sure the calculation is the same, but I couldn’t really grasp it.

No, the second answer gives you all the mathematics:

Unfortunately, the link in the node description is broken:
Broken Link

You might want to think about using either the elbow or silhouette method. I’m on my phone right now. I’ll send you some info when I get on my computer in a few hours.

Now I can see, thank you! That is exactly what I have done (or so I think). I’m getting numbers for the entropy, but those numbers are for the sizes of the clusters and not for the number of clusters, unfortunately.

Thank you for the reply! Yes, I’m supposed to use the elbow method after I have gotten the entropy for all the different clusters. Only issue is, as I mentioned above, that I’m getting the entropy for the clusters’ sizes rather than the amount/number of clusters (ranging from 2 to 10).

See if this helps. Is this a school assignment where you’re forced to use the entropy node?

As far as I understand the entropy metric, it can’t tell you how many clusters you need.
It’s rather used to rate your model when you have pre-labeled data, i.e. the entropy compares the predicted clustering against a ground truth (labeled data).
In unsupervised learning this ground truth is not available. There you can apply methods like k-means and use the elbow method to get an idea of how many clusters are optimal.
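
To make the comparison against a ground truth concrete, here is a minimal Python sketch. It assumes base-2 Shannon entropy inside each cluster and a size-weighted average as the overall value; that is a common textbook definition and is not taken from the Entropy Scorer's implementation. The function name `cluster_entropy` and the toy data are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def cluster_entropy(cluster_ids, reference_labels):
    """Return (entropy per cluster, size-weighted overall entropy), log base 2."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, reference_labels):
        members[cid].append(label)

    per_cluster = {}
    for cid, labels in members.items():
        n = len(labels)
        counts = Counter(labels)
        # Shannon entropy of the reference-label mix inside this cluster,
        # written as sum(p * log2(1/p)) so a pure cluster gives exactly 0.0
        per_cluster[cid] = sum((c / n) * math.log2(n / c) for c in counts.values())

    total = len(cluster_ids)
    # Assumption: overall value = average of cluster entropies weighted by size
    overall = sum(len(members[cid]) / total * h for cid, h in per_cluster.items())
    return per_cluster, overall


# Hypothetical toy data: six rows, two clusters, reference classes "a" and "b"
clusters = [0, 0, 0, 1, 1, 1]
reference = ["a", "a", "a", "b", "b", "a"]
per_cluster, overall = cluster_entropy(clusters, reference)
print(per_cluster)
print(overall)
```

Under these assumptions a single clustering run gives one entropy per cluster plus one overall number, so comparing 2 to 10 clusters would mean repeating the run once per cluster count and keeping the overall value each time.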

Thank you, I will look through these and see if they’re helpful! Yes, it’s a school assignment, so unfortunately I’m forced to use the Entropy Scorer.

Also this is my workflow and these are the numbers I’m getting

That’s fine but there’s nothing from the entropy node to feed to either the elbow or silhouette methods. They’re a separate issue.

So if I understand you correctly, my workflow is fine and the numbers the Entropy Scorer is showing are correct?

For the elbow method we are supposed to put together the numbers in an excel diagram and work through it from there rather than on KNIME.

Can you explain “put together the numbers in an excel diagram and work through it from there rather than on KNIME.”?

For our assignment it says we should use the elbow method to find the optimal number of clusters: take the entropy values we get for the different cluster counts and plot them in an Excel diagram. Afterwards we evaluate the best number of clusters with the help of the diagram.
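
For anyone who wants to see what reading the elbow off such a diagram amounts to, here is a rough Python sketch. It assumes you already have one overall score per cluster count, and it uses one simple heuristic among several: rescale both axes to 0..1 and pick the point farthest from the straight line joining the first and last points. The function name `elbow_k` and the numbers in `scores_by_k` are placeholders, not results from this thread.

```python
def elbow_k(scores):
    """Pick the k farthest from the chord between the first and last point.

    Assumes at least two cluster counts and that the first and last
    scores differ (otherwise the rescaling would divide by zero).
    """
    ks = sorted(scores)
    k_min, k_max = ks[0], ks[-1]
    s_first, s_last = scores[k_min], scores[k_max]

    def gap(k):
        # After rescaling, the first point is (0, 1) and the last is (1, 0),
        # so the chord is x + y = 1 and the distance to it is
        # proportional to |x + y - 1|.
        x = (k - k_min) / (k_max - k_min)
        y = (scores[k] - s_last) / (s_first - s_last)
        return abs(x + y - 1.0)

    return max(ks, key=gap)


# Placeholder values for illustration only: one overall score per k
scores_by_k = {2: 1.00, 3: 0.60, 4: 0.35, 5: 0.30, 6: 0.27,
               7: 0.25, 8: 0.24, 9: 0.23, 10: 0.22}
best_k = elbow_k(scores_by_k)
print(best_k)
```

Plotting the same pairs as a line chart in Excel and judging the bend by eye is the manual version of this; the heuristic only helps when the bend is not obvious.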

I have no idea what you mean by “Excel diagram”

Your assignment might ask for something like this.

Having said that, an entropy-based measure isn’t a good test for choosing the number of clusters; it’s only good for measuring the total entropy after clustering is done. The elbow method does not have to use entropy as the y-value.

When I did my academic assignment years ago (which is what led me to KNIME, since I first discovered KNIME in an academic journal), the Silhouette Coefficient was enough to evaluate the clustering.
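
In case the Silhouette Coefficient is new to you, here is a small Python sketch of the standard definition: for each point, a is the mean distance to the other points in its own cluster, b is the lowest mean distance to any other cluster, and the score is (b - a) / max(a, b), averaged over all points. It assumes Euclidean distance, at least two clusters, and points that are not all identical. The function name and the four toy points are made up for illustration.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    def mean_dist(i, members):
        # Mean distance from point i to the listed members, excluding i itself
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        a = mean_dist(i, own)
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != label)
        total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data: two tight pairs of points far apart from each other
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # pairs kept together
bad = silhouette_score(points, [0, 1, 0, 1])   # pairs split apart
print(good, bad)
```

Values near 1 mean the clusters are compact and well separated, and negative values mean points sit closer to another cluster than to their own, which is why the score can be compared across different cluster counts without any labels.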

Each to their own.

Thank you! Yes, I think our diagram should look somewhat similar to that. I understand your point, but unfortunately I don’t have the option to choose which method to use. But it is definitely something I will look into.