Clustering a file with COVID data

Dear Knimers,
I have another question, this time about finding the best number of clusters (k) for my data:
a) I have generated (in KNIME) the following Excel file with my preprocessed data:
333_Regs_yyyy-MM_rates_pops_forkMeans.xlsx (31.9 KB)
In this file, I grouped COVID cases by the 21 official regions of our state (RS, Brazil) and by month (I studied the first 16 months of this pandemic). Next, I listed the total/female/male populations by region, as well as the counts of cases, hospitalizations, and deaths by region and by month. Finally, since population varies quite a lot between the capital city and the remaining ones, I calculated rates: the incidence rate (= (number of new cases / total population) * 1000, i.e. new cases per 1,000 inhabitants); the hospitalization rate (= (number of hospitalized patients / number of cases in the same population and month) * 100%); and the lethality rate (= (number of deceased patients / number of cases in the same population and month) * 100%).
b) Now I need to cluster these data and choose a loop that is good and simple for finding the lowest k that adequately represents my data as dense, well-separated clusters.
c) Finally, I need to plot these clusters in order to analyse them visually.
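Written out, the three rates in (a) look roughly like this. The subscripts r (region) and m (month) are only notation assumed for this post; they are not column names in the Excel file:

```latex
\begin{align*}
\text{incidence}_{r,m} &= \frac{\text{new cases}_{r,m}}{\text{total population}_{r}} \times 1000 \\
\text{hospitalization rate}_{r,m} &= \frac{\text{hospitalized patients}_{r,m}}{\text{cases}_{r,m}} \times 100\% \\
\text{lethality rate}_{r,m} &= \frac{\text{deceased patients}_{r,m}}{\text{cases}_{r,m}} \times 100\%
\end{align*}
```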
I tried the elbow method and the Silhouette coefficient for this task, but I am afraid I didn’t understand exactly how to apply them.
Can someone help me?
Thank you all for any help.
B.R.,
Rogério.

Hi @rogerius1st,

Just a quick clarifying question: are you having trouble understanding, in theory, how the elbow method and the Silhouette coefficient find the optimal number of clusters “k”, or are you having trouble integrating these two methods into your workflows?

For the latter, here is a link to a workflow that may be helpful: Clustering_And_Elbow_Graph – KNIME Hub

Cheers,
Dashiell

Dear Dashiell,
Thanks for your answer, and sorry for the delay. I’ve been spending some time trying to follow your suggestions and adapt them to my situation.

  1. I’m not quite sure whether my issues come from my limited theoretical knowledge of the Elbow Method or the Silhouette Coefficient. I have already read a few things about both of them, and I have applied them in a few exercises. Nevertheless, I haven’t yet managed to build a suitable workflow configuration with either of them. I’ve found some posts (here, at Knime.forum) that use loops to find the best k values, but I couldn’t integrate them into my workflows.
  2. I have preprocessed my data and stored them in an Excel file:
    333_Regs_yyyy-MM_rates_pops_for_kMeans_Diversos(2-25)k.xlsx (8.2 KB)
    The original 1,330,000 cases were grouped into 333 region-months (one record for each of the 21 regions, each a group of neighboring municipalities, in each of the 16 months of my research: 21 × 16 = 336, minus 3 missing values).
  3. I tried two possible paths for these loops, both suggested in Knime.Forum, with: a) “Table row to variable loop”; and b) “Parameter optimization loop”. Here is the image of what I used:

    Unfortunately, I couldn’t quite understand the results of either loop.
  4. I downloaded your workflow example, but I couldn’t get any further with it because it includes two Python nodes and another node that uses Entropy for scoring (which I haven’t studied yet). I currently have no Python skills; in fact, I have no previous training in any formal programming language. Therefore, I’d like to ask for different options (in Knime, of course) that use only its “no-code/low-code nodes” (i.e., nodes that require no or only a few written lines of code).

Hi @rogerius1st,

Was the Optimized K-Means (Silhouette Coefficient) component close to what you’re looking for? It comes in handy for most cases when I’m trying to optimize K-Means. And if you click into the component it serves as a nice example of how to use the parameter optimization loop nodes with K-Means without Python code.
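In case it helps to see the logic the loop carries out, here is a rough sketch in plain Python. You don't need it to run anything in KNIME, and it is not the component's actual implementation: the function names (`kmeans`, `mean_silhouette`), the farthest-point seeding (chosen only so the sketch is reproducible), and the toy data are all assumptions made for illustration. The idea is simply: run k-means once per candidate k, score each result with the mean Silhouette coefficient, and keep the k with the highest score.

```python
import math

def kmeans(points, k, iters=100):
    """Lloyd's k-means with farthest-point seeding; returns one label per point."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point whose nearest existing seed is farthest away.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        # Assign every point to its nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its members (left in place if empty).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def mean_silhouette(points, labels):
    """Mean Silhouette coefficient: near 1 = dense, well-separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two non-empty clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-point clusters
            continue
        # a: mean distance to the other points of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data (made up): three 3x3 grids of points around well-separated centres.
offsets = (-0.5, 0.0, 0.5)
toy = [(cx + dx, cy + dy)
       for cx, cy in [(0, 0), (10, 0), (0, 10)]
       for dx in offsets for dy in offsets]

# The "loop": try each candidate k and keep the best-scoring one.
scores = {k: mean_silhouette(toy, kmeans(toy, k)) for k in range(2, 7)}
best = max(scores, key=scores.get)
print("best k:", best)
for k, s in scores.items():
    print(k, round(s, 3))
```

In the KNIME version, the parameter optimization loop plays the role of the `range(2, 7)` sweep, and the Silhouette score is the value the loop is told to maximize.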

Also if you haven’t already, I’d recommend checking out KNIME’s [L4-ML] Introduction to Machine Learning Algorithms self-paced course. There’s a Clustering module in the course that was helpful for me when I was getting stuck with clustering workflows.

Cheers,
Dashiell
