Clustering using both numerical and categorical data with k-medoids

Hi everybody, I want to perform a clustering analysis using categorical and numerical data simultaneously, and for that I used the k-medoids node.

Could someone please check the attached workflow and tell me whether I am committing a "statistical abomination"? The categorical variables are fed into the algorithm as string distances and then combined with the numeric distances.

I am also attaching the input table, as the workflow would otherwise weigh 28 MB.
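For reference, combining a categorical mismatch with numeric distances is not unusual in itself: Gower's dissimilarity does this by scaling every column's contribution to [0, 1] and averaging. Below is a minimal sketch of that idea in plain Python. The records, column positions and function names are hypothetical, and it is not necessarily what the k-medoids node or the attached workflow computes.

```python
# Hypothetical records: (age, income, colour, fuel). Not taken from the attached table.
rows = [
    (25, 30000.0, "red", "petrol"),
    (40, 52000.0, "blue", "diesel"),
    (33, 41000.0, "red", "diesel"),
]
NUMERIC = [0, 1]       # positions of the numeric columns
CATEGORICAL = [2, 3]   # positions of the categorical columns


def column_ranges(data, numeric_idx):
    """Range (max - min) of each numeric column, used to scale differences to [0, 1]."""
    return {j: max(r[j] for r in data) - min(r[j] for r in data) for j in numeric_idx}


def gower(a, b, ranges):
    """Average of per-column dissimilarities, each of which lies in [0, 1]."""
    parts = []
    for j, rng in ranges.items():
        # Numeric column: absolute difference divided by the column range.
        parts.append(abs(a[j] - b[j]) / rng if rng else 0.0)
    for j in CATEGORICAL:
        # Categorical column: 0 if the labels match, 1 otherwise.
        parts.append(0.0 if a[j] == b[j] else 1.0)
    return sum(parts) / len(parts)


ranges = column_ranges(rows, NUMERIC)
# Full pairwise dissimilarity matrix, the kind of input k-medoids works on.
matrix = [[gower(a, b, ranges) for b in rows] for a in rows]
```

Because k-medoids only needs pairwise dissimilarities, a matrix like this can be used directly; whether the weighting between column types is sensible still depends on the data.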

Best regards

 

 

I'd change the categorical variable(s) to one-hot encoding using the 'one-to-many' node. You will probably have to use 'normalize' to scale the other numerical columns to the range 0 to 1 so that all columns have the same weight.

Use 'denormalize' and 'many-to-one' to get back to your original data format.
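Outside KNIME, the same preparation can be sketched in a few lines of plain Python. This is only an illustration of one-hot encoding plus min-max scaling; the column values and helper names are made up and do not come from the attached table or from the nodes themselves.

```python
def min_max(values):
    """Scale a numeric column to the range 0..1 (a constant column maps to 0)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def one_hot(values):
    """Return the sorted category list and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    return categories, [[1.0 if v == c else 0.0 for c in categories] for v in values]


# Hypothetical columns.
ages = [25, 40, 33]
colours = ["red", "blue", "red"]

scaled_age = min_max(ages)
categories, encoded = one_hot(colours)

# One feature row per record: scaled numeric value followed by the indicators.
features = [[a] + row for a, row in zip(scaled_age, encoded)]
```

One caveat: two records with different categories differ in two indicator columns, so with a Euclidean distance a category mismatch contributes sqrt(2), while a scaled numeric column contributes at most 1. The columns therefore end up with similar, not identical, weight.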

If the categorical data is actually ordinal, then it is OK to encode the categories as integers and then scale them like the other numerical columns.
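A minimal sketch of the ordinal case, assuming a hypothetical three-level variable whose order is known in advance (the level names and the helper are illustrative only):

```python
# Hypothetical ordered levels, lowest first.
ORDER = ["low", "medium", "high"]


def encode_ordinal(values, order):
    """Map each level to its rank, then scale the ranks to the range 0..1."""
    rank = {level: i for i, level in enumerate(order)}
    top = max(len(order) - 1, 1)  # avoid dividing by zero for a single level
    return [rank[v] / top for v in values]


encoded = encode_ordinal(["low", "high", "medium", "low"], ORDER)
```

Note that this treats the levels as equally spaced, which is an assumption about the data rather than something the encoding can check.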

Hi,

I made an AFDM metanode to do this. Use it before your clustering algorithm, and all of your output data will be numerical and on the same scale. The coordinates of the individuals are on the fourth slot, named "coordonnées des individus" ("coordinates of individuals"). Sorry, it's in French. Let me know if it helps.

Fabien

 

Thank you for your help, Fabien. Could you provide the "AUTOS2005subset.txt" data? I searched, but it is only mentioned here:

https://eric.univ-lyon2.fr/~ricco/cours/slides/AFDM.pdf

 

Thank you

 

Thank you, I will do that kind of data preparation in that case.

Hi,

The file is attached.

Thank you, Fabien. I was looking at the output and I want to check whether I should consider just the first part of the data (red dashed line) or the whole data.

Mau

 

 

 

Those are the eigenvectors (the matrix that defines the new numerical variables in terms of the old ones). Some of them are 0 or null because you have categorical variables. If you want the transformed data (individuals/rows), it is in slot number 3 (the fourth slot, since numbering starts at 0), named "coordonnées des individus".
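To illustrate the difference between the two tables, here is a small sketch in plain Python. It runs an ordinary PCA on two made-up numeric columns, using power iteration for the first axis. It is not the metanode's actual computation, which also has to encode the categorical variables first; it only shows that an eigenvector has one entry per variable, while the coordinates have one entry per individual.

```python
def center(data):
    """Subtract the column mean from every value."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[v - m for v, m in zip(row, means)] for row in data]


def covariance(centered):
    """Covariance matrix of already-centered data (divides by n)."""
    n = len(centered)
    p = len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(p)]
            for i in range(p)]


def first_axis(cov, iterations=200):
    """Power iteration: approximate leading eigenvector of a symmetric matrix."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Hypothetical data: 4 individuals (rows) described by 2 numeric variables.
data = [[2.0, 1.0], [4.0, 3.0], [6.0, 5.0], [8.0, 9.0]]

centered = center(data)
axis = first_axis(covariance(centered))  # one weight per original variable

# Coordinates of the individuals: each centered row projected onto the axis.
coordinates = [sum(x * a for x, a in zip(row, axis)) for row in centered]
```

Here `axis` plays the role of one eigenvector column and `coordinates` the role of one column of the individuals' table; it is the latter that would be passed on to a clustering step.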

Thanks again, Fabien. The workflow is amazing and I found the proper table. One last question: is the "coordonnées des individus" output the result of applying a Factorial Analysis of Mixed Data?

Best Regards

Mau

 

Yes, it is.