Cluster with binary, categorical and numerical variables

iiiaaa · March 1, 2017, 2:04pm

Hi all.

I have a dataset with continuous, categorical and binary variables.

For sure I can't use K-means because of the binary variables.

What are the clustering algorithms that I can use to manage both numerical and binary data?

Do I still need to normalize the data first? What kind of normalization (0-1 or z-score) is suggested for binary data?

Thanks in advance

Regards

Geo · March 1, 2017, 10:24pm

Sure, you can use K-means or hierarchical clustering as long as you convert everything to numerical. Binary variables as in "true" / "false"? If yes, use Rule Engine to convert "true" to 1 and "false" to 0. For categorical variables, apply One To Many, then use Column Filter to delete one dummy variable per categorical variable, so that it serves as the reference category.
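As a rough illustration of the conversion described above, outside KNIME, here is a minimal Python sketch. The helper names `to_binary` and `dummy_code` and the example data are hypothetical; they only mimic what Rule Engine, One To Many and Column Filter would do together.

```python
def to_binary(value):
    """Map "true"/"false" strings (any case) to 1/0."""
    return 1 if str(value).strip().lower() == "true" else 0


def dummy_code(values, reference=None):
    """One-hot encode `values`, leaving out one reference category.

    Returns the kept category names and one 0/1 row per observation.
    """
    categories = sorted(set(values))
    if reference is None:
        reference = categories[0]  # arbitrary default, for illustration only
    kept = [c for c in categories if c != reference]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


# Hypothetical example data
member = ["true", "false", "true"]
colour = ["red", "blue", "green"]

member_01 = [to_binary(v) for v in member]
kept, colour_dummies = dummy_code(colour, reference="blue")

print(member_01)       # [1, 0, 1]
print(kept)            # ['green', 'red']
print(colour_dummies)  # [[0, 1], [0, 0], [1, 0]]
```

The row for "blue" is all zeros, which is how the reference category shows up after one dummy column has been removed.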

At this point, all your variables are numerical, either continuous or dummy (0/1). Dummy coded variables do not require any normalisation. Continuous ones should be z-score normalised if you don't want extreme observations to be flattened, and min-max normalised if you want to bring extreme observations closer to the rest of the observations.
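For reference, the two normalisations mentioned above can be sketched in a few lines of Python. This is only an illustration: the `income` column is made up, and the z-score here uses the population standard deviation, which is an assumption (a given tool may use the sample standard deviation instead).

```python
from statistics import mean, pstdev


def z_score(xs):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def min_max(xs):
    """Rescale linearly to the interval [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


# Hypothetical continuous column with one extreme observation
income = [20.0, 22.0, 25.0, 27.0, 200.0]

income_z = z_score(income)
income_01 = min_max(income)

print([round(v, 2) for v in income_z])
print([round(v, 2) for v in income_01])
```

Min-max keeps every value in [0, 1], the same range as the dummy columns, whereas z-scores are unbounded, so an extreme observation can still end up several units away from the rest.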

If you run into the curse of dimensionality, you can try reducing the number of variables, e.g. by applying a PCA transformation to each group of related dummy coded variables, i.e. those which stem from the same ordered categorical variable. If you are lucky, the dummy coded variables will be roughly ordered along the first principal component. Instead of the many related dummy variables, keep the 1st PC. That's how you transform a bunch of related dummy variables (coming from an initially ordered categorical variable) into a single continuous variable. Interpretation will also be straightforward if you are lucky with the "roughly ordered" part.
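A minimal sketch of this idea in Python with NumPy, assuming a made-up ordered variable with levels low < medium < high. PCA is done here via an SVD of the centred dummy matrix rather than with the KNIME PCA node, and the sign of a principal component is arbitrary, so only the relative positions of the levels matter.

```python
import numpy as np

# Hypothetical ordered categorical variable (low < medium < high),
# dummy coded with "low" as the dropped reference category.
levels = ["low", "medium", "high", "medium", "low", "high", "high", "medium"]
dummies = np.array([[lv == "medium", lv == "high"] for lv in levels], dtype=float)

# PCA via SVD of the column-centred matrix.
centred = dummies - dummies.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centred, full_matrices=False)

pc1 = centred @ vt[0]  # scores on the first principal component
explained = singular_values**2 / np.sum(singular_values**2)

# Mean PC1 score per level: check whether they follow the original order.
level_array = np.array(levels)
level_means = {
    lv: float(pc1[level_array == lv].mean()) for lv in ["low", "medium", "high"]
}
for lv, score in level_means.items():
    print(lv, round(score, 3))
print("share of variance on PC1:", round(float(explained[0]), 3))
```

With these particular counts, "low" scores about 0 and sits between "medium" and "high", so PC1 does not recover the order even though it carries most of the variance. That is the unlucky case, where keeping the component would be hard to justify.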

iiiaaa · March 2, 2017, 9:32am

Hi Geo,

Thank you very much for your answer.

Which of these options would you suggest?

1) Apply several PCA transformations (telling the PCA node that I want 1 dimension): one transformation for each group of related dummy coded variables (i.e. coming from an initially ordered categorical variable). Then I provide K-means with the normalized continuous numerical variables and all the PCs derived from the different PCA transformations applied to the related dummy coded variables.

2) Apply one PCA transformation (telling the PCA node that I want 2 or more dimensions) on all the related dummy coded variables at the same time (i.e. coming from initially ordered categorical variables). Then I provide K-means with the normalized continuous numerical variables and the PCs derived from the single PCA transformation applied to all the related dummy coded variables at the same time.

3) Apply one PCA transformation (telling the PCA node that I want 2 or more dimensions) on all the related dummy coded variables (i.e. coming from initially ordered categorical variables) and the numerical variables at the same time. Then I provide K-means with the normalized continuous numerical variables and the PCs derived from the single PCA transformation applied to all the related dummy coded variables and the continuous numerical variables at the same time.

What do you think about this article, which explains that it is not possible to use K-means on binary variables?

http://www-01.ibm.com/support/docview.wss?uid=swg21477401

Thanks in advance.

iiiaaa · March 3, 2017, 12:01pm

Thanks

Geo · March 3, 2017, 1:56pm

Regarding the PCA transformation on binary variables:

apply one PCA transformation per group of related dummies, in other words as many PCA transformations as you have ordered categorical (not dummy) variables;
keep only the first PC, forget about the remaining PCs;
visualise the correlation plot for PC1 and PC2 to assess the meaningfulness of the variable created. If it is meaningless to you, there's no point in using it.

The main issue with clustering is really not the choice of algorithm, but how to interpret the clustering results. This requires both domain knowledge and a good understanding of the business problem. After that, you also need to know the consequences of your choices: which variables you select, which distance measure(s), whether or how to normalise, which clustering algorithm(s) to use, etc. There is no magic recipe.

Geo · March 3, 2017, 1:58pm

P.S.: K-means is indeed for quantitative variables only. Hierarchical clustering is an option though (as the article points out), and KNIME even allows you to aggregate several distance measures, so there is a lot of flexibility there.
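To illustrate the idea of aggregating distance measures before hierarchical clustering, here is a small sketch using SciPy instead of KNIME. The data, the choice of Euclidean plus Jaccard distance, and the 50/50 weighting are all assumptions made for the example, not a recommendation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical data: a numeric block and a binary block, one row per observation.
numeric = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 9.0], [8.3, 9.1]])
binary = np.array([[1, 0, 1], [1, 0, 1], [0, 1, 0], [0, 1, 1]], dtype=bool)

# Z-score the numeric block so that no column dominates the Euclidean distance.
numeric_z = (numeric - numeric.mean(axis=0)) / numeric.std(axis=0)

# One condensed distance vector per block.
d_numeric = pdist(numeric_z, metric="euclidean")
d_binary = pdist(binary, metric="jaccard")

# Rescale the Euclidean part to [0, 1] and take an equally weighted average.
# The weights are an arbitrary choice for illustration.
d_combined = 0.5 * d_numeric / d_numeric.max() + 0.5 * d_binary

# Average-linkage hierarchical clustering on the combined distances,
# then cut the tree into two clusters.
tree = linkage(d_combined, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

Because the clustering only sees the combined distances, the weights given to each block directly decide how much the binary variables influence the result.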

iiiaaa · March 7, 2017, 12:16pm

Ok thanks

sguarny · March 21, 2017, 7:26pm

Hi Geo,

I have the same "problem" as iiiaaa: clustering with binary, categorical and numerical variables.

I found your solution very clear except for 1 step: "For categorical variables, apply One To Many, then with Column Filter, delete one dummy variable to serve as reference category".

I understand the One To Many step, which creates "a lot" of dummy variables, but I don't understand the Column Filter step to maintain only one of these.

Could you explain this step in detail?

Thanks a lot

Geo · March 21, 2017, 10:21pm

You maintain all of them, minus one binary variable per categorical variable. If all binary variables for any given categorical variable are false at the same time, that's the reference category. You can remove any one of the binary variables, but usually one removes the one with the most observations. It's just the same as you would do in linear regression ...
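A short Python sketch may help show why one dummy can be dropped without losing information. The `colour` column is made up, and picking the most frequent category as the reference follows the rule of thumb above.

```python
from collections import Counter

# Hypothetical categorical column
colour = ["red", "blue", "red", "green", "red", "blue"]

# Rule of thumb from the post above: the most frequent category is the reference.
reference = Counter(colour).most_common(1)[0][0]
all_categories = sorted(set(colour))
kept = [c for c in all_categories if c != reference]

full = [[int(v == c) for c in all_categories] for v in colour]
reduced = [[int(v == c) for c in kept] for v in colour]

# In the full coding every row sums to 1, so any one column is redundant:
# it can be recomputed as 1 minus the sum of the others.
rows_sum_to_one = all(sum(row) == 1 for row in full)
recovered = [1 - sum(row) for row in reduced]
reference_column = [int(v == reference) for v in colour]

print(reference)                      # red
print(kept)                           # ['blue', 'green']
print(rows_sum_to_one)                # True
print(recovered == reference_column)  # True
```

Rows where every kept dummy is 0 are exactly the observations in the reference category, which is the same convention used for dummy variables in linear regression.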

sguarny · March 22, 2017, 8:04am

Ok,

now it's all clear.

Thanks a lot