Can Boolean or discretized values be used for clustering?

Hello all, I need to build a clustering solution for the dataset I uploaded, and I have a question. I have to use partitional, hierarchical, and density-based clustering, and I don’t know whether I can use discretized variables, because I think these methods need continuous variables. The dataset has some boolean variables and some discretized ones, and I am not sure whether I need to discard them. Any help? Thanks
prestamo.xls (583 KB)

Hi @jma00049,
Welcome to the KNIME Forum! You can use numeric distance measures with boolean variables if they are encoded as 0 and 1. You should also min-max normalize the numeric features so that all features lie on the same [0, 1] scale and carry comparable weight in the distance calculation.
Be careful with the zip code, though. If you keep it as a numeric feature, you impose an order on zip codes that does not make much sense: after normalization, your lowest zip code would become 0.0 and the highest 1.0. Numerically, Anchorage, Alaska (zip code 99501), would be 5298 “zips” away from Sacramento, California (zip code 94203), while Sacramento would be 9202 “zips” away from Phoenix, Arizona (zip code 85001). If you look at a map, that does not make much sense. It may be better to exclude the zip code, or to keep only its first 1-2 digits, treat that as a nominal value, and one-hot encode it. Of course, it also depends on what you want to achieve with your clustering in the first place.
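Outside of KNIME, the same preprocessing could look roughly like the sketch below. It is only an illustration in plain Python: the rows and the column names (`income`, `has_mortgage`, `zip`) are made up and are not taken from prestamo.xls, and the helpers `min_max` and `one_hot` are hypothetical names written for this example.

```python
# Hypothetical rows; column names are illustrative, not taken from prestamo.xls.
rows = [
    {"income": 49.0, "has_mortgage": True, "zip": "99501"},
    {"income": 34.0, "has_mortgage": False, "zip": "94203"},
    {"income": 11.0, "has_mortgage": True, "zip": "85001"},
]


def min_max(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    """Return (categories, vectors) with one 0/1 column per distinct label."""
    categories = sorted(set(labels))
    vectors = [[1.0 if lab == c else 0.0 for c in categories] for lab in labels]
    return categories, vectors


# Numeric feature: min-max normalized.
income = min_max([r["income"] for r in rows])
# Boolean feature: encoded as 0/1, so it is already in [0, 1].
mortgage = [1.0 if r["has_mortgage"] else 0.0 for r in rows]
# Zip code: keep only the first two digits and treat them as a nominal value.
zip_categories, zip_vectors = one_hot([r["zip"][:2] for r in rows])

# One feature vector per row, ready for a numeric distance measure.
features = [[inc, mort] + zv for inc, mort, zv in zip(income, mortgage, zip_vectors)]
for row in features:
    print(row)
```

With this encoding, two rows with different zip prefixes are always equally far apart on the zip columns, so no artificial ordering between regions is introduced.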
Kind regards,
Alexander

Hi @AlexanderFillbrunn and thanks for helping!
I see what you mean about the zip codes, and I think I will exclude them. As for the boolean variables, I will min-max normalize all the numeric features so I can keep them.
Regards,
Jesús.
