Include High Cardinality Variables as supervised ratio

Hi everyone,

I’m trying to include high cardinality features such as postal code (I have 75k observations so they might have some predictive value) in my data modelling. I was wondering if it was possible in KNIME to transform the high cardinality feature into a numerical one, into for example the supervised ratio (the proportion of that value which belongs to the positive class). Should this be possible, is there also a way to impute these supervised ratio’s into new unclassified data, based on the training set?

Kind regards,

Hi @jeandony,
please have a look at the attached example. Here I use GroupBy, Math Formula, and Joiner to calculate the ratios. Have you thought about trimming the ZIP codes at the end? The last digits might not have such a high predictive value as the first and you would end up with considerably fewer values in total.
Kind regards
Alexander

Supervised Ratio.knwf (15.1 KB)

1 Like

Hi @AlexanderFillbrunn,

Thanks for your response! Trimming would work for the ZIP-codes, however there are also variables like company code for which trimming isn’t a solution.
Is there a way to also include the ZIP-codes which don’t have any positive instances in the dataset (so they would get a ratio of 0)? Because the final table in the workflow only presents the ZIP-codes which have atleast one positive instance. Moreover, is there a way to impute these final results into data which isn’t classified yet? In order that they would get the ratio’s of the training data.

Kind regards,

Hi,
oh yeah, that was a mistake in my workflow. In the final join, just don’t use an inner join but a right outer join followed by a Missing Value node that replaces missing values by 0. For imputing values in test data, just use another join on ZIP code.
Kind regards
Alexander

Hi @AlexanderFillbrunn,

Is there a node that can perform the trimming of the ZIP-codes? So every value that starts with 75… should be transformed to 75 etc.

Kind regards,

Hi,
Yes, this is easily possible using the String Manipulation node. If your zip code is stored as a number, do this:

toInt(substr(string($zipcode$), 0, 2))

If you store it as text, use this:

substr($zipcode$, 0, 2)

The substr function cuts a part out of a string and returns that. So here I am telling the node to cut from index 0 (the beginning of the zip code) and take 2 characters.
Kind regards
Alexander

2 Likes

Hi @AlexanderFillbrunn,

Thanks a lot! I really love how straightforward the software is. Keep up the good work!

Kind regards,

3 Likes

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.