Imputation using kNN and categorical variables


I’m working with a dataset with some missing values (~28% of total observations), and I was trying to deal with them using kNN imputation.

I went through the documentation on K Nearest Neighbor – KNIME Hub and, from what I understood, it dismisses categorical variables as predictors.

Since in my case, specifically, I need to consider categorical variables, does anyone know of a node where they are considered (using Hamming distance, for instance)?

Thanks, anyway!

Hello @beatriz1490
I understand that your topic is about categorical variables (qualitative ordinal / non-ordinal) as dependent variables? That is a complex subject by itself.

Can you describe a bit more how your data is structured? I mean, most ML libraries (Keras included) deal with categorical independent variables by using dummy variables (with all the related implications).

Are your variables qualitative ordinal?

I guess your data has some continuous variables as well; otherwise, why think about k-NN? Then, can you rely on them (the continuous ones) for missing-value imputation?

And trying to answer your request: I don’t think there’s a single node doing this job; it will probably require some iterations. That’s why understanding the data is important, to design a logical approach before arranging the nodes.



How do you want to impute? Why kNN? Why not use the Missing Value node with the most frequent, the previous, or some constant value?


Hi there @beatriz1490

What you can do is some sort of “encoding” of your categorical variables, and then apply kNN for the missing-value imputation. I understand that you want to apply a sophisticated heuristic for this task (which makes sense; in the case of continuous variables, you can apply a regression for missing-value imputation) but, as @gonhaddock mentioned, there is not a single node for this task.

In the simplest form, that would be to transform your categorical variables into a numeric encoding.

For instance, if you have an attribute with 3 different levels or values, you replace those values with a number, so that the kNN algorithm doesn’t “ignore” those attributes:

  1. “Red” → 0
  2. “Blue” → 1
  3. “Green” → 2
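To make the idea concrete, here is a sketch of the encode-then-impute approach outside KNIME, using Python with scikit-learn’s `KNNImputer` (the column names, data, and mapping below are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "color":  ["Red", "Blue", None, "Green", "Blue", "Red"],
    "height": [1.2,   0.8,    1.1,  0.9,     np.nan, 1.3],
})

# Map each category to a number; None becomes NaN, which the imputer
# treats as missing
mapping = {"Red": 0, "Blue": 1, "Green": 2}
df["color_num"] = df["color"].map(mapping)

# Fill each gap with the mean of its 2 nearest neighbours (distance is
# computed over the non-missing columns)
imputer = KNNImputer(n_neighbors=2)
imputed = imputer.fit_transform(df[["color_num", "height"]])

# Round the imputed code back to the nearest valid category
inverse = {v: k for k, v in mapping.items()}
df["color_imputed"] = [inverse[int(round(v))] for v in imputed[:, 0]]
print(df[["color", "color_imputed"]])
```

One caveat with this encoding: the mean of two codes (e.g. Red = 0 and Green = 2) can land on an unrelated category (Blue = 1), because the integer codes impose an ordering the categories don’t actually have.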

Also, what @Daniel_Weikert said is applicable and simpler: using the most frequent value. Together with deeper knowledge of the business or the data set, you can manually fill in the blanks based on that input. I’ve been there quite a few times; missing categorical variables can be harsh.

Try not to overcomplicate yourself! :slightly_smiling_face:
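For completeness, the most-frequent-value route (which KNIME’s Missing Value node offers out of the box) boils down to a one-liner; a pandas sketch with invented data:

```python
import pandas as pd

s = pd.Series(["A", "B", None, "A", None, "A"])

# mode() ignores missing values, so the fill value is the most frequent
# observed category
filled = s.fillna(s.mode().iloc[0])
print(filled.tolist())  # ['A', 'B', 'A', 'A', 'A', 'A']
```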

I want to impute because the most frequent value (28 observations) is actually the missing value. The second most frequent has 9 observations. So if I were to fill all 28 missing values with the most frequent one, it would produce unfaithful data, I think.

The question would then be whether it is better to simply drop the column instead of imputing it.

Hello @beatriz1490 , any progress with this challenge? This topic is turning into more than a simple KNIME question.

We cannot evaluate the necessity of this imputation with the information released so far. I am guessing it is because you don’t have a large data set and you need all of it; which is just fine. The problem of a short data set with ML algorithms is dealing with the uncertainty when evaluating prediction performance.

You didn’t give us even a brief, minimal description of your data set. Working with classification ML usually requires some continuous or ordinal sense in the data, as many algorithms rely on Euclidean distance (k-NN does). And this is also why simple encoding as a factor just won’t work: arbitrary factor codes won’t produce meaningful distances.
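For what it’s worth, the Hamming distance mentioned in the original question sidesteps exactly this problem: it just counts mismatching attributes, so no ordering is imposed on the codes. A from-scratch sketch (all data and names invented; this is not an existing KNIME node):

```python
# Minimal kNN imputation over purely categorical rows, with Hamming
# distance and a majority vote among the k nearest donors.
rows = [
    ["Red",  "S", "Yes"],
    ["Red",  "S", "Yes"],
    ["Blue", "L", "No"],
    ["Blue", "L", None],   # value to impute
    ["Blue", "M", "No"],
]

def hamming(a, b):
    # Count mismatches over the attributes both rows have observed
    return sum(x != y for x, y in zip(a, b) if x is not None and y is not None)

def impute_knn_hamming(rows, row_idx, col_idx, k=2):
    target = rows[row_idx]
    # Donors: other rows where the column to impute is observed
    donors = [r for i, r in enumerate(rows) if i != row_idx and r[col_idx] is not None]
    donors.sort(key=lambda r: hamming(target, r))
    votes = [r[col_idx] for r in donors[:k]]
    return max(set(votes), key=votes.count)  # majority vote

print(impute_knn_hamming(rows, 3, 2))  # the two closest donors both say "No"
```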

What I would suggest (provided you have at least one ‘continuous’ or ‘qualitative ordinal’ independent variable, like the answer in a survey or a task priority): you can design a fuzzy-logic approach based on logistic regression.
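One way to read that suggestion in code (my interpretation, not an existing node): treat the column with missing values as the target of a logistic regression on the complete columns, and use the predicted class, or the class probabilities, for the blanks. A scikit-learn sketch with invented data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two numeric predictors; the last row's category is missing
X = np.array([[1.0, 0.2], [1.1, 0.1], [5.0, 3.9], [5.2, 4.1], [5.1, 4.0]])
y = np.array(["A", "A", "B", "B", None], dtype=object)

# Fit on the rows where the category is observed
observed = np.array([v is not None for v in y])
clf = LogisticRegression().fit(X[observed], y[observed].astype(str))

# Fill the blanks with the predicted class; clf.predict_proba would give
# the "fuzzy" degree of membership instead of a hard label
y[~observed] = clf.predict(X[~observed])
print(y)
```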

This brainstorm is giving me some ideas for my pending JustKnimeIt Challenge 40.



This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.