Replace missing with median conditioned on value of different column

I have housing data that includes neighborhood and lot area frontage. I want to replace missing values of lot area frontage with the median value for that neighborhood. I can use a GroupBy node to group by neighborhood and aggregate by median of lot area frontage. That produces a table of neighborhoods and median lot area frontage for each neighborhood.

I could do what I want to do with an R script (but, then, what’s the point of messing around with Knime to begin with?); I could also use a Rule Engine node but I would have to manually code the median value for each neighborhood (rather than just “feed in” my data table) - and then it would no longer work if I got data that contained information on a different neighborhood.

I want to “feed” the GroupBy data table to one or more Knime nodes to check if the particular row has a value of neighborhood that is also in the GroupBy data table. If it finds that neighborhood then it replaces the missing lot frontage area with the median value for that neighborhood.

I can come close to what I want using the Cell Replacer node. The problem there is that it assumes you want to replace the “looked up value” (in this case, the neighborhood). I don’t want to replace that value. I want to find that value, and then replace a different column value (the missing lot frontage area). How can I do that using one or more Knime nodes designed for the purpose?

Any guidance is appreciated.

Hi @doloop

When using a Group Loop you can impute the missing values with the median of lot area frontage for every neighborhood.
Schermafdruk van 2021-01-24 15-05-50
gr. Hans

3 Likes

I don’t see any way to edit my post, so I’ll reply to it (and replace previous “reply”).

I’ve learned that I can also manually code the missing value of LotFrontage using median neighborhood values using the Column Expressions node, but it suffers the same issue as does the Rule Engine, viz., the solution is brittle and will break if new data is used with a previously non-existent neighborhood.

Hello @doloop,

with Cell Replacer you can also add new column with matched values. Then you can use Column Merger node to deal with missing values or if need for more complex logic use Rule Engine node. If nothing works I suggest to share some (dummy) input and desired output and explanation what can differ in new input/iteration. Preferable way is with working workflow :wink:

Br,
Ivan

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.