Replace missing data baed on mean of alternative column

grabrail · July 18, 2017, 12:45pm

Hi

I'm very very new to Knime and just about manageing to get my head around it, and so far its ease of use is great.

I am using the Titanic Kaggle competition as my data set to learn, and have come to a point where I'm trying to do my method for replacing missing age data.

I have a colum called New.title, which contains the values Mr, Master, Mrs and Miss

what I want to do, is where there is an missing cell in age, I want it to look at the new.title cell and determine what its value is, if it is Master, then I want the mean of all other rows where the new.title = Master to be placed as the missing value. e.g.

row id	age	new.title
1	17	Master
2	2	Master
3	56	Mrs
4	?	Master

The mean of all the masters, (17 and 2) is 9.2 so I want row 4 age to be replaced by 9.2

I have managed to do this in a very long winded way (CSV Reader -> Row Splitter (based on new.title = master) -> csv riter -> Row Splitter (based on new.title = Mr) -> CSV Writer -> Row Splitter (based on new.title) = miss -> CSV writer -> Row Splitter (based on new.title = Mrs.) -> CSV Writer

This gives me 4 csvs all split by new.title, i can then do the mean calculation and replace missing data and then import the 4 csvs into one data set. Just seems a little long winded to me.

I believe I should be able to achieve this a better way, maybe using the Rule Engine? had a look at this but not sure on the Syntax

Any thoughts out there please on how I can do this more efficiently?

TIA
Jade

adaptagis · July 18, 2017, 2:34pm

hi there

I would use the group by node to group by "new.title" and getting the mean for "age".
having theses numbers, got to the main table and use the rule engine node and set it like:
$new.title$ LIKE "master" AND MISSING $age$ => 888

$new.title$ LIKE "mrs" AND MISSING $age$ => 888

$new.title$ LIKE "XYZ" AND MISSING $age$ => 888

where 888 is your dedicated mean that you got from the group by node..

NOTE you have to set the rule engine node to replace column age

good luck!