Hi
I'm very very new to Knime and just about manageing to get my head around it, and so far its ease of use is great.
I am using the Titanic Kaggle competition as my data set to learn, and have come to a point where I'm trying to do my method for replacing missing age data.
I have a colum called New.title, which contains the values Mr, Master, Mrs and Miss
what I want to do, is where there is an missing cell in age, I want it to look at the new.title cell and determine what its value is, if it is Master, then I want the mean of all other rows where the new.title = Master to be placed as the missing value. e.g.
row id | age | new.title |
1 | 17 | Master |
2 | 2 | Master |
3 | 56 | Mrs |
4 | ? | Master |
The mean of all the masters, (17 and 2) is 9.2 so I want row 4 age to be replaced by 9.2
I have managed to do this in a very long winded way (CSV Reader -> Row Splitter (based on new.title = master) -> csv riter -> Row Splitter (based on new.title = Mr) -> CSV Writer -> Row Splitter (based on new.title) = miss -> CSV writer -> Row Splitter (based on new.title = Mrs.) -> CSV Writer
This gives me 4 csvs all split by new.title, i can then do the mean calculation and replace missing data and then import the 4 csvs into one data set. Just seems a little long winded to me.
I believe I should be able to achieve this a better way, maybe using the Rule Engine? had a look at this but not sure on the Syntax
Any thoughts out there please on how I can do this more efficiently?
TIA
Jade