Trying to find a smart way to handle missing values

Hello!

I am trying to find a good way to handle missing values.

The easiest way is to use the Missing Value node with specific rules for each variable.

I usually replace missing values with the median of each variable.

I have a few questions:

a) I would like to assign those values according to a normal distribution, especially if I can define the parameters of that distribution myself.

b) Even better, I would like a way to automatically find the distribution that most closely matches the distribution of a variable, and then replace missing values by drawing from that distribution.

c) Approaching the problem from a different angle: is it possible to replace a missing value based on the values of the other variables in the same row? Example:

          Var1   Var2   Var3
row1      0      2      9
row2      3      4      4
row3      0      3      ?
row4      2      3      3
row5      0      1      6

In this example, the missing value would be replaced by "9" because the other values in row3 are most similar to those in row1.

d) Do you think it would be interesting to combine b) and c) for replacing missing values? I.e., first check whether there is a similar row as in c), and if there is none, fall back to the distribution from b)? I guess the answer is no, but you never know :)

Thanks anyway for your help!

Bernard

btw: Knime is awesome!

Hi Bernard, 

There is a set of distribution assigner nodes (including a Gaussian one), which you could use with the Empty Table Creator to create "normal" values on the fly. This would require a bit of fiddling with flow variables, but I can see it working in the end. You could even derive the mean and standard deviation from the group of interest using a Statistics node along with Table Row to Variable. A more rigorous approach would be to actually fit the distribution using R:

http://stat.ethz.ch/R-manual/R-patched/library/MASS/html/fitdistr.html
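
For illustration, the first idea above (take the mean and standard deviation from the observed values, then draw Gaussian values for the gaps) could look roughly like the Python sketch below. This is only a sketch under assumptions: it is plain Python rather than KNIME nodes or R, the function name impute_gaussian and the sample column are made up for the example, and it does not check whether the variable is actually close to normal.

```python
import random
import statistics


def impute_gaussian(values, rng=None):
    """Replace None entries with draws from a normal distribution whose
    mean and standard deviation are estimated from the observed entries.

    Hypothetical helper for illustration; needs at least two observed values.
    """
    if rng is None:
        rng = random.Random()
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)  # sample standard deviation
    return [rng.gauss(mu, sigma) if v is None else v for v in values]


# Example column with one missing value (None); the seed only makes runs repeatable.
column = [9, 4, None, 3, 6]
filled = impute_gaussian(column, random.Random(42))
print(filled)
```

If you want to set the parameters yourself, as in question a), you would pass your own mu and sigma to rng.gauss instead of estimating them.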

As for your row-based approach in c), you could try using a Similarity Search node to find the closest match using some appropriate distance measure, and then use that match to replace your original data point. To do this, you would probably need something like: Chunk Loop Start > Similarity Search > Cell Replacer > Loop End.
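
To make the logic of that loop concrete, here is a small Python sketch of the same idea outside KNIME, using the five-row example from the question. The assumptions are mine: Euclidean distance on the other columns, no missing values in those other columns, and a hypothetical max_distance parameter that falls back to the column median when no row is similar enough (one possible reading of question d).

```python
import math
import statistics


def impute_from_nearest_row(rows, target_col, max_distance=None):
    """Fill None in target_col with the value from the most similar complete row.

    Similarity is Euclidean distance over all other columns (an assumption).
    If max_distance is given and the nearest row is farther away than that,
    the median of the observed target values is used instead.
    """
    complete = [r for r in rows if r[target_col] is not None]
    fallback = statistics.median([r[target_col] for r in complete])

    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2
                             for i, (x, y) in enumerate(zip(a, b))
                             if i != target_col))

    result = []
    for row in rows:
        new_row = list(row)
        if row[target_col] is None:
            nearest = min(complete, key=lambda other: distance(row, other))
            too_far = (max_distance is not None
                       and distance(row, nearest) > max_distance)
            new_row[target_col] = fallback if too_far else nearest[target_col]
        result.append(new_row)
    return result


# The table from the question; None marks the missing Var3 in row3.
table = [
    [0, 2, 9],     # row1
    [3, 4, 4],     # row2
    [0, 3, None],  # row3
    [2, 3, 3],     # row4
    [0, 1, 6],     # row5
]
filled_table = impute_from_nearest_row(table, target_col=2)
print(filled_table[2])
```

With this table, row1 is the closest row to row3 (distance 1 on Var1 and Var2), so the gap is filled with 9, as in the question. In practice you would probably want to normalize the columns first so that one variable does not dominate the distance.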

It should be easy to estimate an empirical kernel density and draw values from that. I'd probably use an R Snippet node.

e.g.: http://stats.stackexchange.com/questions/82797/how-to-draw-random-samples-from-a-non-parametric-estimated-distribution
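
As a rough illustration of drawing from a kernel density estimate (in Python here rather than an R snippet): sampling from a Gaussian KDE is the same as picking one observed value at random and adding Gaussian noise whose standard deviation is the bandwidth. The function name sample_from_kde is made up, and the bandwidth uses the common normal-reference rule of thumb (1.06 * sd * n^(-1/5)), which is my assumption and may not suit skewed or multimodal data.

```python
import random
import statistics


def sample_from_kde(observed, n, rng=None):
    """Draw n values from a Gaussian kernel density estimate of `observed`.

    Each draw is a randomly chosen observed value plus N(0, bandwidth) noise.
    The bandwidth follows the normal-reference rule of thumb (an assumption).
    """
    if rng is None:
        rng = random.Random()
    sigma = statistics.stdev(observed)
    bandwidth = 1.06 * sigma * len(observed) ** (-1 / 5)
    return [rng.choice(observed) + rng.gauss(0.0, bandwidth) for _ in range(n)]


# Observed (non-missing) values of one variable; draw one value per gap.
observed_values = [9, 4, 3, 6]
draws = sample_from_kde(observed_values, 3, random.Random(0))
print(draws)
```

Unlike the fitted normal distribution, this keeps the shape of the observed data, so it also covers part of question b) without having to pick a distribution family.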

Cheers,

Steve.

Thank you both!

I will try those 3 options.

Also, a friend recently pointed me to the R package "Amelia".  It looks to be quite useful, and can be used with our R integration. 

http://cran.r-project.org/web/packages/Amelia/vignettes/amelia.pdf

It looks like that's what I need!

Thank you Aaron!