Using Amelia to deal with missing values

Hi there,

I’ve read the example https://www.knime.com/blog/how-to-deal-with-missing-values
for dealing with missing values with the R Amelia package.
I’m wondering whether I can also apply this package within a cross-validation, to the training and test data at the same time?
And how can I use the results further in KNIME, e.g. with a Decision Tree classifier?

Thanks.

Best Regards,
N.

Does anyone have an idea about that?

I used R’s Amelia to impute a few missing values. I have attached a sample workflow. The steps basically are:

  • you have a flat file with a variable “Target” 0/1 and string as well as numeric variables
  • you decide which variables need imputation (the blue frames)
  • you create an artificial ID variable
  • you send the data you want to impute to R and Amelia and run 10 iterations
  • you read back the resulting CSV file into KNIME and take the mean value as the newly imputed value
  • you bring back together the original values (and strings) and the imputed ones
  • you deal with the remaining missing values (if you did not impute them)
  • you split your data into two 70/30 partitions for model training
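To illustrate the "take the mean value" and "bring back together" steps outside of KNIME, here is a minimal Python sketch. It does not call Amelia itself. The column names, the ID values and the three completed datasets are made up for illustration (the workflow uses ten runs), and `pool_by_mean` is a hypothetical helper name, not something from the workflow.

```python
from statistics import mean

# Original rows keyed by the artificial ID; None marks a missing value.
# (Illustrative data, not taken from the attached workflow.)
original = {
    1: {"age": 34.0, "income": None},
    2: {"age": None, "income": 52000.0},
    3: {"age": 41.0, "income": 48000.0},
}

# Hypothetical completed datasets, as they might look after reading the
# imputation output back in. Observed values are identical in every run;
# only the imputed cells differ.
imputations = [
    {1: {"age": 34.0, "income": 50000.0}, 2: {"age": 38.0, "income": 52000.0},
     3: {"age": 41.0, "income": 48000.0}},
    {1: {"age": 34.0, "income": 51000.0}, 2: {"age": 40.0, "income": 52000.0},
     3: {"age": 41.0, "income": 48000.0}},
    {1: {"age": 34.0, "income": 49000.0}, 2: {"age": 39.0, "income": 52000.0},
     3: {"age": 41.0, "income": 48000.0}},
]


def pool_by_mean(original, imputations):
    """Keep observed values; fill each missing cell with the mean over all runs."""
    pooled = {}
    for row_id, row in original.items():
        pooled[row_id] = {}
        for col, value in row.items():
            if value is None:
                # Average the imputed value for this cell across the runs.
                pooled[row_id][col] = mean(imp[row_id][col] for imp in imputations)
            else:
                pooled[row_id][col] = value
    return pooled


pooled = pool_by_mean(original, imputations)
for row_id, row in pooled.items():
    print(row_id, row)
```

The artificial ID is what makes the join safe: the imputed values are matched back to the original rows by key rather than by row order.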

I am sure there are more elegant ways now to determine the location of the CSV file (R’s temporary folder), but it did work back then.

Within the workflow there are several links to articles about Amelia that might help. I am not an expert in Amelia; my first goal was to get the workflow up and running.

An Excel file with before-and-after statistics will be created, which might help you decide whether Amelia did a good job of imputing the missing values.
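As a rough idea of what such a before/after comparison can look like, here is a small Python sketch for a single numeric column. The values and the `summarize` helper are assumptions for illustration; the actual statistics written by the workflow may differ.

```python
from statistics import mean, stdev


def summarize(values):
    """Basic statistics over the non-missing entries of one column."""
    observed = [v for v in values if v is not None]
    return {
        "n": len(observed),
        "missing": len(values) - len(observed),
        "mean": mean(observed),
        "sd": stdev(observed),
    }


# Illustrative column before and after imputation (None = missing).
before = [34.0, None, 41.0, 29.0, None, 52.0]
after = [34.0, 39.0, 41.0, 29.0, 40.0, 52.0]

stats_before = summarize(before)
stats_after = summarize(after)

for name in ("n", "missing", "mean", "sd"):
    print(f"{name:8s} before={stats_before[name]:.2f} after={stats_after[name]:.2f}")
```

If the mean shifts a lot, or the standard deviation drops sharply, that can be a hint that the imputed values do not fit the observed distribution well and the variable deserves a closer look.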

Please note this is only a 500-line sample. If you have larger files, you might need more power on your machine, or you might have to set up a loop node (not the most elegant way to use R in KNIME, but it does work).

Graphics are also created. You might have to toy around with them, or limit the variables you are testing, to be able to interpret them in a meaningful way.

kn_example_amelia.knar (2.5 MB)

Thanks… your reply is really helpful for me!

best regards,
N.

One more question…
Instead of imputing the missing values before splitting the dataset, is it also possible to use the Amelia package to impute the missing values directly on the training data within a cross-validation, and then apply that imputation to the test data?

I am not sure if that is possible but I see your point. I was not able to find a method to sort of ‘save’ the model. I think for Amelia one way would be to just do the imputation on both files.
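For comparison, this is what "fit on the training data, apply to the test data" looks like with an imputation method that can be saved. The Python sketch below uses plain mean imputation as a stand-in, not Amelia, because I do not know of a way to store an Amelia model. The function names, the column `x` and the data are all made up for illustration.

```python
from statistics import mean


def fit_imputer(train_rows, columns):
    """Learn one fill value per column from the training rows only."""
    return {
        col: mean(row[col] for row in train_rows if row[col] is not None)
        for col in columns
    }


def apply_imputer(rows, fill_values):
    """Fill missing cells using fill values learned elsewhere."""
    imputed = []
    for row in rows:
        new_row = dict(row)
        for col, fill in fill_values.items():
            if new_row[col] is None:
                new_row[col] = fill
        imputed.append(new_row)
    return imputed


def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for a simple interleaved k-fold split."""
    for start in range(k):
        test_idx = list(range(start, n_rows, k))
        test_set = set(test_idx)
        train_idx = [i for i in range(n_rows) if i not in test_set]
        yield train_idx, test_idx


# Illustrative data: one numeric column with gaps plus the 0/1 target.
data = [
    {"x": 1.0, "target": 0},
    {"x": None, "target": 1},
    {"x": 3.0, "target": 0},
    {"x": 5.0, "target": 1},
    {"x": None, "target": 0},
    {"x": 7.0, "target": 1},
]

results = []
for train_idx, test_idx in k_fold_indices(len(data), k=3):
    train = [data[i] for i in train_idx]
    test = [data[i] for i in test_idx]
    fill = fit_imputer(train, ["x"])       # learned from the training fold only
    train_imp = apply_imputer(train, fill)
    test_imp = apply_imputer(test, fill)   # test fold reuses the training values
    results.append((fill, train_imp, test_imp))
    # A classifier would be trained on train_imp and scored on test_imp here.
```

The point of the pattern is that nothing from the test fold influences the fill values, so the cross-validation score is not flattered by information leaking from the test data.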

I cited these blog posts, which cover several imputation methods. Maybe there is a way to save a model. As far as I remember from about two years ago, it was not so easy.

Thanks, I’ll have a try.
