I’ve read the example at https://www.knime.com/blog/how-to-deal-with-missing-values
on dealing with missing values with the R Amelia package.
I’m wondering whether I can also apply this package within a cross validation, to the training and test data at the same time.
And how can I use the results further in KNIME, e.g. with a Decision Tree classifier?
I used R’s Amelia to impute a few missing values. I have attached a sample workflow. The steps basically are:
you have a flat file with a binary variable “Target” (0/1) plus string and numeric variables
you decide which variables need imputation (the blue frames)
you create an artificial ID variable
you send the data you want to impute to R and Amelia and run 10 iterations
you read back the resulting CSV file into KNIME and take the mean value as the newly imputed value
you bring back together the original values (and strings) and the imputed ones
you deal with the remaining missing values (if you did not impute them)
you split your data into two 70/30 partitions for model training
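The averaging and merge-back steps above can be sketched outside KNIME as follows. This is only a minimal illustration in plain Python, not the workflow itself: the function names, the column names ("income", "city") and the toy values are all made up, and the imputed runs are assumed to have already been read back from Amelia's output.

```python
from statistics import mean

def average_imputations(imputed_runs, id_key="ID"):
    """Average each imputed variable across all runs, per ID."""
    collected = {}
    for run in imputed_runs:
        for row in run:
            per_id = collected.setdefault(row[id_key], {})
            for name, value in row.items():
                if name != id_key:
                    per_id.setdefault(name, []).append(value)
    return {
        rid: {name: mean(values) for name, values in cols.items()}
        for rid, cols in collected.items()
    }

def merge_back(original_rows, averaged, id_key="ID"):
    """Keep observed values; fill only missing (None) cells from the averages."""
    merged = []
    for row in original_rows:
        filled = dict(row)
        for name, value in averaged.get(row[id_key], {}).items():
            if filled.get(name) is None:
                filled[name] = value
        merged.append(filled)
    return merged

# Hypothetical toy data: two rows, one missing numeric value.
original = [
    {"ID": 1, "Target": 0, "city": "A", "income": 100.0},
    {"ID": 2, "Target": 1, "city": "B", "income": None},
]
# Two imputation runs (the workflow uses 10); only ID + imputed columns.
runs = [
    [{"ID": 1, "income": 100.0}, {"ID": 2, "income": 80.0}],
    [{"ID": 1, "income": 100.0}, {"ID": 2, "income": 90.0}],
]
result = merge_back(original, average_imputations(runs))
```

The artificial ID is what makes the join safe: strings and untouched columns never go through the imputation, and observed values are never overwritten.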
I am sure there are more elegant ways now to determine the CSV file location (R’s temporary folder), but it did work back then.
Within the workflow there are several links to articles about Amelia that might help. I am not an expert in Amelia; my first goal was to get the workflow up and running.
An Excel file with before-and-after statistics will be created, which might help you decide whether Amelia did a good job at imputing the missing values.
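To give an idea of what such a before/after comparison looks at, here is a rough sketch in plain Python. It is not the statistics the workflow actually writes to Excel; the `summarize` helper and the numbers are invented for illustration.

```python
from statistics import mean, stdev

def summarize(values):
    """Count, missing count, mean and standard deviation of the observed values."""
    observed = [v for v in values if v is not None]
    return {
        "n": len(observed),
        "missing": len(values) - len(observed),
        "mean": mean(observed),
        # stdev needs at least two data points
        "sd": stdev(observed) if len(observed) > 1 else 0.0,
    }

# Hypothetical column before and after imputation (None = missing).
before = [10.0, None, 14.0, None, 12.0]
after = [10.0, 12.5, 14.0, 11.5, 12.0]

stats_before = summarize(before)
stats_after = summarize(after)
```

If the mean or the spread shifts a lot after imputation, that is a hint to look more closely at the variable.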
Please note this is only a 500-line sample. If you have larger files you might need more power on your machine, or you might have to set up a loop node (not the most elegant way to use R in KNIME, but it does work).
Some graphics are created as well; you might have to toy around with them or limit the variables you are testing to be able to interpret them in a meaningful way.
One more question…
Instead of imputing the missing values before splitting the dataset, is it also possible to use the Amelia package to impute the missing values directly on the training data within the cross validation, and then apply the result to the test data?
I am not sure if that is possible, but I see your point. I was not able to find a method to sort of ‘save’ the model. I think for Amelia one way would be to just run the imputation on both files.
I cited this blog, which covers several imputation methods. Maybe there is a way to save a model. As far as I remember from about two years ago, it was not so easy.
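Just to illustrate the general idea of learning the imputation on the training fold and reusing it on the test fold, here is a small sketch in plain Python. Note that this is not Amelia: it swaps in simple mean imputation, because a mean is trivial to 'save' and reapply. All names (`fit_means`, `apply_means`, `k_folds`) and the data are hypothetical.

```python
from statistics import mean

def fit_means(train_rows, columns):
    """Learn one fill value per column from the training fold only.
    Raises StatisticsError if a column has no observed value in the fold."""
    return {
        c: mean(r[c] for r in train_rows if r[c] is not None)
        for c in columns
    }

def apply_means(rows, fill):
    """Replace missing (None) cells with the previously learned fill values."""
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]

def k_folds(rows, k):
    """Yield (train, test) splits; row i lands in test fold i % k."""
    for fold in range(k):
        train = [r for i, r in enumerate(rows) if i % k != fold]
        test = [r for i, r in enumerate(rows) if i % k == fold]
        yield train, test

# Hypothetical toy data with one missing value.
rows = [
    {"Target": 0, "x": 1.0},
    {"Target": 1, "x": None},
    {"Target": 0, "x": 3.0},
    {"Target": 1, "x": 5.0},
]

folds = []
for train, test in k_folds(rows, 2):
    fill = fit_means(train, ["x"])  # learned on the training fold only
    folds.append((apply_means(train, fill), apply_means(test, fill)))
```

The point of the pattern is that nothing from the test fold influences the fill values. In KNIME the same structure would sit inside the cross-validation loop, with whatever imputation method can be fitted and reapplied taking the place of the mean.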