I used R’s Amelia to impute a few missing values. I have attached a sample workflow. The steps basically are:
- you have a flat file with a binary variable “Target” (0/1) plus string and numeric variables
- you decide which variables need imputation (the blue frames)
- you create an artificial ID variable
- you send the data you want to impute to R and Amelia and run 10 iterations
- you read back the resulting CSV file into KNIME and take the mean value as the newly imputed value
- you bring back together the original values (and strings) and the imputed ones
- you deal with the remaining missing values (if you did not impute them)
- you split your data into two 70/30 partitions for model training
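To make the "take the mean" and "bring back together" steps more concrete, here is a minimal sketch in plain Python of the logic as I understand it. This is not the code inside the workflow (there the work is done by R and KNIME nodes). It assumes each Amelia run yields one complete imputed dataset, and the names `row_id`, `income` and `city` are hypothetical placeholders for the artificial ID, a numeric variable and a string variable.

```python
from statistics import mean


def average_imputations(imputations, id_col, value_cols):
    """Average each numeric column across the imputed datasets, per row ID."""
    collected = {}
    for dataset in imputations:
        for row in dataset:
            per_id = collected.setdefault(row[id_col], {c: [] for c in value_cols})
            for c in value_cols:
                per_id[c].append(float(row[c]))
    return {
        rid: {c: mean(vals) for c, vals in cols.items()}
        for rid, cols in collected.items()
    }


def merge_back(original, averaged, id_col, value_cols):
    """Keep observed values and strings; fill only the gaps with the averages."""
    merged = []
    for row in original:
        new_row = dict(row)
        for c in value_cols:
            if new_row[c] is None:  # None marks a missing value in this sketch
                new_row[c] = averaged[row[id_col]][c]
        merged.append(new_row)
    return merged


# Hypothetical data: row 2 has a missing income.
original = [
    {"row_id": 1, "Target": 0, "city": "Bonn", "income": 40.0},
    {"row_id": 2, "Target": 1, "city": "Kiel", "income": None},
]

# Three imputed datasets (the workflow uses 10); observed values stay the same.
imputations = [
    [{"row_id": 1, "income": 40.0}, {"row_id": 2, "income": guess}]
    for guess in (50.0, 52.0, 54.0)
]

averaged = average_imputations(imputations, "row_id", ["income"])
merged = merge_back(original, averaged, "row_id", ["income"])
```

The artificial ID is what makes the merge safe: the imputed rows are matched back to the original rows by ID rather than by position, so the string columns never have to travel through R.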
I am sure there are more elegant ways now to determine the CSV file location (R’s temporary folder), but it did work back then.
Within the workflow there are several links to articles about Amelia that might help. I am not an expert in Amelia; my first goal was to get the workflow up and running.
An Excel file with before-and-after statistics will be created, which might help you decide whether Amelia did a good job of imputing the missing values.
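As a rough idea of what such a before/after comparison can look like, here is a small Python sketch for a single numeric variable. Which statistics the workflow actually writes to the Excel file may differ; the values and the choice of missing count, mean and standard deviation are assumptions for illustration only.

```python
from statistics import mean, stdev


def describe(values):
    """Summarise one numeric column; None marks a missing value."""
    observed = [v for v in values if v is not None]
    return {
        "n_missing": len(values) - len(observed),
        "mean": mean(observed),
        "stdev": stdev(observed),
    }


# Hypothetical column before and after imputation.
before = [40.0, None, 44.0, 48.0, None]
after = [40.0, 43.0, 44.0, 48.0, 45.0]

stats_before = describe(before)
stats_after = describe(after)

for name in ("n_missing", "mean", "stdev"):
    print(name, stats_before[name], stats_after[name])
```

If the mean stays roughly the same but the spread shrinks a lot after imputation, that can be a hint that the averaged imputations are pulling values toward the centre, which is worth checking against the graphics.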
Please note this is only a 500-line sample. If you have larger files, you might need more power on your machine or you might have to set up a loop node (not the most elegant way to use R in KNIME, but it does work).
Graphics are also created; you might have to toy around with them or limit the variables you are testing to be able to interpret them in a meaningful way.
kn_example_amelia.knar (2.5 MB)