Hi, I need some information. I have a set of clustered data with several missing values in different features. For each cluster, I want to calculate the mean/median of every feature and use it to replace the missing values of all instances in that cluster. Does anyone know a method for this?
Thank you for your answers.
Hi @m1k and welcome to the KNIME forum
I guess you have a table in which every row has a column holding its cluster number or label. If this is the case, you could use a -Group Loop Start- node (closed with a generic -Loop End- node), which selects the rows of a different cluster label at each iteration. Inside the loop you could use the -Missing Value- node, which can calculate the mean or the median of all the non-missing values of a column within a cluster, and this for all the columns.
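For readers who want to see the same idea outside KNIME, here is a minimal sketch in Python with pandas. It is only an illustration of the per-cluster logic, not the workflow itself: the function name `impute_by_cluster`, the column names and the toy data are all made up for the example.

```python
import numpy as np
import pandas as pd


def impute_by_cluster(df, cluster_col, strategy="mean"):
    """Fill missing numeric values with the mean or median of the row's own cluster.

    Hypothetical helper for illustration; `strategy` is "mean" or "median".
    """
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    out = df.copy()
    # Only numeric feature columns are imputed; the cluster label is left alone.
    feature_cols = [
        c for c in out.select_dtypes(include="number").columns if c != cluster_col
    ]
    # transform() returns one fill value per row, computed within that row's cluster
    # (missing values are skipped when the statistic is computed).
    fill_values = out.groupby(cluster_col)[feature_cols].transform(strategy)
    out[feature_cols] = out[feature_cols].fillna(fill_values)
    return out


# Toy data: two clusters with missing values spread over two features.
df = pd.DataFrame(
    {
        "cluster": ["a", "a", "a", "b", "b"],
        "x": [1.0, np.nan, 3.0, 10.0, np.nan],
        "y": [2.0, 4.0, np.nan, np.nan, 8.0],
    }
)
filled = impute_by_cluster(df, "cluster", strategy="mean")
print(filled)
```

The grouping plays the role of the loop over cluster labels, and the `fillna` step plays the role of the missing value handling inside the loop.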
If you can upload and share here your data, people could certainly help you further if this explanation is not clear enough.
Hope this helps.
It’s great! it was just what I was looking for! Thank you so much!
One more thing: this method works fine, but when I have to handle the missing values of the test set, I cannot include them in this loop. I should only apply the missing value handling that was already calculated (through the Missing Value (Apply) node). How could I do that in this case?
@m1k glad it worked and thanks for your kind comments.
This second question is a bit more involved, but what you could do is determine, for each cluster and column, the value that was calculated to replace the missing values, keep these values in a table, and then use this table as a reference table for the missing values in your Test Set. This is only possible if your Test Set is also tagged with cluster labels from the beginning.
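The reference table idea can be sketched in Python with pandas as follows. This is an assumption-laden illustration rather than the KNIME workflow: the helper names `fit_cluster_fill_table` and `apply_cluster_fill_table`, the column names and the toy data are hypothetical, and the test set is assumed to already carry cluster labels.

```python
import numpy as np
import pandas as pd


def fit_cluster_fill_table(train, cluster_col, feature_cols, strategy="mean"):
    """Build the reference table: one row per cluster, one fill value per feature."""
    return train.groupby(cluster_col)[feature_cols].agg(strategy)


def apply_cluster_fill_table(test, fill_table, cluster_col):
    """Fill missing test values using only the values learned on the training set."""
    out = test.copy()
    for col in fill_table.columns:
        # Look up, for every test row, the fill value of that row's cluster.
        per_row_fill = out[cluster_col].map(fill_table[col])
        out[col] = out[col].fillna(per_row_fill)
    return out


# Hypothetical training and test sets, both tagged with cluster labels.
train = pd.DataFrame(
    {
        "cluster": [0, 0, 1, 1],
        "x": [1.0, 3.0, 10.0, np.nan],
        "y": [np.nan, 4.0, 6.0, 8.0],
    }
)
test = pd.DataFrame(
    {
        "cluster": [0, 1, 1],
        "x": [np.nan, np.nan, 5.0],
        "y": [1.0, np.nan, np.nan],
    }
)

fill_table = fit_cluster_fill_table(train, "cluster", ["x", "y"])
test_filled = apply_cluster_fill_table(test, fill_table, "cluster")
print(fill_table)
print(test_filled)
```

The test rows never contribute to the fill values, which is the point of keeping the reference table separate. A test row whose cluster label does not appear in the training set would stay missing in this sketch.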
Hope the idea described above is clear enough. If not, please just let the forum fellows know and we will try to help you from here.
Hope this helps.
@aworker's solution uses the Missing Value node, which also has an additional output port that can be applied to test data. Maybe you can leverage that output in his second proposal.
Yes indeed, it is quite visible on the snapshot I posted as an example.
Could you please post here your “workflow solution” based on the output port ?
I’m curious to see what it will look like
Sorry, I have not built one because no data was provided, and also you have already done the “heavy lifting” and provided the solution. Great job, I really enjoy reading your solutions in the forum. Kudos!
Thanks Daniel !
Unfortunately I’m pretty short of time these days but I’ll try to provide the solution on that if @m1k needs more help
That would certainly be helpful for everyone. From your solutions, I gather you have a strong background in ML?
best wishes as well
Thanks again @Daniel_Weikert !
I very much appreciate it !
… and it motivated me to eventually put in place a possible solution
20211201 Pikairos Handling missing value in clustered data.knwf (687.0 KB)
@m1k here you have a workflow which should answer your second question. I did not use a set with missing values as reference but if you replace the data set used here by yours with missing values, it should do the job. Otherwise, let us know and we will amend it.
Hope it helps
Ps: @Daniel_Weikert yes my background is ML
Interesting, thanks for sharing.
Have you tried applying the Missing Value output directly to the test data without the loop and comparing the results?
You need to loop over the table of models inside the second loop since each calculated “Missing Value” model is unique to its cluster.
Could you provide feedback about the second solution I posted? Did it solve your second question? Generating missing data distributed over several different columns is a bit cumbersome, so I would be grateful if you could share your missing data to verify on the workflow that it does the job as expected.
Thanks & regards,
Amazing! Congrats on your background!
This is a perfect solution for my problem, it works.
Imputing missing data with means or medians can be dangerous, as it can alter the correlation structure of your data. An alternative would be to use a data analysis method which handles missing data, like NIPALS-PCA, which is available in the R package pcaMethods (part of Bioconductor). This is easily implemented in an R node. NIPALS-PCA builds models purely on the available data, without imputation. The data, however, needs to be missing at random rather than in blocks.
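To make the concern about correlation structure concrete, here is a small made-up example in Python with pandas. It is only a sketch on toy numbers (the variable names and data are hypothetical): two perfectly correlated features, two values removed from one of them, and the gaps filled with the overall column mean.

```python
import numpy as np
import pandas as pd

# Two perfectly correlated toy features: y is exactly 2 * x.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_full = 2 * x

# Remove the first and last value of y, then fill the gaps with the column mean.
y_missing = y_full.copy()
y_missing.iloc[[0, 5]] = np.nan
y_imputed = y_missing.fillna(y_missing.mean())

corr_before = x.corr(y_full)      # Pearson correlation on the complete data
corr_after = x.corr(y_imputed)    # Pearson correlation after mean imputation

print(f"correlation before imputation: {corr_before:.3f}")
print(f"correlation after imputation:  {corr_after:.3f}")
```

On these numbers the correlation drops from 1 to roughly 0.53, because the imputed points sit at the mean regardless of their x value. How much this matters on real data depends on how many values are missing and on how the imputation is done.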
That is not the case here, because @m1k is calculating the mean or the median of the samples within every individual cluster. This makes sure that the samples with replaced missing values still remain within the convex hull of every cluster defined by their K-Means in the space of the descriptors. In other words, the samples still remain within the space delimited by their assigned clusters.