Handling missing values in clustered data

m1k · November 29, 2021, 5:12pm

Hi, I need some information. I have a set of clustered data with several missing values in different features. For each cluster, I want to calculate the mean/median of every feature and use it to replace the missing values of all instances in that cluster. Does anyone know a method for this?

Thank you for your answers.

aworker · November 29, 2021, 5:42pm

Hi @m1k and welcome to the KNIME forum

I guess you have a table in which every row has a column holding its cluster number or label. If this is the case, you could use a -Group Loop Start- node (closed with a generic -Loop End- node), which selects the rows of one cluster label at each iteration. Inside the loop you could use the -Missing Value- node, which can calculate the mean or the median of the non-missing values of each column within that cluster and use it as the replacement, for all columns at once.
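For readers who prefer a scripted equivalent (for example inside a Python scripting node), here is a minimal pandas sketch of the same per-cluster idea. It is not the KNIME workflow itself: the column names `cluster`, `x`, `y` and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: column names and values are assumptions for illustration only.
df = pd.DataFrame({
    "cluster": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, np.nan, 3.0, 10.0, 12.0, np.nan],
    "y": [2.0, 4.0, np.nan, np.nan, 20.0, 30.0],
})
features = ["x", "y"]

# For each cluster, compute the per-column mean of the non-missing values
# and broadcast it back to every row of that cluster.
cluster_means = df.groupby("cluster")[features].transform("mean")

# Fill only the missing cells; observed values are left untouched.
imputed = df.copy()
imputed[features] = df[features].fillna(cluster_means)

print(imputed)
```

Swapping `"mean"` for `"median"` in the `transform` call gives the per-cluster median instead.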

If this explanation is not clear enough, please upload and share your data here so that people can help you further.

Hope this helps.

Best

Ael

m1k · November 29, 2021, 5:57pm

It’s great! It was just what I was looking for! Thank you so much!

One more thing: this method works, but when I have to handle the missing values of the test set, I cannot include the test set in this loop. I should only apply the replacement values learned on the training set (through the Missing Value (Apply) node). How could I do that in this case?

aworker · November 29, 2021, 6:23pm

@m1k glad it worked and thanks for your kind comments.

This second question is a bit more involved. What you could do is determine, for each cluster and column, the replacement value that was calculated, keep these values in a table, and then use this table as a reference for filling the missing values in your test set. This is only possible if your test set is also tagged with cluster labels from the beginning.
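As a rough illustration of this reference-table idea outside KNIME, here is a minimal pandas sketch. The names `train`, `test`, `reference` and the toy values are assumptions for illustration, and it assumes the test rows already carry a cluster label.

```python
import numpy as np
import pandas as pd

features = ["x", "y"]

# Toy training and test sets (made-up values), both tagged with cluster labels.
train = pd.DataFrame({
    "cluster": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, np.nan, 3.0, 10.0, 12.0, np.nan],
    "y": [2.0, 4.0, np.nan, np.nan, 20.0, 30.0],
})
test = pd.DataFrame({
    "cluster": ["b", "a"],
    "x": [np.nan, 5.0],
    "y": [21.0, np.nan],
})

# Reference table: one row per cluster, one column per feature,
# computed from the training set only.
reference = train.groupby("cluster")[features].mean()

# Pick the reference row matching each test row's cluster label, then
# align it to the test index so that fillna matches cell by cell.
lookup = reference.reindex(test["cluster"].to_numpy())
lookup.index = test.index

test_imputed = test.copy()
test_imputed[features] = test[features].fillna(lookup)

print(test_imputed)
```

Using `reindex` rather than `.loc` means a test row with a cluster label that never appeared in training simply keeps its missing values instead of raising an error.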

I hope the idea described above is clear enough. If not, please let the forum know and we will try to help you from here.

Hope this helps.

Best

Ael

Daniel_Weikert · November 30, 2021, 4:57pm

@aworker's solution uses the Missing Value node, which also has an additional output port that can be applied to test data. Maybe you can leverage that output in his second proposal.
br

aworker · November 30, 2021, 5:07pm

Hi @Daniel_Weikert

Yes indeed, it is quite visible on the snapshot I posted as an example.

Could you please post your “workflow solution” based on the output port here?

I’m curious to see what it will look like.

Best

Ael

Daniel_Weikert · November 30, 2021, 5:16pm

Hi @aworker
Sorry, I have not built one because no data was provided, and you have already done the “heavy lifting” and provided the solution. Great job, I really enjoy reading your solutions in the forum. Kudos!
br

aworker · November 30, 2021, 5:23pm

Thanks Daniel !

Unfortunately I’m pretty short of time these days, but I’ll try to provide a solution for that if @m1k needs more help.

Best wishes

Ael

Daniel_Weikert · November 30, 2021, 6:23pm

Certainly helpful for everyone. From your solutions I gather you have a strong background in ML?
Best wishes as well

aworker · November 30, 2021, 7:29pm

Thanks again @Daniel_Weikert !
I very much appreciate it !

… and it motivated me to finally put together a possible solution:

20211201 Pikairos Handling missing value in clustered data.knwf (687.0 KB)

@m1k here is a workflow that should answer your second question. I did not use a data set with missing values as reference, but if you replace the data set used here with yours, which has missing values, it should do the job. Otherwise, let us know and we will amend it.

Hope it helps

Best

Ael

PS: @Daniel_Weikert yes, my background is ML.

Daniel_Weikert · December 1, 2021, 6:04pm

Interesting, thanks for sharing.
Have you tried applying the Missing Value output directly to the test data without the looping, and comparing the results?
br

aworker · December 1, 2021, 6:12pm

Hi Daniel,

You need to loop over the table of models inside the second loop, since each calculated “Missing Value” model is specific to its cluster.

Best

Ael

aworker · December 1, 2021, 6:23pm

Hi @m1k

Could you provide feedback about the second solution I posted? Did it solve your second question? Generating missing data distributed over several different columns is a bit cumbersome, so I would be grateful if you could share your data with missing values so that we can verify that the workflow does the job as expected.

Thanks & regards,

Ael

m1k · December 1, 2021, 6:25pm

Amazing! Congrats on your background!

This is a perfect solution for my problem, it works.

Mark_Earll · December 2, 2021, 11:51am

Imputing missing data with means or medians can be dangerous, as it can alter the correlation structure of your data. An alternative would be to use a data analysis method that handles missing data, like NIPALS-PCA, which is available in the R package pcaMethods (part of Bioconductor). This is easily implemented in an R node. NIPALS-PCA builds its models purely on the available data, without imputation. The data, however, need to be missing at random rather than in blocks.
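To make the point about correlation structure concrete, here is a small pandas sketch on made-up data (one group, two perfectly correlated columns). It only illustrates the effect of plain mean imputation; it is not NIPALS-PCA and does not use pcaMethods.

```python
import numpy as np
import pandas as pd

# Toy data (made up): y is an exact linear function of x.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_complete = 2 * x

# Knock out the first and last y values.
y_missing = y_complete.copy()
y_missing.iloc[[0, 5]] = np.nan

# Mean imputation: replace the gaps with the mean of the observed values.
y_imputed = y_missing.fillna(y_missing.mean())

r_complete = x.corr(y_complete)   # Pearson correlation on all rows
r_available = x.corr(y_missing)   # Series.corr skips rows with a missing value
r_imputed = x.corr(y_imputed)     # correlation after mean imputation

print(r_complete, r_available, r_imputed)
```

On this toy example the correlation computed from the available rows stays at 1, while after mean imputation it drops to roughly 0.53, because the two imputed points sit at the mean of y regardless of their x value.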

aworker · December 2, 2021, 1:34pm

Hi @Mark_Earll

That is not the case here, because @m1k is calculating the mean or median within each individual cluster. This ensures that samples with replaced missing values remain within the convex hull of their cluster, as defined by K-Means in the space of the descriptors. In other words, samples remain within the space delimited by their assigned clusters.

Best

Ael

system · December 9, 2021, 1:34pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.