Splitting data for clustering?

KKERROXXX · September 15, 2020, 9:13pm

Hi everyone,

I have mostly studied supervised algorithms, but there is one topic that puzzles me. In supervised learning, splitting the data into train and test sets is crucial for assessing model accuracy, but do we need to do the same in unsupervised learning? The split is there to detect and avoid overfitting and underfitting. In clustering, overfitting and underfitting can be addressed by selecting an optimal number of clusters, for example with the elbow method or the silhouette coefficient.

So, do we need to split the data into train and test sets in clustering? Some say that splitting the data into train and test sets has nothing to do with whether the learning is supervised or unsupervised. Is that correct? If so, why?

Thanks in advance,

janina · September 18, 2020, 3:09pm

Hi @KKERROXXX,

welcome to the KNIME Forum!

I think for clustering there is no need to split your data into a training set and a test set. You can’t evaluate the performance of the algorithm with your test set anyway, because you don’t know the “ground truth” for your data.
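To make that a bit more concrete, here is a minimal sketch in plain Python (not a KNIME workflow) of clustering all of the data and choosing the number of clusters with the silhouette coefficient, which needs no ground-truth labels. Everything in it is an assumption for illustration: the toy data, the helper names `kmeans` and `mean_silhouette`, and the simple farthest-point initialisation are made up for this example, and in a real project you would more likely use a library implementation.

```python
import math
import random


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with a deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from its nearest existing centre.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster ends up empty
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Average silhouette coefficient; uses only the points and their cluster labels."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other points in the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data (an assumption for this example): three well-separated 2-D blobs.
rng = random.Random(0)
blob_centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in blob_centres for _ in range(20)]

# Fit on ALL points for each candidate k; no train/test split and no labels.
scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print("silhouette per k:", scores)
print("chosen k:", best_k)
```

Note that the score is computed on the same points that were clustered. It measures how compact and well separated the clusters are, not how well they match any true classes, which is the sense in which there is no “ground truth” to test against.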

Best,
Janina
