Splitting data for clustering?

Hi everyone,

I have mostly studied supervised algorithms, but there is one topic that is puzzling me. In supervised learning, splitting the data into training and test sets is crucial for getting an honest estimate of model accuracy. Do we need the same split in unsupervised learning? The split is meant to guard against overfitting and underfitting. In clustering, it seems these issues can be handled by choosing an optimal number of clusters with, for example, the elbow method or the silhouette coefficient.

So, do we need to split the data into training and test sets for clustering? Some say that the train/test split has nothing to do with whether the method is supervised or unsupervised. Is that correct? If so, why?

Thanks in advance,

Hi @KKERROXXX,

Welcome to the KNIME Forum!

I think there is no need to split your data into a training set and a test set for clustering. You can’t evaluate the performance of the algorithm on a test set anyway, because you don’t know the “ground truth” for your data.
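To illustrate why no held-out labels are involved, here is a minimal pure-Python sketch of the silhouette coefficient mentioned in the question. It only needs the points and the cluster assignments, so it can be computed on the full data set. The toy points and the two candidate labelings are made up for illustration, and the helper name `silhouette_score` is my own (scikit-learn ships a function of the same name in `sklearn.metrics` that you would normally use instead).

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (illustrative sketch)."""
    # Group the points by their assigned cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (dist(p, p) is 0, so dividing by len(own) - 1 excludes p itself)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two visually obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]

# Two candidate clusterings of the same data (no ground truth needed).
two_clusters = [0, 0, 0, 1, 1, 1]
three_clusters = [0, 0, 1, 2, 2, 2]  # splits the first group artificially

score_two = silhouette_score(points, two_clusters)
score_three = silhouette_score(points, three_clusters)
print(f"k=2: {score_two:.3f}, k=3: {score_three:.3f}")
```

The labeling with the higher score is the one where points sit closer to their own cluster than to the nearest other cluster. This is an internal measure of cluster quality, not an accuracy against known labels.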

Best,
Janina
