Elbow Method problem: irregular pattern in sum of squared errors

Hello everyone
I am working on LDA to get some insights from an email archive, and I am using the elbow method to decide how many clusters (topics) to use in the LDA.
I got this graph with 20 iterations:

The SSE should decrease with increasing k. Why do I have some peaks in the graph?
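For context, here is roughly the calculation I understand the loop to be doing, written as a scikit-learn sketch rather than the actual KNIME nodes (doc_vectors and k_values are only illustrative names, not columns from my workflow):

```python
# Minimal sketch of the elbow-method loop, using scikit-learn as a stand-in
# for the KNIME k-Means node. doc_vectors is the document matrix after
# preprocessing (and PCA); k_values are the candidate numbers of clusters.
import numpy as np
from sklearn.cluster import KMeans

def elbow_curve(doc_vectors, k_values):
    """Within-cluster SSE for each candidate k."""
    sse = []
    for k in k_values:
        # n_init restarts k-means from several random seeds and keeps the best
        # run; a single unlucky initialization can otherwise give a *higher*
        # SSE for a larger k, which shows up as a peak in the elbow graph.
        km = KMeans(n_clusters=k, n_init=10, random_state=0)
        km.fit(doc_vectors)
        sse.append(km.inertia_)  # inertia_ = within-cluster sum of squares
    return np.array(sse)
```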

I am using the topic extraction with elbow method example from the KNIME examples directory. However, I found some imperfections in the model. When it comes to vectorising the documents, it might be useful to reduce the dimensionality of the matrix. The example uses PCA for this, but even so the dimensionality is still too big (99% was used as the minimum fraction of information to preserve).
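As I understand it, the PCA step works like the following sketch (again scikit-learn instead of the KNIME PCA node; passing a fraction to n_components is meant to mimic the "minimum fraction of information" setting):

```python
# Sketch of the dimensionality reduction with 99% retained information.
from sklearn.decomposition import PCA

# A float in (0, 1) keeps the smallest number of components whose explained
# variance reaches that fraction, analogous to the 99% setting in the example.
pca = PCA(n_components=0.99)
reduced_vectors = pca.fit_transform(doc_vectors)
print(reduced_vectors.shape)  # even at 99% the reduced matrix stays very wide
```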
My data set consists of 800 documents. After applying enrichment and preprocessing techniques, I vectorized the documents. The resulting matrix is huge.
Now, following the model proposed in the example, the sum of squared errors is calculated from the output of the k-Means node, which runs inside a loop.
However, the within-cluster sum of squared errors looks too high, and it is hard to see it approaching zero as the number of clusters k (set in the k-Means node by the loop) increases.
I suspect that either the Java code in the Java Snippet node is wrong, or the PCA dimensionality reduction does not work very well. It could also be that my dataset, like the one used in the example, is simply not a good fit for the elbow method.
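To make clear what I think the snippet should compute, here is the within-cluster SSE for a single k as a Python sketch (points, labels and centers are illustrative names for the k-Means output, not the actual columns of the KNIME table):

```python
# Within-cluster sum of squared errors for one clustering, i.e. the value
# plotted on the y-axis of the elbow graph for a given k.
import numpy as np

def within_cluster_sse(points, labels, centers):
    """Sum of squared Euclidean distances of each point to its own center."""
    diffs = points - centers[labels]   # vector from each point to its assigned center
    return float(np.sum(diffs ** 2))   # should (weakly) decrease as k increases
```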
Do you have any suggestions?
