Question on using PCA in the Elbow Method to inform the number of topics in the Topic Extractor node

Hi everyone!

I found this workflow (Learn the Elbow Method to Optimize Topic Extraction | KNIME) that uses the Elbow Method to automatically tell the Topic Extractor node how many topics to find. To do this, the authors apply PCA and then k-means. When configuring k-means, they include absolutely all columns (the word columns from the Document Vector node and the PCA dimensions identified by the PCA node). Should the selection include just the PCA dimension columns? Why did they include all columns?

Many thanks!
Best Regards,

Ricardo

Hey @rmonterosapri,

I took a look at the article you linked, and hopefully I can help by offering a second opinion on your question.

As you say, you can configure k-means to include just the PCA dimension columns if the goal is to reduce the feature space and focus on the most significant components of the data. That is probably the more common approach; however, including all columns, as they do, gives k-means access to both the original features and the reduced PCA representation. I imagine this was done because it may lead to better clustering, since the data had already been preprocessed before clustering to remove any unwanted words.
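
To make the contrast concrete, here is a minimal Python sketch of the two ways the k-means input could be configured: on the PCA dimensions only, or on the original word columns plus the PCA dimensions. This is not the KNIME workflow itself; the data, column counts, and cluster count here are made up for illustration:

```python
# Minimal sketch (synthetic data) contrasting the two k-means input choices.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
doc_vectors = rng.random((200, 500))        # stand-in for the document-vector columns (one column per word)

pca = PCA(n_components=10)
pca_dims = pca.fit_transform(doc_vectors)   # the "PCA dimension" columns

# Option 1: cluster on the PCA dimensions only (reduced feature space)
km_pca = KMeans(n_clusters=5, n_init=10, random_state=0).fit(pca_dims)

# Option 2: cluster on everything, i.e. the original word columns plus the PCA dimensions,
# which mirrors selecting all columns in the k-Means node dialog
all_cols = np.hstack([doc_vectors, pca_dims])
km_all = KMeans(n_clusters=5, n_init=10, random_state=0).fit(all_cols)
```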

To expand on that point, I think the biggest reason they included both comes down to input size, as they mention:

if the dimension of the feature space is still too large, it can be useful to apply PCA to reduce the dimensionality but keeping the loss of important information minimal

The data being worked with was probably small enough that they could simply feed everything into the clustering.
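
For completeness, here is a rough, self-contained sketch of the elbow step itself, assuming k-means is run on the PCA dimensions only; the k at the elbow would then be passed on as the number of topics for the Topic Extractor node. Again, the data and the range of k values are placeholders, not taken from the workflow:

```python
# Rough sketch of the elbow method on synthetic data.
# The k where the inertia curve flattens would be used as the topic count.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
doc_vectors = rng.random((200, 500))          # stand-in for the document vectors
pca_dims = PCA(n_components=10).fit_transform(doc_vectors)

ks = range(2, 15)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pca_dims)
    inertias.append(km.inertia_)              # within-cluster sum of squares for this k

# Plot inertia against k and look for the "elbow" where the curve flattens
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()
```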

Hope this helps,
TL


Thanks a lot, thor_landstrom. This is very helpful. Have a great day!
