Hi, I want to use TopicExtraction with the elbow method to determine the optimal number of topics.
However, the PCA node is extremely slow (to the point that I see no progress after letting it run for a full night).
I’m using the 2019 headlines from the Kaggle dataset “A million news headlines”. This subset has 34,060 headlines and (after some pruning) 19,705 terms. The headlines are in a Document column, the terms are the column names, and the table cells contain the frequency counts.
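For a sense of scale, here is a rough back-of-the-envelope estimate of the table size. It assumes every cell is held as a dense 64-bit double, and that a PCA implementation might form a dense term-by-term covariance matrix. Both are assumptions on my part; I don't know how the PCA node actually stores or processes the data.

```python
# Rough memory estimate for the document-term table described above.
# Assumption: dense storage, 8 bytes (one 64-bit double) per cell.
N_DOCS = 34_060    # headlines (rows)
N_TERMS = 19_705   # terms (columns)
BYTES_PER_DOUBLE = 8


def gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


# Full rows-by-columns table of frequency counts.
data_bytes = N_DOCS * N_TERMS * BYTES_PER_DOUBLE

# Hypothetical dense term-by-term covariance matrix.
cov_bytes = N_TERMS * N_TERMS * BYTES_PER_DOUBLE

print(f"document-term table: {gib(data_bytes):.1f} GiB")
print(f"term covariance matrix: {gib(cov_bytes):.1f} GiB")
```

Under those assumptions the table alone is roughly 5 GiB and the covariance matrix close to 3 GiB, so I suspect the size of the input is part of the problem.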
I’ve tried lowering the minimum information fraction (to 80%), but that doesn’t seem to make much difference. Would you have any suggestions for improving the node’s performance?