PCA node slow in TopicExtraction_with_the_ElbowMethod

Hi, I want to use the TopicExtraction_with_the_ElbowMethod workflow to determine the optimal number of topics.
However, the PCA node is extremely slow (to the point of not seeing any progress after running a full night).

I’m using the 2019 headlines from the Kaggle dataset “A million news headlines”. This subset has 34,060 headlines and (after some pruning) 19,705 terms. The headlines are stored in a Document column, the terms are the column names, and the table cells hold the frequency counts.

I’ve tried lowering the minimum information fraction (to 80%), but that doesn’t seem to make much difference. Would you have any suggestions to improve the node’s performance?

Thanks,
Pieter

Hi @pieterVR and welcome to the forum. I assume you are using this workflow from the Hub?

At first glance, your dataset is large enough that you might need to dedicate more memory to KNIME, especially if you only have it set to something like 4GB or so initially. You can adjust the knime.ini file as described here:

https://www.knime.com/faq#q4_2

I myself set the -Xmx value to about 12GB on my laptop with 16GB RAM.

Hopefully this will help. I’ll also see if I can do a bit of stress testing with your Kaggle dataset if I can easily get my hands on it.

Hi ScottF,

Thanks for your reply! Unfortunately, the PCA node was this slow even after I had already expanded my memory to 16GB.

The Kaggle dataset should be easy to find as “A million news headlines”.

Best regards,
Pieter

Hi @pieterVR

Welcome to the KNIME forum. As mentioned by @ScottF, PCA is an inherently memory-hungry dimensionality reduction algorithm, but that is not its only drawback. It is also computationally expensive as the number of variables and rows grows, as explained in slide 20 below:

How the PCA algorithm is implemented (for instance, whether it is optimized for special cases such as sparsity or the ratio between the number of variables and rows) can make a big difference. In addition, KNIME has its own internal memory organization for tables, which may need to be converted into matrix format before a matrix calculation such as PCA, adding extra overhead. For all these reasons, running PCA on big matrices directly with the KNIME node may not be the best option. To stay within KNIME, I would recommend using a Python node with the numpy library to speed up your PCA calculation, provided you ultimately plan to return a table with a small number of columns (dimensions).

If you need help implementing the Python PCA on your data, please post a minimal workflow containing your Kaggle data, already converted into the form on which you need to perform the PCA, and I’ll complete it with a Python solution.
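
As a rough illustration of the numpy route, here is a minimal sketch of a PCA that only computes a small number of components. It is not the KNIME node's algorithm: it swaps in a randomized truncated SVD, which is one common way to avoid a full decomposition when only a few dimensions are needed. The function name `randomized_pca` and its parameters (`n_components`, `n_oversamples`, `n_iter`, `seed`) are hypothetical names chosen for this example, and the input is assumed to be a dense numeric array of term counts (rows = documents, columns = terms).

```python
import numpy as np


def randomized_pca(X, n_components, n_oversamples=10, n_iter=4, seed=0):
    """Approximate PCA via a randomized truncated SVD (illustrative sketch).

    Returns (scores, components, explained_variance_ratio).
    All names here are assumptions made for this example.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=np.float64)
    n_rows = X.shape[0]

    # PCA works on mean-centered columns.
    Xc = X - X.mean(axis=0)

    # Sample a subspace slightly larger than requested for better accuracy.
    k = min(n_components + n_oversamples, min(Xc.shape))
    omega = rng.standard_normal((Xc.shape[1], k))
    Q, _ = np.linalg.qr(Xc @ omega)

    # Power iterations sharpen the subspace; QR keeps it numerically stable.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Q)

    # Exact SVD of the small projected matrix (k x n_columns).
    B = Q.T @ Xc
    _, s, Vt = np.linalg.svd(B, full_matrices=False)

    components = Vt[:n_components]      # principal axes, one per row
    scores = Xc @ components.T          # reduced table, one row per document

    # Fraction of total variance captured by each kept component.
    explained_variance = s[:n_components] ** 2 / (n_rows - 1)
    total_variance = (Xc ** 2).sum() / (n_rows - 1)
    return scores, components, explained_variance / total_variance
```

Unlike a minimum information fraction setting, this sketch takes the number of components up front; summing the returned ratios shows how much variance was actually retained, so `n_components` can be raised until a target such as 80% is reached. Note that a dense float64 array of 34,060 x 19,705 values takes about 5.4GB, and the centered copy needs the same again, so the memory settings discussed above still matter.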

Best regards,

Ael
