PCA node is slow compared to R Snippet / Python Script

Aswin · March 11, 2022, 11:37am

Dear Knimers,

I just wanted to mention that the PCA node is a bit slow.

I have a table of 122 rows and 1627 columns (all doubles), and I want to calculate the first 4 principal components.

Performing the PCA in a Python Script node

from sklearn.decomposition import PCA
import pandas as pd
x = input_table_1.copy()
pca = PCA(n_components=4)
principalComponents = pca.fit_transform(x)
output_table_1 = pd.DataFrame(data = principalComponents, index = x.index)

or an R Snippet node

knime.out <- prcomp(knime.in)
knime.out <- knime.out$x[,1:4]

takes 2 seconds. Using the standard KNIME PCA node takes 42 seconds, more than 20x slower.

As dimension reduction is the whole point of the PCA node, I think it should be able to handle wide tables a bit better.

Best,
Aswin

p.s. My Java heap space setting is -Xmx24576m, and I am on a 64 GB Ubuntu 20.04 PC with KNIME version 4.5.1.

Kathrin · March 14, 2022, 12:45pm

Hi @Aswin,

thank you for the feedback!

I tried to reproduce the behavior to create a ticket for our developer team, but with my example data (150 rows and 1500 columns) the time difference is much smaller (5869s vs 4945). What kind of data are you using? Is it a dataset that you could share with us?

Cheers
Kathrin

Aswin · March 14, 2022, 5:30pm

Dear @Kathrin,

please check the attached workflow.

In that workflow, I carry out a PCA on data tables of 150x1500 and 122x1627.

I do this on two types of data. The first is a normally distributed fully random table. In that case I see the KNIME PCA is 20x slower than the R/Python PCAs.

Then I thought: hmm, perhaps the sluggishness is caused by the data being perfect noise and having no correlations. To check this hypothesis I also generated tables of the same size as before, but built from just 4 random variables with a tiny bit of additional noise. In this case the 4 components should describe the data very well.

For this data, the KNIME PCA node is a lot faster (though still slower than R/Python). When I increase the standard deviation of the noise, the time the KNIME PCA node needs also goes up (quite abruptly!). Apparently, the KNIME PCA node does not handle noisy data very well.
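For anyone who wants to try this without the workflow, here is a minimal sketch of how the two kinds of test tables could be generated and checked. It is not the code from the workflow: the function names, the seed, and the noise standard deviation of 0.01 are assumptions made for illustration, and the PCA is done with a plain NumPy SVD rather than the KNIME node or sklearn.

```python
import time
import numpy as np

def make_noise_table(n_rows, n_cols, rng):
    """Fully random table: every cell is an independent standard normal draw."""
    return rng.standard_normal((n_rows, n_cols))

def make_low_rank_table(n_rows, n_cols, n_latent, noise_sd, rng):
    """Table built from n_latent random variables plus a little noise."""
    latent = rng.standard_normal((n_rows, n_latent))    # hidden variables
    loadings = rng.standard_normal((n_latent, n_cols))  # random mixing weights
    noise = noise_sd * rng.standard_normal((n_rows, n_cols))
    return latent @ loadings + noise

def pca_scores(x, n_components):
    """PCA via SVD of the column-centered table.

    Returns the scores and the fraction of variance explained
    by each of the first n_components components.
    """
    centered = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

rng = np.random.default_rng(0)  # seed chosen arbitrarily
tables = {
    "pure noise": make_noise_table(122, 1627, rng),
    "4 latent + noise": make_low_rank_table(122, 1627, 4, 0.01, rng),
}

results = {}
for name, table in tables.items():
    start = time.perf_counter()
    scores, ratio = pca_scores(table, 4)
    elapsed = time.perf_counter() - start
    results[name] = (scores.shape, ratio.sum(), elapsed)
    print(f"{name}: scores {scores.shape}, "
          f"variance explained {ratio.sum():.3f}, {elapsed:.3f} s")
```

With the structured table, the first 4 components should capture almost all of the variance, while for pure noise they capture only a small share. Raising `noise_sd` moves the second table towards the first, which is the knob to turn when looking for the point where the runtime jumps.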

Below the table of benchmark times in seconds:

Best
Aswin

p.s. Well, for now you will just have to take my word for it, because for some reason the workflow export function in KNIME stopped working.

Aswin · March 14, 2022, 9:49pm

Dear @Kathrin

I solved the “Export workflow” issue. Attached is the workflow with the PCA benchmarking examples.

Best
Aswin
KNIME_project27.knwf (157.4 KB)

Kathrin · March 15, 2022, 7:26am

Thank you @Aswin for the workflow and the detailed tests you’ve done!

I created an enhancement ticket so our developers can look into it.

Cheers
Kathrin